hudi-bot opened a new issue, #15771:
URL: https://github.com/apache/hudi/issues/15771

   To simplify Hudi adoption for more users, we should see whether we can support writing to Hudi without supplying any options during the write.
   
    
   
   For example, we can do the following with parquet writes:
   {code:java}
   df.write.format("parquet").save(path)
   {code}
    
   
   So, for a non-partitioned dataset, we should see whether we can support the same usability.
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5828
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   **shivnarayan** (21/Feb/23 23:01):
   
   As per our quick start guide, we have 5 configs that are required:
   
   1. shuffle parallelism
   2. record key
   3. partition path
   4. precombine
   5. table name
   
   1: With 0.13.0, we have already relaxed this and it is no longer a mandatory field. It wasn't mandatory even before, but with 0.13.0 the write parallelism is derived dynamically from the incoming df.
   
   2: With support for auto-generating record keys, we should be able to relax this constraint.
   
   3: We are adding support to infer the partition path from the incoming df with https://issues.apache.org/jira/browse/HUDI-5796, so that is taken care of, though some follow-up is required: for a non-partitioned dataset, we need to infer that the incoming df is non-partitioned and choose NonpartitionedKeyGenerator as the key generator class; otherwise the default key generator class is SimpleKeyGenerator, which might work without any additional fixes for a simple partition path. A minimal sketch of this inference is shown below.
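   
   A rough sketch of that decision, assuming the partition columns have already been inferred from the incoming df (the helper and its signature are hypothetical; only the two key generator class names are actual Hudi classes):
   
   {code:scala}
   // Sketch: choose a key generator class based on whether any partition
   // columns were inferred from the incoming DataFrame.
   // inferredPartitionCols is assumed to come from the HUDI-5796 inference.
   def chooseKeyGenClass(inferredPartitionCols: Seq[String]): String =
     if (inferredPartitionCols.isEmpty)
       "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
     else
       "org.apache.hudi.keygen.SimpleKeyGenerator"
   {code}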
   4: preCombine: this is already an optional field, so users don't need to supply it.
   
   5: table name: This is somewhat tricky.
   
   We can auto-generate a Hudi table name, but when hive sync is enabled we should not generate it automatically. With external metastores, no two tables can have the same name, and names should be meaningful, so we can't auto-generate them there; otherwise the table names would end up as hudi_12313, hudi_e5e44, hudi_45sadf, etc. So, here is what we can do.
   
    
   
   User flow 1:
   
   A user who uses just the spark datasource to write and read.
   
   a. Auto-generate hoodie.table.name if the user does not supply one. The auto-generated table name will get serialized into hoodie.properties. The intended usage is sketched below.
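   
   Under this proposal, the flow-1 write reduces to the bare call below (this is the intended end state once record key and table name auto-generation land, not current behavior; df and path stand for any DataFrame and base path):
   
   {code:scala}
   // Intended flow-1 usage: no hoodie.table.name supplied, so a name is
   // auto-generated and persisted to hoodie.properties on the first commit.
   df.write.format("hudi").save(path)
   {code}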
   
    
   
   User flow 2:
   
   A user who writes via spark and syncs to hive on every commit.
   
   The user does not need to supply hoodie.table.name, but is expected to set an explicit value for "hoodie.datasource.hive_sync.table". So the auto-generated table name will get serialized into hoodie.properties, but for hive sync purposes we will use whatever the user explicitly set for that config. An example write is sketched below.
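   
   For example, a flow-2 write could look like the following (the hive_sync option keys are existing Hudi datasource configs; my_db and my_table are placeholders):
   
   {code:scala}
   // Flow 2: no hoodie.table.name is set; an auto-generated name goes into
   // hoodie.properties, while hive sync registers the explicit table name.
   df.write.format("hudi")
     .option("hoodie.datasource.hive_sync.enable", "true")
     .option("hoodie.datasource.hive_sync.database", "my_db")
     .option("hoodie.datasource.hive_sync.table", "my_table")
     .save(path)
   {code}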
   
    
   
   User flow 3:
   
   Similar to flow 2.
   
   A user who writes via spark and syncs to hive in a standalone manner, not with every write.
   
   Regular writes will proceed as usual, and we will generate the Hudi table name automatically on the first write.
   
   When syncing to the external metastore, the user has to explicitly set a value for "hoodie.datasource.hive_sync.table".
   
    
   
   Note for flows 2 and 3: if the user explicitly sets a value for "hoodie.table.name", we should automatically infer "hoodie.datasource.hive_sync.table" from it. Only when the user has not explicitly set "hoodie.table.name", and the name was programmatically auto-generated, does the user have to explicitly set a value for "hoodie.datasource.hive_sync.table".
   
    
   
   Format for the auto-generated hoodie table name:
   
   hoodie_table_{ts}_{randomInt}
   
   where ts is the current timestamp, and we also generate a random integer to accommodate concurrent writers.
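   
   A minimal sketch of that scheme, assuming nothing beyond the format above (the bound on the random integer is an arbitrary choice):
   
   {code:scala}
   import scala.util.Random
   
   // Sketch: auto-generate a table name as hoodie_table_{ts}_{randomInt}.
   // The timestamp keeps names roughly ordered; the random suffix avoids
   // collisions between concurrent writers starting in the same millisecond.
   def autoGenerateTableName(): String = {
     val ts = System.currentTimeMillis()
     val randomInt = Random.nextInt(100000) // bound is an arbitrary choice
     s"hoodie_table_${ts}_${randomInt}"
   }
   {code}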
   
    
   
    
   
   Summary:
   
   So, putting all of this together, here is where we will stand:
   
   {code:scala}
   df.write.format("hudi")
     .option("hoodie.datasource.write.recordkey.autogen", "true")
     .save(path)
   {code}
   
   The option is still needed because the default value for "hoodie.datasource.write.recordkey.autogen" is false.
   
    
   
   Special handling:
   
   We could simplify even further if need be. We can detect that the user has not provided any configs (zero user-supplied configs) and, in such cases, choose true as the default value for "hoodie.datasource.write.recordkey.autogen" and proceed instead of failing. This is somewhat analogous to how we might set the default key generator type to Simple or NonPartitioned. A sketch of the fallback follows.
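   
   A rough sketch of that fallback, assuming the writer can see the map of user-supplied options (the helper and its signature are hypothetical; the config key is the one proposed above):
   
   {code:scala}
   // Sketch: if the user supplied no options at all, turn record key
   // auto-generation on instead of failing the write.
   def resolveRecordKeyAutogen(userOptions: Map[String, String]): Boolean =
     if (userOptions.isEmpty) true // zero-config write: opt in automatically
     else userOptions
       .getOrElse("hoodie.datasource.write.recordkey.autogen", "false")
       .toBoolean
   {code}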
   
   ---
   
   **codope** (22/Feb/23 02:45):
   
   Regarding point 5 (table name), can we infer the table name from the base path passed to df.save()? There cannot be two tables at the same base path.
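   
   That inference could be as simple as taking the last path segment, for example (a sketch; real handling would need to deal with schemes, trailing slashes, and invalid characters):
   
   {code:scala}
   // Sketch: derive a table name from the last segment of the base path,
   // e.g. "s3://bucket/warehouse/trips/" -> "trips".
   def tableNameFromBasePath(basePath: String): String =
     basePath.stripSuffix("/").split("/").last
   {code}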
   
   ---
   
   **kazdy** (22/Feb/23 12:00):
   
   Regarding point 5, in spark there are two APIs worth taking a look at:
   df.write.format("hudi").saveAsTable(tableName)
   df.writeStream.format("hudi").toTable(tableName)
   When using these, spark creates the table in the metastore as well.
   Wouldn't it be cleaner to start pointing users at saveAsTable() and, for the save() API, require users to provide a table name?
   As a Hudi user, what Sagar said is also compelling :)
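   
   For instance, the batch variant would look like this (saveAsTable is the standard Spark DataFrameWriter API; the database, table, and record key field names are placeholders):
   
   {code:scala}
   // Sketch: register the Hudi table in the metastore via saveAsTable, so
   // the table name is required by the API itself rather than by a
   // Hudi-specific option.
   df.write.format("hudi")
     .option("hoodie.datasource.write.recordkey.field", "uuid") // placeholder
     .saveAsTable("my_db.my_hudi_table")
   {code}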
   
   ---
   
   **shivnarayan** (22/Feb/23 22:36):
   
   Looks like having tableName as a mandatory field should be OK. For example, even with databases, the table name is a mandatory field. So, instead of complicating things (inferring the hive sync table name in one flow and not in another), the better option is to ask users to set the table name at a minimum.
   
   [~kazdy]: thanks for the pointers. We are taking a look at registering as a table as well.

