hudi-bot opened a new issue, #15771:
URL: https://github.com/apache/hudi/issues/15771
To simplify the use of Hudi for more users, we should see whether we can support writing to Hudi without any options during the write. For example, the following works for Parquet writes:

```scala
df.write.format("parquet").save(path)
```

So, at least for a non-partitioned dataset, we should try to support this usability.
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5828
- Type: Improvement
---
## Comments
**shivnarayan** (21/Feb/23 23:01):

As per our quick start guide, we have 5 configs that are required:
1. shuffle parallelism
2. record key
3. partition path
4. precombine field
5. table name
1: With 0.13.0 this has already been relaxed and is no longer a mandatory field. It wasn't strictly mandatory even before, but as of 0.13.0 the parallelism is derived dynamically from the incoming DataFrame.
2: With support for auto-generation of record keys, we should be able to relax this constraint.
3: We are adding support to infer the partitioning from the incoming DataFrame with https://issues.apache.org/jira/browse/HUDI-5796, so that's taken care of, though some follow-up is required: for a non-partitioned dataset, we need to detect that the incoming DataFrame is non-partitioned and choose NonpartitionedKeyGenerator as the key generator class. Otherwise the default key generator class is SimpleKeyGenerator, which might work without any additional fixes for a simple partition path.
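The selection rule in point 3 could be sketched as follows. This is a hypothetical helper, not Hudi's actual implementation; the key generator class names are real Hudi classes, but the inference input is an assumption:

```python
# Illustrative sketch: pick a key generator class based on whether any
# partition columns were inferred from the incoming DataFrame. The helper
# and its input are assumptions for illustration, not Hudi code.

SIMPLE_KEY_GEN = "org.apache.hudi.keygen.SimpleKeyGenerator"
NON_PARTITIONED_KEY_GEN = "org.apache.hudi.keygen.NonpartitionedKeyGenerator"

def choose_key_generator(inferred_partition_cols):
    """Return the key generator class for the inferred partition columns."""
    if not inferred_partition_cols:
        # No partition columns inferred -> treat the table as non-partitioned.
        return NON_PARTITIONED_KEY_GEN
    # One or more simple partition columns -> the default generator suffices.
    return SIMPLE_KEY_GEN
```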
4: preCombine: this is already an optional field, so users don't need to supply it.
5: table name: This one is somewhat tricky.
We can auto-generate a Hudi table name, but when Hive sync is enabled we should not generate it automatically: in an external metastore, no two tables can have the same name, and names should be meaningful, so we can't auto-generate them. Otherwise we would end up with table names like hudi_12313, hudi_e5e44, hudi_45sadf, etc. So, here is what we can do.
User flow 1: a user who only writes and reads via the Spark datasource.
a. Auto-generate hoodie.table.name if the user does not supply one. The auto-generated table name gets serialized into hoodie.properties.
User flow 2: a user who writes via Spark and syncs to Hive on every commit.
The user does not need to supply hoodie.table.name, but is expected to set an explicit value for "hoodie.datasource.hive_sync.table". The auto-generated table name still gets serialized into hoodie.properties, but for Hive sync purposes we use whatever the user explicitly set for that config.
User flow 3: similar to flow 2, but the user writes via Spark and syncs to Hive in a standalone manner rather than with every write.
Regular writes proceed as usual, with the Hudi table name generated automatically on the first write. When syncing to the external metastore, the user has to explicitly set a value for "hoodie.datasource.hive_sync.table".
Note, for flows 2 and 3: if the user explicitly sets a value for "hoodie.table.name", we should automatically infer "hoodie.datasource.hive_sync.table" from it. Only if the user has not explicitly set "hoodie.table.name", so that the name was auto-generated, does the user have to explicitly set a value for "hoodie.datasource.hive_sync.table".
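The note above amounts to a small resolution rule. A minimal sketch, assuming a hypothetical helper (the config keys are from the discussion; the function itself is not Hudi code):

```python
# Hypothetical sketch of the hive-sync table name rule for flows 2 and 3:
# an explicitly set "hoodie.table.name" can be reused for hive sync, but an
# auto-generated one cannot, so the sync config becomes required.

TABLE_NAME = "hoodie.table.name"
HIVE_SYNC_TABLE = "hoodie.datasource.hive_sync.table"

def resolve_hive_sync_table(options, table_name_was_autogenerated):
    """Return the table name to use for hive sync, or raise if ambiguous."""
    if HIVE_SYNC_TABLE in options:
        return options[HIVE_SYNC_TABLE]
    if not table_name_was_autogenerated and TABLE_NAME in options:
        # User chose the table name explicitly; infer the sync name from it.
        return options[TABLE_NAME]
    raise ValueError(
        f"{HIVE_SYNC_TABLE} must be set explicitly when {TABLE_NAME} "
        "was auto-generated"
    )
```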
Format for the auto-generated Hudi table name:
`hoodie_table_{ts}_{randomInt}`
where ts is the current timestamp, and a random integer is appended to accommodate any concurrent writer.
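A minimal sketch of that name format, assuming a timestamp in milliseconds and a small random suffix (both the helper name and the exact timestamp granularity are assumptions, not Hudi's implementation):

```python
# Illustrative sketch of the proposed auto-generated table name format,
# hoodie_table_{ts}_{randomInt}. The timestamp granularity and the range
# of the random suffix are assumptions for illustration.
import random
import time

def auto_generate_table_name(rng=random):
    ts = int(time.time() * 1000)        # current timestamp (assumed millis)
    random_int = rng.randint(0, 99999)  # random suffix for concurrent writers
    return f"hoodie_table_{ts}_{random_int}"
```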
Summary:
Putting all of this together, here is where we would stand:

```scala
df.write.format("hudi").option("hoodie.datasource.write.recordkey.autogen", "true").save(path)
```

because the default value for "hoodie.datasource.write.recordkey.autogen" is false.
Special handling:
We could simplify even further if need be. We can detect that the user has supplied zero configs, and in that case choose true as the default value for "hoodie.datasource.write.recordkey.autogen" and proceed instead of failing. This is analogous to how we might default the key generator type to Simple or NonPartitioned.
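That special handling could be sketched as a small resolution step. The option key is from the discussion; the resolution function is a hypothetical illustration, not Hudi's actual config machinery:

```python
# Hypothetical sketch of the "zero user-supplied configs" special handling:
# if the user passed no write options at all, default record key
# auto-generation to true instead of failing.

RECORDKEY_AUTOGEN = "hoodie.datasource.write.recordkey.autogen"

def resolve_write_options(user_options):
    """Return the effective write options for a Hudi write."""
    resolved = dict(user_options)
    if not user_options:
        # No user-supplied configs at all: opt into key auto-generation.
        resolved[RECORDKEY_AUTOGEN] = "true"
    else:
        # Otherwise keep the documented default of false unless overridden.
        resolved.setdefault(RECORDKEY_AUTOGEN, "false")
    return resolved
```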
---
**codope** (22/Feb/23 02:45):

Regarding point 5 (table name), can we infer the table name from the base path passed to df.save()? There cannot be two tables at the same base path.
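codope's suggestion could look roughly like this: take the last segment of the base path as the table name. A sketch only, assuming a hypothetical helper rather than an existing Hudi API:

```python
# Illustrative sketch: infer a table name from the last segment of the
# base path passed to df.save(). The helper is an assumption.
from urllib.parse import urlparse

def infer_table_name(base_path):
    """Infer a table name from the last path segment of the base path."""
    path = urlparse(base_path).path  # strips schemes like s3:// or hdfs://
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError(f"cannot infer a table name from {base_path!r}")
    return segments[-1]
```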
---
**kazdy** (22/Feb/23 12:00):

Regarding point 5, there are two Spark APIs worth taking a look at:
`df.write.format("hudi").saveAsTable(tableName)` for batch, and
`df.writeStream.format("hudi").toTable(tableName)` for streaming.
When using these, Spark creates the table in the metastore as well.
Wouldn't it be cleaner to start pointing users at saveAsTable(), and for the save() API require users to provide a table name?
As a Hudi user, what Sagar said is also compelling :)
---
**shivnarayan** (22/Feb/23 22:36):

Looks like having the table name as a mandatory field should be OK. For example, even with databases, the table name is a mandatory field. So instead of complicating things (inferring the Hive sync table name in one flow but not another), the better option would be to ask users to set the table name at a minimum.

[~kazdy]: thanks for the pointers. We are taking a look at registering as a table as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.