[
https://issues.apache.org/jira/browse/HUDI-5828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691819#comment-17691819
]
sivabalan narayanan edited comment on HUDI-5828 at 2/21/23 11:07 PM:
---------------------------------------------------------------------
As per our quick start guide, we have 5 configs that are required (a minimal
example follows this list):
1. shuffle parallelism
2. record key
3. partition path
4. precombine
5. table name
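For reference, a minimal quick start style write covering all 5 configs looks
roughly like the sketch below (field names, table name, and paths are
illustrative only, not from this ticket):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-quickstart").getOrCreate()
val df = spark.read.json("/tmp/source") // any incoming df

df.write.format("hudi").
  option("hoodie.upsert.shuffle.parallelism", "200").              // 1: shuffle parallelism
  option("hoodie.datasource.write.recordkey.field", "uuid").       // 2: record key
  option("hoodie.datasource.write.partitionpath.field", "region"). // 3: partition path
  option("hoodie.datasource.write.precombine.field", "ts").        // 4: precombine
  option("hoodie.table.name", "trips").                            // 5: table name
  mode("overwrite").
  save("/tmp/hudi_trips")
{code}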
1: With 0.13.0, we have already relaxed this and it is no longer a mandatory
field. It wasn't strictly mandatory even before, but with 0.13.0 the
parallelism is dynamically derived from the incoming df.
2: With support for auto generation of record keys, we should be able to relax
this constraint.
3: We are adding support to infer the partition path from the incoming df with
https://issues.apache.org/jira/browse/HUDI-5796, so that's taken care of. Some
follow up is still required though: for a non-partitioned table, we need to
infer that the incoming df is non-partitioned and choose
NonpartitionedKeyGenerator as the key gen class; otherwise the default key gen
class is SimpleKeyGenerator. This might work w/o any additional fixes for a
simple partition path. A sketch of the inference follows.
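A minimal sketch of that inference, with a hypothetical helper (the decision
only, not the actual datasource code path):
{code:scala}
// Hypothetical helper for point 3: pick the key generator class based on whether a
// partition path field could be inferred from the incoming df.
object KeyGenInference {
  def inferKeyGenClass(inferredPartitionPathField: Option[String]): String =
    inferredPartitionPathField match {
      case None | Some("") => "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
      case Some(_)         => "org.apache.hudi.keygen.SimpleKeyGenerator" // current default
    }
}
{code}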
4: preCombine: this is already an optional field, so users don't need to
supply it.
5: table name: This is somewhat tricky.
We can auto generate a hudi table name, but when hive sync is enabled, we
should not generate it automatically. With external metastores, no two tables
can have the same name and names should be meaningful, so we can't auto
generate in that case; otherwise the table names would look like hudi_12313,
hudi_e5e44, hudi_45sadf, etc. So, here is what we can do.
User flow 1:
For a user who uses just the spark datasource to write and read.
Auto generate hoodie.table.name if the user does not supply one. The auto
generated table name will get serialized into hoodie.properties. A sketch
follows.
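A sketch of flow 1 under the proposed behavior, reusing df from the earlier
sketch; note that "hoodie.datasource.write.recordkey.autogen" is the config
name proposed in this ticket, not an existing option:
{code:scala}
// Flow 1 sketch (proposed behavior): no hoodie.table.name supplied; a name would be
// auto generated and serialized into hoodie.properties on the first write.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.autogen", "true"). // proposed in this ticket
  mode("overwrite").
  save("/tmp/hudi_trips_flow1")
{code}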
User flow 2:
A user who writes via spark and syncs to hive on every commit.
The user does not need to supply hoodie.table.name, but is expected to set an
explicit value for "hoodie.datasource.hive_sync.table". So the auto generated
table name will get serialized into hoodie.properties, but for hive sync
purposes we will choose what the user explicitly set for the corresponding
config. A sketch follows.
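A sketch of flow 2 under the same assumptions:
{code:scala}
// Flow 2 sketch (proposed behavior): hoodie.table.name is omitted and auto generated,
// but the hive sync table name is set explicitly so the metastore entry is meaningful.
df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.autogen", "true"). // proposed in this ticket
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.table", "trips").        // explicit, user supplied
  mode("append").
  save("/tmp/hudi_trips_flow2")
{code}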
User flow 3:
Similar to flow 2.
The user writes via spark and syncs to hive in a standalone manner, not with
every write.
Regular writes will proceed as usual, where we will generate the hudi table
name automatically on the first write.
When syncing to the external metastore, the user has to explicitly set a value
for "hoodie.datasource.hive_sync.table".
Note: For cases 2 and 3:
If the user explicitly sets a value for "hoodie.table.name", we should
automatically infer "hoodie.datasource.hive_sync.table" from it. Only if the
user has not explicitly set "hoodie.table.name", and the name was
programmatically auto generated, does the user have to explicitly set a value
for "hoodie.datasource.hive_sync.table". A sketch of this resolution follows.
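A minimal sketch of that resolution order (helper and parameter names are made
up):
{code:scala}
// Hypothetical helper for the note above: resolve the table name used for hive sync.
def resolveHiveSyncTable(
    explicitTableName: Option[String],    // user-set hoodie.table.name, if any
    explicitHiveSyncTable: Option[String] // user-set hoodie.datasource.hive_sync.table, if any
): String =
  explicitHiveSyncTable
    .orElse(explicitTableName) // an explicitly set table name can be inferred for hive sync
    .getOrElse(throw new IllegalArgumentException(
      "hoodie.datasource.hive_sync.table must be set when hoodie.table.name is auto generated"))
{code}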
Format for the auto generated hoodie table name:
hoodie_table_{ts}_{randomInt}
where ts is the current timestamp, and we also append a random integer to
accommodate any concurrent writers. A sketch follows.
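A sketch of that generation, assuming System.currentTimeMillis() as the
timestamp source:
{code:scala}
import scala.util.Random

// Sketch of the proposed format hoodie_table_{ts}_{randomInt}; the random suffix keeps
// two concurrent writers from generating the same name at the same timestamp.
def autoGenerateTableName(): String =
  s"hoodie_table_${System.currentTimeMillis()}_${Random.nextInt(Int.MaxValue)}"
{code}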
Summary:
So, putting all of these together, here is where we will stand:
df.write.format("hudi").option("hoodie.datasource.write.recordkey.autogen","true").save(path)
Special handling:
We could simplify even further if need be.
We can detect that the user has not provided any configs (0 user supplied
configs) and, in such cases, choose the default value of
"hoodie.datasource.write.recordkey.autogen" as true and proceed instead of
failing. This is somewhat analogous to how we might set the default key gen
type to Simple or NonPartitioned. A sketch follows.
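A minimal sketch of this special handling over a plain options map (helper name
is made up):
{code:scala}
// Hypothetical helper: if the user supplied zero Hudi options, default the proposed
// record key autogen flag to true instead of failing the write.
def withSpecialHandling(userOpts: Map[String, String]): Map[String, String] =
  if (userOpts.isEmpty)
    Map("hoodie.datasource.write.recordkey.autogen" -> "true") // proposed in this ticket
  else
    userOpts
{code}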
> Support df.write.format("hudi") without any additional options
> --------------------------------------------------------------
>
> Key: HUDI-5828
> URL: https://issues.apache.org/jira/browse/HUDI-5828
> Project: Apache Hudi
> Issue Type: Improvement
> Components: writer-core
> Reporter: sivabalan narayanan
> Priority: Major
>
> Wrt simplifying the usage of hudi for more users, we should try to see if
> we can support writing to hudi w/o any options during the write.
>
> For example, we can do the following with parquet writes.
> {code:java}
> df.write.format("parquet").save(path)
> {code}
>
> So, for a non-partitioned dataset, we should see if we can support this
> usage.
>