[
https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357513#comment-17357513
]
manasa commented on HUDI-1827:
------------------------------
[~nishith29] Please find the implemented logic below:
To begin with, I have also pulled in the files that were part of the PR for the
ORC reader and writer implementation (as it contains the ORC-to-Avro conversion
utility methods):
https://github.com/apache/hudi/pull/2999/files
SparkBootstrapCommitActionExecutor is tightly coupled with the parquet file
format, so I modified it to handle both parquet and orc files, switching
between the orc and parquet logic based on the file extension.
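The extension-based switch described above can be sketched roughly as follows. This is a minimal standalone sketch, and baseFormatOf is a hypothetical helper name for illustration, not the actual method in the executor:

```scala
// Minimal sketch: dispatch on the base-file extension, assuming parquet
// and orc are the only two formats to handle. baseFormatOf is a
// hypothetical name, not the real Hudi code.
def baseFormatOf(fileName: String): String =
  if (fileName.endsWith(".parquet")) "PARQUET"
  else if (fileName.endsWith(".orc")) "ORC"
  else throw new IllegalArgumentException(s"Unsupported base file: $fileName")
```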
SparkBootstrapCommitActionExecutor has an internal dependency on
HoodieSparkBootstrapSchemaProvider (which converts the parquet schema to Avro);
that class is also tightly coupled with parquet, so I modified
HoodieSparkBootstrapSchemaProvider to support both orc and parquet with the
help of AvroOrcUtils.createAvroSchema (part of the already raised PR for the
ORC reader and writer implementation).
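Roughly, the schema-provider change amounts to picking the conversion path from the extension. In the sketch below, the two function parameters stand in for the real parquet-to-Avro path and AvroOrcUtils.createAvroSchema; the names are illustrative only:

```scala
// Sketch of the dispatch inside the schema provider: fromParquet and
// fromOrc are stand-ins for the real parquet and ORC schema readers.
def avroSchemaFor(path: String,
                  fromParquet: String => String,
                  fromOrc: String => String): String =
  if (path.endsWith(".parquet")) fromParquet(path)
  else if (path.endsWith(".orc")) fromOrc(path)
  else sys.error(s"No schema reader for: $path")
```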
I tested the bootstrap functionality as done in the TestBootstrap class below:
[https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/TestBootstrap.java]
However, while testing the same functionality using a dataframe in spark-shell,
I noticed that BootstrapUtils' getAllLeafFoldersWithFiles method is also
tightly coupled with parquet, since the default base file format is parquet:
final String baseFileExtension =
    metaClient.getTableConfig().getBaseFileFormat().getFileExtension();
public static final String HOODIE_BASE_FILE_FORMAT_PROP_NAME =
    "hoodie.table.base.file.format";
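In other words, the base-file extension is resolved from the table config and silently defaults to parquet. A sketch of that resolution, using plain java.util.Properties rather than the actual Hudi config classes (the helper below is hypothetical, not the real API):

```scala
import java.util.Properties

// Sketch: resolve the base-file extension from the table config property,
// defaulting to parquet as the current code does. Not the real Hudi API.
def baseFileExtension(props: Properties): String =
  props.getProperty("hoodie.table.base.file.format", "PARQUET") match {
    case "ORC" => ".orc"
    case _     => ".parquet"
  }
```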
So I wanted to check the best way to pass the above base file format for
dataframe operations in spark-shell. I currently see only the configuration
below for the bootstrap functionality:
val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
    DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS,
    classOf[SimpleKeyGenerator].getName)
  .mode(SaveMode.Overwrite)
  .save(basePath)
Also, in the PR below it looks like the parameter (TABLE_BASE_FILE_FORMAT) has
been added to HoodieWriteConfig as well, but I do not see any examples of its
usage:
[https://github.com/apache/hudi/pull/1512/files]
> Add ORC support in Bootstrap Op
> -------------------------------
>
> Key: HUDI-1827
> URL: https://issues.apache.org/jira/browse/HUDI-1827
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Storage Management
> Reporter: Teresa Kang
> Assignee: manasa
> Priority: Major
>
> SparkBootstrapCommitActionExecutor assumes parquet format right now, need to
> support ORC as well.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)