[ 
https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357513#comment-17357513
 ] 

manasa commented on HUDI-1827:
------------------------------

[~nishith29] Please find the implemented logic below:
To begin with, I have also pulled in the files that were part of the ORC reader 
and writer implementation PR (as it contains the ORC-to-Avro conversion utility 
methods):
https://github.com/apache/hudi/pull/2999/files

SparkBootstrapCommitActionExecutor is tightly coupled to the Parquet file 
format, so I modified it to handle both Parquet and ORC files, switching the 
logic between ORC and Parquet based on the file extension.

SparkBootstrapCommitActionExecutor has an internal dependency on 
HoodieSparkBootstrapSchemaProvider (which converts the Parquet schema to Avro), 
and that class is also tightly coupled to the Parquet format.
Modified HoodieSparkBootstrapSchemaProvider to support both ORC and Parquet 
with the help of AvroOrcUtils.createAvroSchema (part of the already raised ORC 
reader and writer implementation PR).
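The extension-based switching in both places can be sketched roughly as below. Note this is a simplified stand-in, not the actual Hudi code: the enum and helper names here are illustrative only, and in the real executor/schema provider the ORC branch would call into the ORC-specific readers (e.g. AvroOrcUtils.createAvroSchema) instead of returning a string.

```java
// Illustrative sketch of dispatching on the bootstrap source file's extension.
// BaseFileFormat and describe() are hypothetical names, not real Hudi classes.
public class BaseFileFormatDispatch {

  enum BaseFileFormat { PARQUET, ORC }

  // Pick the format from the source file's extension.
  static BaseFileFormat fromPath(String filePath) {
    if (filePath.endsWith(".parquet")) {
      return BaseFileFormat.PARQUET;
    } else if (filePath.endsWith(".orc")) {
      return BaseFileFormat.ORC;
    }
    throw new IllegalArgumentException("Unsupported base file: " + filePath);
  }

  // In the real code this branch would select the Parquet- or ORC-specific
  // schema/record reading path rather than return a label.
  static String describe(String filePath) {
    switch (fromPath(filePath)) {
      case ORC:
        return "orc";
      case PARQUET:
      default:
        return "parquet";
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("/data/part-0001.orc"));
    System.out.println(describe("/data/part-0001.parquet"));
  }
}
```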

I tested the bootstrap functionality as done in the TestBootstrap class below:
[https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/client/TestBootstrap.java]

 

However, while testing the same functionality using a DataFrame in spark-shell, 
I noticed that BootstrapUtils' getAllLeafFoldersWithFiles method is also 
tightly coupled to Parquet, since the default base file format is Parquet:

final String baseFileExtension =
    metaClient.getTableConfig().getBaseFileFormat().getFileExtension();

public static final String HOODIE_BASE_FILE_FORMAT_PROP_NAME =
    "hoodie.table.base.file.format";

So I wanted to check the best way to pass the above base file format for 
DataFrame operations in spark-shell.

I currently see only the configuration below for the bootstrap functionality:

val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "datestr")
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, classOf[SimpleKeyGenerator].getName)
  .mode(SaveMode.Overwrite)
  .save(basePath)
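One possibility I am considering is passing the table config property directly as a write option, so that getAllLeafFoldersWithFiles resolves the .orc extension instead of .parquet. This is only a sketch: I have not verified that the DataSource write path forwards this property to the table config, and the exact accepted value string ("ORC") needs confirming.

```scala
// Hypothetical: same bootstrap options as above, plus the base file format
// property (HOODIE_BASE_FILE_FORMAT_PROP_NAME). Unverified whether the
// DataSource path honors it.
bootstrapDF.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
  .option("hoodie.table.base.file.format", "ORC") // exact value string to be confirmed
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, srcPath)
  .mode(SaveMode.Overwrite)
  .save(basePath)
```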

Also, in the PR below it looks like the parameter (TABLE_BASE_FILE_FORMAT) has 
been added to HoodieWriteConfig as well, but I do not see any examples of its 
usage:

[https://github.com/apache/hudi/pull/1512/files]

 

> Add ORC support in Bootstrap Op
> -------------------------------
>
>                 Key: HUDI-1827
>                 URL: https://issues.apache.org/jira/browse/HUDI-1827
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Storage Management
>            Reporter: Teresa Kang
>            Assignee: manasa
>            Priority: Major
>
> SparkBootstrapCommitActionExecutor assumes parquet format right now, need to 
> support ORC as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)