[ 
https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenning Ding updated HUDI-1376:
-------------------------------
    Description: 
When a Hudi table is updated through the Spark datasource, the schema of the 
input dataframe is used as the schema stored in the commit files. As a result, 
when the upserted rows contain Hudi metadata columns, the commit file also 
stores the schema of those metadata columns, which is unnecessary in common 
cases. It also causes an issue for bootstrap tables.

Since metadata columns are not used during the Spark datasource writing 
process, we can drop those columns at the beginning of the write.
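
A minimal sketch of the idea, not the actual patch: it filters the metadata 
column names out of an input column list before the write schema is derived. 
The helper name drop_meta_columns and the sample column list are hypothetical; 
the five _hoodie_* names are assumed to be the metadata fields Hudi adds to 
stored records.

```python
# Assumption: the metadata fields Hudi prepends to each stored record.
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def drop_meta_columns(columns):
    """Return the column names in their original order, minus Hudi metadata columns.

    Hypothetical helper for illustration; columns that are not metadata
    columns pass through unchanged.
    """
    meta = set(HOODIE_META_COLUMNS)
    return [c for c in columns if c not in meta]


# Hypothetical input: rows read back from a Hudi table and upserted again,
# so the dataframe still carries some metadata columns.
input_columns = ["_hoodie_commit_time", "_hoodie_record_key", "id", "name", "ts"]

# Only the data columns remain to form the schema recorded in the commit file.
write_columns = drop_meta_columns(input_columns)
```

On a PySpark DataFrame the equivalent step would presumably be 
df.drop(*HOODIE_META_COLUMNS), since DataFrame.drop ignores names that are not 
present in the dataframe.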

  was:
When updating a Hudi table through Spark datasource, it will use the schema of 
the input dataframe as the schema stored in the commit files. Thus, when 
upserted with rows containing metadata columns, the upsert commit file will 
store the metadata columns schema in the commit file which is unnecessary for 
common cases. And also this will bring an issue for bootstrap table.

Since metadata columns are not used during the Spark datasource writing 
process, we can drop those columns in the beginning of Spark datasource.


> Drop Hudi metadata columns before Spark datasource writing 
> -----------------------------------------------------------
>
>                 Key: HUDI-1376
>                 URL: https://issues.apache.org/jira/browse/HUDI-1376
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Wenning Ding
>            Assignee: Wenning Ding
>            Priority: Major
>              Labels: pull-request-available
>
> When a Hudi table is updated through the Spark datasource, the schema of the 
> input dataframe is used as the schema stored in the commit files. As a result, 
> when the upserted rows contain Hudi metadata columns, the commit file also 
> stores the schema of those metadata columns, which is unnecessary in common 
> cases. It also causes an issue for bootstrap tables.
> Since metadata columns are not used during the Spark datasource writing 
> process, we can drop those columns at the beginning of the write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
