[
https://issues.apache.org/jira/browse/HUDI-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647735#comment-17647735
]
Alexey Kudinkin edited comment on HUDI-5392 at 1/24/23 8:13 AM:
----------------------------------------------------------------
Another contributing issue is that when reading a Bootstrap file we don't specify
the expected schema, and therefore records from the Bootstrap file are read in
the schema decoded from the Parquet file. This is problematic because when we
validate the Avro schemas, their corresponding names are checked, and this creates
mismatches since Parquet schemas don't carry names/namespaces (of the structs).
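A minimal sketch of the name mismatch described above, using plain Python dicts in Avro's JSON schema form. The record names, the namespace, and the `strip_record_names` helper are hypothetical and chosen only for illustration; this is not Hudi's validation code, and the remedy discussed in this comment is to pass the expected schema to the reader rather than to strip names.

```python
# Hypothetical table schema: the record carries a name and a namespace.
table_schema = {
    "type": "record",
    "name": "trip_record",
    "namespace": "hoodie.trips",
    "fields": [
        {"name": "amount", "type": ["null", "double"]},
        {"name": "currency", "type": ["null", "string"]},
    ],
}

# Same fields, as if derived from a Parquet file: only a placeholder record
# name and no namespace (assumption made for this illustration).
file_schema = {
    "type": "record",
    "name": "schema",
    "fields": [
        {"name": "amount", "type": ["null", "double"]},
        {"name": "currency", "type": ["null", "string"]},
    ],
}


def strip_record_names(schema):
    """Return a copy of the schema without record names and namespaces."""
    if isinstance(schema, list):  # union: process every branch
        return [strip_record_names(branch) for branch in schema]
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            # Field names are kept; only the record's own identity is dropped.
            return {
                "type": "record",
                "fields": [
                    {"name": f["name"], "type": strip_record_names(f["type"])}
                    for f in schema["fields"]
                ],
            }
        if kind == "array":
            return {"type": "array", "items": strip_record_names(schema["items"])}
        if kind == "map":
            return {"type": "map", "values": strip_record_names(schema["values"])}
    return schema  # primitive type name or anything not handled above


# A name-sensitive comparison reports a mismatch ...
names_match = table_schema == file_schema
# ... while the structures are identical once names are ignored.
structure_matches = strip_record_names(table_schema) == strip_record_names(file_schema)
```

Here `names_match` is false while `structure_matches` is true, which is the kind of spurious mismatch a name-checking validation would report.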
was (Author: alexey.kudinkin):
Another contributing issue is that when reading a Bootstrap file we don't specify
the expected schema, and therefore records from the Bootstrap file are read in
the schema decoded from the file's Parquet schema. This is problematic because when
we validate the Avro schemas, their corresponding names are checked, and this
creates mismatches since Parquet schemas don't carry names/namespaces (of the
structs).
> Fix Bootstrap files reader to configure arrays to be read in the new format
> ---------------------------------------------------------------------------
>
> Key: HUDI-5392
> URL: https://issues.apache.org/jira/browse/HUDI-5392
> Project: Apache Hudi
> Issue Type: Bug
> Components: bootstrap
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.13.0
>
>
> When writing a Bootstrap file, we use the Spark writer, which writes arrays in
> the new format, while Hudi reads them in the old (Avro-compatible) format:
> {code:java}
> // Old
> optional group tip_history (LIST) {
>   repeated group array {
>     optional double amount;
>     optional binary currency (UTF8);
>   }
> }
> // New
> optional group tip_history (LIST) {
>   repeated group list {
>     optional group element {
>       optional double amount;
>       optional binary currency (UTF8);
>     }
>   }
> }
> {code}
>
> To fix that, we need to make sure that Bootstrap files are *always* read in the
> new format (the Spark default), unlike Hudi's Parquet files.
> We also need to fix TestDataSourceForBootstrap, as it currently doesn't
> actually assert that the records are written correctly.
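The two array layouts quoted in the description can be told apart structurally. The sketch below models both as plain Python dicts and resolves the element fields for either one. The dict representation and the `list_layout` and `element_fields` helpers are hypothetical, and the detection rule is a simplification of Parquet's backward-compatibility rules for LIST, not the logic used by Hudi or parquet-avro.

```python
# Leaf fields shared by both layouts (mirroring the schema in the description).
AMOUNT = {"name": "amount", "repetition": "optional", "type": "double"}
CURRENCY = {"name": "currency", "repetition": "optional", "type": "binary (UTF8)"}

# Old layout: the repeated group is itself the element.
old_layout = {
    "name": "tip_history", "repetition": "optional", "annotation": "LIST",
    "fields": [
        {"name": "array", "repetition": "repeated", "fields": [AMOUNT, CURRENCY]},
    ],
}

# New layout: repeated group "list" wraps a single "element" field.
new_layout = {
    "name": "tip_history", "repetition": "optional", "annotation": "LIST",
    "fields": [
        {"name": "list", "repetition": "repeated", "fields": [
            {"name": "element", "repetition": "optional",
             "fields": [AMOUNT, CURRENCY]},
        ]},
    ],
}


def list_layout(list_group):
    """Classify a LIST-annotated group as 'old' or 'new' (simplified rule)."""
    (repeated,) = list_group["fields"]  # a LIST group has exactly one child
    children = repeated.get("fields", [])
    is_three_level = (
        repeated["name"] == "list"
        and len(children) == 1
        and children[0]["name"] == "element"
    )
    return "new" if is_three_level else "old"


def element_fields(list_group):
    """Return the fields of the list's element struct for either layout."""
    (repeated,) = list_group["fields"]
    if list_layout(list_group) == "new":
        return repeated["fields"][0]["fields"]
    return repeated["fields"]
```

Both layouts resolve to the same element fields (`amount`, `currency`). A reader that assumes the old layout for a file written in the new one would instead treat `element` as a field of the array's items, which is why the Bootstrap reader has to be configured for the new format.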