[
https://issues.apache.org/jira/browse/HUDI-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-7874:
---------------------------------
Labels: pull-request-available (was: )
> Fail to read 2-level structure Parquet
> --------------------------------------
>
> Key: HUDI-7874
> URL: https://issues.apache.org/jira/browse/HUDI-7874
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Vitali Makarevich
> Priority: Major
> Labels: pull-request-available
>
> If I have {{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}}
> explicitly set - the only way to be able to write nulls inside arrays -
> Hudi starts to write Parquet files with the following schema inside:
> {{ required group internal_list (LIST) \{
>   repeated group list {
>     required int64 element;
>   }
> }}}
>
> But if I have some files produced before
> {{{}"spark.hadoop.parquet.avro.write-old-list-structure", "false"{}}} was
> set, they have the following schema inside:
> {{ required group internal_list (LIST) \{
>   repeated int64 array;
> }}}
>
> And Hudi 0.14.x, at least, fails to read records from such a file, failing
> with the exception
> {{Caused by: java.lang.RuntimeException: Null-value for required field: }}
> even though the contents of the arrays are {{{}not null{}}} (in fact they
> cannot be null, since Avro requires
> {{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write
> {{{}null{}}}s).
> h3. Expected behavior
> Taken from Hudi 0.12.1 (not sure what exactly broke this):
> # If I have a file with the 2-level structure and an update arrives (no
> matter whether it has nulls inside the array or not - both produce the same
> result) with "spark.hadoop.parquet.avro.write-old-list-structure", "false" -
> rewrite it into the 3-level structure ({*}fails in 0.14.1{*}).
> # If I have the 3-level structure with nulls and an update comes (no matter
> with nulls or without) - read and write correctly.
> A simple reproduction of the issue can be found here:
> [https://github.com/VitoMakarevich/hudi-issue-014]
> Most likely, the problem appeared after Hudi made some changes that caused
> values from the Hadoop conf to start propagating into the Reader instance
> (they were likely not propagated before).
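For context, the flag discussed above is typically set on the Spark session before any write; a minimal sketch, assuming a local PySpark environment (the app name is illustrative, not taken from the report):

```python
from pyspark.sql import SparkSession

# Keys under the "spark.hadoop." prefix are copied into the Hadoop
# Configuration, which is how parquet-avro ends up seeing this flag.
spark = (
    SparkSession.builder
    .appName("hudi-list-structure-repro")  # illustrative name
    # "false" selects the 3-level Parquet list layout - the only layout
    # in which parquet-avro can write null elements inside arrays.
    .config("spark.hadoop.parquet.avro.write-old-list-structure", "false")
    .getOrCreate()
)
```

Files written before this conf was applied keep the old 2-level layout on disk, which is why a single table can end up mixing both layouts.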
--
This message was sent by Atlassian Jira
(v8.20.10#820010)