Vitali Makarevich created HUDI-7874:
---------------------------------------
Summary: Fail to read 2-level structure Parquet
Key: HUDI-7874
URL: https://issues.apache.org/jira/browse/HUDI-7874
Project: Apache Hudi
Issue Type: Bug
Reporter: Vitali Makarevich
If I have {{spark.hadoop.parquet.avro.write-old-list-structure=false}}
explicitly set (the only way to be able to write nulls inside arrays), Hudi
starts to write Parquet files with the following schema inside:
{code}
required group internal_list (LIST) {
  repeated group list {
    required int64 element;
  }
}
{code}
But if some files were produced before
{{spark.hadoop.parquet.avro.write-old-list-structure=false}} was set, they
have the following schema inside:
{code}
required group internal_list (LIST) {
  repeated int64 array;
}
{code}
And Hudi 0.14.x, at least, fails to read records from such a file, with the
exception:
{code}
Caused by: java.lang.RuntimeException: Null-value for required field:
{code}
This happens even though the contents of the arrays are {{not null}} (in fact
they cannot be null, since Avro requires
{{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write
{{null}}s).
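For context, this is how the flag is typically supplied to a Spark job; a minimal sketch, assuming a Spark shell session (the exact submission command is illustrative):

```shell
# Pass the flag through the Spark conf (the "spark.hadoop." prefix forwards it
# to the Hadoop conf) so it reaches the parquet-avro writer.
# With "false", lists are written in the 3-level (nullable-element) layout;
# the parquet-avro default "true" keeps the legacy 2-level layout.
spark-shell \
  --conf spark.hadoop.parquet.avro.write-old-list-structure=false
```

The same value can also be set programmatically on the session conf before writing, e.g. {{spark.conf.set("spark.hadoop.parquet.avro.write-old-list-structure", "false")}}.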
h3. Expected behavior
Taken from Hudi 0.12.1 (not sure what exactly broke it):
# If I have a file with the 2-level structure and an update arrives (no matter
whether the arrays contain nulls or not, both produce the same result) with
{{spark.hadoop.parquet.avro.write-old-list-structure=false}}: rewrite it
into the 3-level structure ({*}fails in 0.14.1{*}).
# If I have the 3-level structure with nulls and an update comes (no matter
whether with nulls or without): read and write correctly.
A simple reproduction of the issue can be found here:
[https://github.com/VitoMakarevich/hudi-issue-014]
Most likely, the problem appeared after Hudi made some changes so that values
from the Hadoop conf started to propagate into the reader instance (they were
likely not propagated before).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)