Vitali Makarevich created HUDI-7874:
---------------------------------------

             Summary: Fail to read 2-level structure Parquet
                 Key: HUDI-7874
                 URL: https://issues.apache.org/jira/browse/HUDI-7874
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Vitali Makarevich


If I explicitly set
{{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} (the only way
to write nulls inside arrays), Hudi starts writing Parquet files with the
following 3-level list schema:
 {{   required group internal_list (LIST) \{
    repeated group list {
      required int64 element;
    }
  }}}
 
But files produced before I set
{{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} have the
following 2-level schema:
 {{  required group internal_list (LIST) \{
    repeated int64 array;
  }}}
 
Hudi 0.14.x (at least) fails to read records from such a file, throwing the
exception
{{Caused by: java.lang.RuntimeException: Null-value for required field: }}

This happens even though the array contents are {{not null}} (in fact they
cannot be null, since Avro requires
{{spark.hadoop.parquet.avro.write-old-list-structure}} = {{false}} to write
{{null}}s).
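
To make the difference between the two layouts above concrete, here is a minimal Python sketch. It is not Hudi or parquet-avro code: {{Field}}, {{list_layout}} and {{element_is_nullable}} are hypothetical names introduced only for illustration, and the classification is a simplified reading of the backward-compatibility rules for LIST in the Parquet format specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Tiny stand-in for a Parquet schema node (hypothetical, not a real API)."""
    name: str
    repetition: str                   # "required", "optional" or "repeated"
    primitive: Optional[str] = None   # e.g. "int64"; None means the node is a group
    children: List["Field"] = field(default_factory=list)

    @property
    def is_group(self) -> bool:
        return self.primitive is None


def list_layout(list_field: Field) -> str:
    """Classify a LIST-annotated group as '2-level' or '3-level' (simplified rules)."""
    if len(list_field.children) != 1 or list_field.children[0].repetition != "repeated":
        raise ValueError("a LIST group must contain exactly one repeated field")
    repeated = list_field.children[0]
    # Legacy layout: the repeated field is a primitive, so it is the element itself.
    if not repeated.is_group:
        return "2-level"
    # Legacy layout: a repeated group with several fields, or a single-field group
    # named "array" or "<list name>_tuple", is also treated as the element.
    legacy_names = ("array", list_field.name + "_tuple")
    if len(repeated.children) != 1 or repeated.name in legacy_names:
        return "2-level"
    # Otherwise the repeated group only wraps the element: the standard layout.
    return "3-level"


def element_is_nullable(list_field: Field) -> bool:
    """Only the 3-level layout has a slot where an element can be marked optional."""
    if list_layout(list_field) != "3-level":
        return False
    return list_field.children[0].children[0].repetition == "optional"


# Schema written with write-old-list-structure = false (from this issue).
new_layout = Field("internal_list", "required", children=[
    Field("list", "repeated", children=[Field("element", "required", "int64")]),
])

# Schema of the files written before the setting was changed (from this issue).
old_layout = Field("internal_list", "required", children=[
    Field("array", "repeated", "int64"),
])

print(list_layout(new_layout))  # 3-level
print(list_layout(old_layout))  # 2-level
```

In the 2-level layout the repeated field is the element, so there is no place to record a null element; a reader that interprets such a file with the 3-level expectations may end up looking for a field that is not there.
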
h3. Expected behavior

Behavior observed in Hudi 0.12.1 (I am not sure what exactly broke it):
 # If I have a file with the 2-level structure and an update arrives with 
{{"spark.hadoop.parquet.avro.write-old-list-structure", "false"}} (whether or 
not the arrays contain nulls, the result is the same), the file is rewritten 
into the 3-level structure. (*fails in 0.14.1*)
 # If I have the 3-level structure with nulls and an update comes (with or 
without nulls), the file is read and written correctly.

A simple reproduction of the issue can be found here:
[https://github.com/VitoMakarevich/hudi-issue-014]

Most likely, the problem appeared after a change in Hudi that made values from 
the Hadoop conf propagate into the Reader instance (they were probably not 
propagated before).



