Zamil Majdy created PARQUET-2342:
------------------------------------

             Summary: Parquet writer produced a corrupted file due to page 
value count overflow
                 Key: PARQUET-2342
                 URL: https://issues.apache.org/jira/browse/PARQUET-2342
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
            Reporter: Zamil Majdy


The Parquet writer checks only the number of rows and the page size when 
deciding whether the content being written still fits in a single page. 

In the case of a composite column (e.g. array/map) with a lot of nulls, it is 
possible to accumulate more than 2 billion values in one page while staying 
under the default page-size and row-count thresholds (1 MB, 20000 rows).

 

Repro using Spark:

{{      val dir = "/tmp/anyrandomDirectory"}}

{{      spark.range(0, 20000, 1, 1)}}
{{        .selectExpr("array_repeat(cast(null as binary), 110000) as n")}}
{{        .write}}
{{        .mode("overwrite")}}
{{        .save(dir)}}

{{      val result = spark}}
{{        .sql(s"select * from parquet.`$dir` limit 1000")}}
{{        .collect() // This will break}}
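
The numbers in the repro can be checked with a small sketch. This is only 
illustrative arithmetic, not parquet-mr code: the object name PageValueCount 
is made up, and it assumes that all 20000 rows land in a single page and that 
the page value count is kept in a signed 32-bit integer.

```scala
// Back-of-the-envelope check of the repro's numbers (not parquet-mr code).
object PageValueCount {
  val rows: Long = 20000L           // spark.range(0, 20000, 1, 1)
  val valuesPerRow: Long = 110000L  // array_repeat(cast(null as binary), 110000)

  // Total null values, assuming every row ends up in the same page.
  val totalValues: Long = rows * valuesPerRow

  // What a signed 32-bit counter would hold after wrapping around.
  val asInt: Int = totalValues.toInt

  def main(args: Array[String]): Unit = {
    println(s"total values  = $totalValues")
    println(s"Int.MaxValue  = ${Int.MaxValue}")
    println(s"overflows Int = ${totalValues > Int.MaxValue}")
    println(s"wrapped count = $asInt")
  }
}
```

Under these assumptions the total is 2,200,000,000 values, which is above 
Int.MaxValue (2,147,483,647), so a 32-bit count would wrap to a negative 
number.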



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
