Zamil Majdy created PARQUET-2342:
------------------------------------
Summary: Parquet writer produced a corrupted file due to page
value count overflow
Key: PARQUET-2342
URL: https://issues.apache.org/jira/browse/PARQUET-2342
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Zamil Majdy
The Parquet writer checks only the row count and the page size when deciding
whether the buffered content still fits in a single page.
For a composite column (e.g. array/map) with a lot of nulls, a single page can
accumulate more than 2 billion values, enough to overflow the page value count,
while staying under the default page-size and row-count thresholds
(1 MB, 20,000 rows).
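To make the arithmetic concrete, here is a minimal sketch using the numbers from this report (20,000 rows per page, 110,000 null elements per row). It assumes the page value count is held in a 32-bit signed integer; the object and method names are hypothetical and are not part of parquet-mr.

```scala
// Illustrative sketch only: PageValueCount and its methods are hypothetical
// names, not parquet-mr APIs. The constants come from the report.
object PageValueCount {
  // Total leaf values buffered for one page, computed in 64 bits.
  def totalValues(rows: Long, valuesPerRow: Long): Long = rows * valuesPerRow

  // What the same count looks like once narrowed to a 32-bit signed Int
  // (assumption: the page value count is stored as a 32-bit signed integer).
  def asInt32(count: Long): Int = count.toInt

  def main(args: Array[String]): Unit = {
    val rows = 20000L          // default page row-count limit cited above
    val valuesPerRow = 110000L // array_repeat(..., 110000) in the repro
    val total = totalValues(rows, valuesPerRow)

    println(s"total values  : $total")
    println(s"Int.MaxValue  : ${Int.MaxValue}")
    println(s"fits in Int?  : ${total <= Int.MaxValue}")
    println(s"as 32-bit Int : ${asInt32(total)}")
  }
}
```

Because the elements are all null, they add very little to the page's byte size, so neither the 1 MB limit nor the 20,000-row limit forces a page flush before the count passes `Int.MaxValue`.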
Repro using Spark:
{{ val dir = "/tmp/anyrandomDirectory"}}
{{ spark.range(0, 20000, 1, 1)}}
{{ .selectExpr("array_repeat(cast(null as binary), 110000) as n")}}
{{ .write}}
{{ .mode("overwrite")}}
{{ .save(dir)}}
{{ val result = spark}}
{{ .sql(s"select * from parquet.`$dir` limit 1000")}}
{{ .collect() // This will break}}