[
https://issues.apache.org/jira/browse/PARQUET-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
shyam narayan singh updated PARQUET-1575:
-----------------------------------------
Summary: Parquet reader throws error "Reading past RLE/BitPacking stream"
for parquet file with null values (was: Parquet reader throws error "Reading
past RLE/BitPacking stream")
> Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet
> file with null values
> --------------------------------------------------------------------------------------------------
>
> Key: PARQUET-1575
> URL: https://issues.apache.org/jira/browse/PARQUET-1575
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: shyam narayan singh
> Priority: Major
>
> We recently moved from Parquet 1.8.x to 1.12.
> The dataset has more than 20k null values to be written to a complex type.
> With 1.8.x this produced a single page, but with 1.12 it produces 20 pages
> (PARQUET-1414). Writing nulls to complex types has been optimised so that
> they are cached (the null cache) and flushed on the next non-null value or
> on an explicit flush/close. With 1.8, the writer would reach the explicit
> close, flush the null cache and then write the page. With 1.12, the page is
> written prematurely once 20k values have been encountered.
>
> Below is the metadata dump in both cases.
> 1.8 :
> index._id TV=111396 RL=0 DL=2
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396
>
> 1.12 :
> index._index TV=111396 RL=0 DL=2
> ----------------------------------------------------------------------------
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
> ......
> page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396
>
> All the pages in 1.12 except the last one have the same metadata. The issue
> shows up when the Parquet reader kicks in: it sees that the RLE is bit packed
> and tries to read 8 bytes, which runs past the end of the stream because the
> page size is only 4 (hence "Reading past RLE/BitPacking stream").
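
For illustration only, the writer behaviour described in the report can be modelled with a toy class. This is a sketch under stated assumptions, not parquet-mr code: the class NullCacheSketch, its methods and the rowLimit parameter are all hypothetical, and the model assumes a page is cut purely on a row count without flushing the null cache first. It reproduces the shape of the two dumps (empty pages followed by one page carrying every value, versus a single page), but not the exact count of 20 pages.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: names and logic are illustrative, not parquet-mr's classes.
class NullCacheSketch {
    private final int rowLimit;      // hypothetical per-page row limit
    private int cachedNulls = 0;     // nulls buffered in the cache, not yet in the page
    private int valuesInPage = 0;    // values actually written to the open page
    private int rowsSincePage = 0;   // rows seen since the last page was cut
    private final List<Integer> pageValueCounts = new ArrayList<>();

    NullCacheSketch(int rowLimit) {
        this.rowLimit = rowLimit;
    }

    // A null row is only counted in the cache; nothing reaches the page yet.
    void writeNullRow() {
        cachedNulls++;
        endRow();
    }

    // A non-null row flushes the cached nulls before writing its own value.
    void writeValueRow() {
        flushNullCache();
        valuesInPage++;
        endRow();
    }

    private void endRow() {
        rowsSincePage++;
        if (rowsSincePage >= rowLimit) {
            // Assumed behaviour: the page is cut without flushing the null
            // cache, so a run of nulls yields a page with a value count of 0.
            writePage();
        }
    }

    private void flushNullCache() {
        valuesInPage += cachedNulls;
        cachedNulls = 0;
    }

    private void writePage() {
        pageValueCounts.add(valuesInPage);
        valuesInPage = 0;
        rowsSincePage = 0;
    }

    // Explicit close: flush the cache, then write the final page.
    List<Integer> close() {
        flushNullCache();
        writePage();
        return pageValueCounts;
    }

    // Writes nullRows consecutive null rows and returns the value count of each page.
    static List<Integer> simulate(int nullRows, int rowLimit) {
        NullCacheSketch writer = new NullCacheSketch(rowLimit);
        for (int i = 0; i < nullRows; i++) {
            writer.writeNullRow();
        }
        return writer.close();
    }

    public static void main(String[] args) {
        // With a row limit: empty pages, then one page holding all the nulls.
        System.out.println(simulate(111396, 20000));             // [0, 0, 0, 0, 0, 111396]
        // Without an effective limit: a single page, as in the 1.8 dump.
        System.out.println(simulate(111396, Integer.MAX_VALUE)); // [111396]
    }
}
```

In this model the fix would be to flush the null cache before cutting a page; whether that is the appropriate change in parquet-mr itself is not something the sketch can establish.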
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)