shyam narayan singh created PARQUET-1575:
--------------------------------------------
Summary: Parquet reader throws error "Reading past RLE/BitPacking stream"
Key: PARQUET-1575
URL: https://issues.apache.org/jira/browse/PARQUET-1575
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.12.0
Reporter: shyam narayan singh
We recently moved from Parquet 1.8.x to 1.12.
The dataset has more than 20k null values to be written to a complex type. With 1.8.x this produced a single page, but with 1.12 it produces 20 pages (PARQUET-1414). Writing nulls to complex types is optimised by caching them (the null cache), which is flushed on the next non-null value or on an explicit flush/close. With 1.8, the explicit close flushed the null cache and wrote the single page. With 1.12, after encountering 20k values, the page is written prematurely, while the null cache is still unflushed.
Below are the metadata dumps for both cases.
1.8:
index._id TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396
1.12:
index._index TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0:  DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
......
page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396
All the pages in 1.12 except the last one have the same metadata. The issue surfaces when the Parquet reader kicks in: it sees that the levels are BIT_PACKED and reads 8 bytes, which goes past the end of the stream since the page size is only 4, hence "Reading past RLE/BitPacking stream".
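Reading the file back with a plain ParquetReader is enough to hit the error; again a sketch assuming the example API (GroupReadSupport) and the illustrative path used above:

{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ReadNullComplexType {
  public static void main(String[] args) throws Exception {
    Path file = new Path("/tmp/null-complex.parquet");
    try (ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), file).build()) {
      // Expected to fail on one of the prematurely written pages with
      // "Reading past RLE/BitPacking stream".
      while (reader.read() != null) {
        // drain all records
      }
    }
  }
}
{code}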