shyam narayan singh created PARQUET-1575:
--------------------------------------------

             Summary: Parquet reader throws error "Reading past RLE/BitPacking 
stream"
                 Key: PARQUET-1575
                 URL: https://issues.apache.org/jira/browse/PARQUET-1575
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.12.0
            Reporter: shyam narayan singh


Recently moved from Parquet 1.8.x to 1.12.

The dataset has > 20k null values to be written to a complex type. With 1.8.x this 
would create a single page, but with 1.12 it creates 20 pages (PARQUET-1414). 
Writing nulls to complex types has been optimised so that they are cached (the 
null cache) and flushed on the next non-null value or on an explicit flush/close. 
With 1.8, the explicit close would flush the null cache and write the single page. 
But with 1.12, after 20k values are encountered, the page is written prematurely.
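
For reference, a minimal write sketch of the kind of data that hits this. The 
class name, schema, path and the use of the parquet-mr example Group API are my 
assumptions for illustration (our actual writer is different); the row count and 
RL=0/DL=2 shape are taken from the dump below.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class NullPagesRepro {
  public static void main(String[] args) throws Exception {
    // Assumed schema: an optional group with an optional field,
    // so the nested column has RL=0 / DL=2 as in the dump below.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message doc {\n"
        + "  optional group index {\n"
        + "    optional binary _id (UTF8);\n"
        + "  }\n"
        + "}");

    SimpleGroupFactory factory = new SimpleGroupFactory(schema);
    Path file = new Path("/tmp/null-pages.parquet"); // assumed path

    try (ParquetWriter<Group> writer =
             ExampleParquetWriter.builder(file).withType(schema).build()) {
      // Write well over 20k records whose complex column is entirely null.
      // With 1.8.x these nulls sit in the null cache until close (one page);
      // with 1.12 the page row-count limit (PARQUET-1414) cuts pages early.
      for (int i = 0; i < 111_396; i++) {
        writer.write(factory.newGroup()); // index group left unset => null
      }
    }
  }
}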

 

Below is the metadata dump in both cases.

1.8:

index._id TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396

 

1.12:

index._index TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
......
page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396

All the pages in 1.12 except the last one have the same metadata. Now the issue is 
that when the Parquet reader kicks in, it sees that the RLE run is bit-packed and 
reads 8 bytes, which goes past the end of the stream since the size is only 4 
("Reading past RLE/BitPacking stream").
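
Simply scanning the file back is enough to surface the error. A sketch using the 
parquet-mr example Group reader; the path and class name are assumptions matching 
the write sketch above.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class NullPagesRead {
  public static void main(String[] args) throws Exception {
    Path file = new Path("/tmp/null-pages.parquet"); // same assumed path as above
    try (ParquetReader<Group> reader =
             ParquetReader.builder(new GroupReadSupport(), file).build()) {
      // Iterating all rows forces every page to be decoded; with the
      // mis-sized pages this is where "Reading past RLE/BitPacking stream"
      // is thrown.
      while (reader.read() != null) {
        // no-op; just scan
      }
    }
  }
}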


