[ 
https://issues.apache.org/jira/browse/PARQUET-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shyam narayan singh updated PARQUET-1575:
-----------------------------------------
    Summary: Parquet reader throws error "Reading past RLE/BitPacking stream" 
for parquet file with null values  (was: Parquet reader throws error "Reading 
past RLE/BitPacking stream")

> Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet 
> file with null values
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1575
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1575
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: shyam narayan singh
>            Priority: Major
>
> Recently moved from Parquet 1.8.x to 1.12.
> The dataset has more than 20k null values to be written to a complex type.
> With 1.8.x, the writer created a single page, but with 1.12 it creates 20
> pages (PARQUET-1414). Writing nulls to complex types has been optimised so
> that they are cached (null cache) and flushed on the next non-null value or
> on an explicit flush/close. With 1.8, the writer would reach the explicit
> close, flush the null cache and write the page. With 1.12, the page is
> written prematurely after 20k values, before the null cache has been flushed.
>  
> Below is the metadata dump in both cases.
> 1.8 :
> index._id TV=111396 RL=0 DL=2 
> ---------------------------------------------------------------------------- 
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396
>  
> 1.12 :
> index._index TV=111396 RL=0 DL=2 
> ---------------------------------------------------------------------------- 
> page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
> ......
> page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396
> All the pages in 1.12 except the last page have the same metadata. The issue
> arises when the Parquet reader kicks in: it sees that the RLE is bit packed
> and reads 8 bytes, which goes beyond the stream as the size is only 4
> (Reading past RLE/BitPacking stream).
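
The interaction described above, between a deferred null cache and a row-count-based page flush, can be illustrated with a small toy model. This is only a sketch under stated assumptions, not parquet-mr code: the class `ToyColumnWriter`, its fields and the `simulate` helper are hypothetical names, the numbers are deliberately small, and the model does not try to reproduce the exact page count (20) reported in the dump. It only shows how pages with a value count of 0 can appear when a page is cut before the cached nulls are flushed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyColumnWriter:
    """Hypothetical toy model: nulls are cached, pages may be cut by row count."""
    page_row_limit: Optional[int] = None  # None models "no row-count limit"
    null_cache: int = 0                   # nulls seen but not yet written to a page
    rows_since_page: int = 0              # rows counted toward the current page
    values_in_page: int = 0               # values actually written to the current page
    pages: List[int] = field(default_factory=list)  # value count (VC) per page

    def write_null(self) -> None:
        # Nulls are only counted in the cache; nothing reaches the page yet.
        self.null_cache += 1
        self._end_row()

    def write_value(self) -> None:
        # A non-null value first flushes the cached nulls, then writes itself.
        self._flush_null_cache()
        self.values_in_page += 1
        self._end_row()

    def _flush_null_cache(self) -> None:
        self.values_in_page += self.null_cache
        self.null_cache = 0

    def _end_row(self) -> None:
        self.rows_since_page += 1
        limit_hit = (self.page_row_limit is not None
                     and self.rows_since_page >= self.page_row_limit)
        if limit_hit:
            # Assumed failure mode: the page is cut WITHOUT flushing the null cache.
            self._write_page()

    def _write_page(self) -> None:
        self.pages.append(self.values_in_page)
        self.values_in_page = 0
        self.rows_since_page = 0

    def close(self) -> List[int]:
        # Explicit close flushes the cache, so the last page carries every null.
        self._flush_null_cache()
        self._write_page()
        return self.pages


def simulate(num_nulls: int, page_row_limit: Optional[int]) -> List[int]:
    """Write only nulls, then close; return the value count of each page."""
    writer = ToyColumnWriter(page_row_limit=page_row_limit)
    for _ in range(num_nulls):
        writer.write_null()
    return writer.close()


if __name__ == "__main__":
    print(simulate(10, None))  # [10]        one page holding all nulls
    print(simulate(10, 4))     # [0, 0, 10]  empty pages, then one page with all nulls
```

In this model the second run mirrors the shape of the 1.12 dump: every page except the last has VC 0, and the last page carries the full value count.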



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
