[ https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781231#comment-17781231 ]

ASF GitHub Bot commented on DRILL-8458:
---------------------------------------

handmadecode commented on PR #2838:
URL: https://github.com/apache/drill/pull/2838#issuecomment-1786560246

   PR updated with refactored test code




> Reading Parquet v2 data page with repetition levels larger than column data 
> throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8458
>                 URL: https://issues.apache.org/jira/browse/DRILL-8458
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.21.1
>            Reporter: Peter Franzen
>            Assignee: James Turton
>            Priority: Major
>             Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is 
> larger than the size of the column data bytes, 
> {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
> throws an {{IllegalArgumentException}}. This is caused by trying to set 
> the limit of a ByteBuffer to a value larger than its capacity.
>  
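> As a minimal illustration of the underlying JDK behaviour (this is not Drill 
> code, just the {{ByteBuffer}} contract the reader runs into):
> {code:java}
> import java.nio.ByteBuffer;
> 
> public class LimitBeyondCapacity {
>   public static void main(String[] args) {
>     // Capacity stands in for repLevelSize + defLevelSize + column data size.
>     ByteBuffer buf = ByteBuffer.allocate(8);
> 
>     // Requesting a limit larger than the capacity throws
>     // java.lang.IllegalArgumentException ("newLimit > capacity" on recent JDKs).
>     buf.limit(9);
>   }
> }
> {code}
>  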
> The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
>  
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize;  {code}
>  
> The buffer {{pageBuf}} contains the repetition level bytes followed by the 
> definition level bytes followed by the column data bytes.
>  
> The code at lines 217-221 reads the repetition level bytes, and then updates 
> the position of the {{pageBuf}} buffer to the start of the definition level 
> bytes (lines 222 and 224).
>  
> The code at lines 225-227 reads the definition level bytes, but when creating 
> a slice of the {{pageBuf}} buffer containing the definition level bytes, the 
> slice's limit is set as if the position were still at the beginning of the 
> repetition level bytes (line 226), i.e. as if it had not been updated.
>  
> This means that the call to {{limit()}} will throw whenever the requested 
> limit ({{repLevelSize}} + {{defLevelSize}}) exceeds the capacity of the slice, 
> which is {{defLevelSize}} + the size of the column data bytes, i.e. whenever 
> the repetition level bytes are larger than the column data bytes.
>  
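> A minimal standalone sketch of the failing arithmetic, using illustrative 
> sizes (4 bytes of repetition levels, 2 bytes of definition levels, 1 byte of 
> column data); the variable names mirror those in 
> {{ColumnChunkIncReadStore}} but the snippet is independent of Drill:
> {code:java}
> import java.nio.ByteBuffer;
> 
> public class SliceLimitSketch {
>   public static void main(String[] args) {
>     int repLevelSize = 4;
>     int defLevelSize = 2;
>     int dataSize = 1;   // repetition levels are larger than the column data
>     ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);
> 
>     int pageBufOffset = repLevelSize; // position after the repetition levels
>     ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 
>     // slice() starts at the current position, so its capacity is
>     // defLevelSize + dataSize = 3, but the requested limit is
>     // pageBufOffset + defLevelSize = 6 -> IllegalArgumentException.
>     bb.slice().limit(pageBufOffset + defLevelSize);
>   }
> }
> {code}
>  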
> The fix is to change line 226 to
> {code:java}
>   (ByteBuffer) bb.slice().limit(defLevelSize){code}
>  
> For symmetry, line 220 could also be changed to
> {code:java}
>   (ByteBuffer) bb.slice().limit(repLevelSize){code}
>  
> although {{pageBufOffset}} is always 0 there, so the existing expression 
> cannot cause the limit to exceed the capacity.
>  
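> For reference, a sketch of the affected block with both changes applied (the 
> exact patch in PR #2838 may differ):
> {code:java}
> int pageBufOffset = 0;
> ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> // Each slice already starts at the current position, so its limit should be
> // the length of the section it covers, not an offset into pageBuf.
> BytesInput repLevelBytes = BytesInput.from(
>   (ByteBuffer) bb.slice().limit(repLevelSize)
> );
> pageBufOffset += repLevelSize;
> 
> bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> final BytesInput defLevelBytes = BytesInput.from(
>   (ByteBuffer) bb.slice().limit(defLevelSize)
> );
> pageBufOffset += defLevelSize;
> {code}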



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
