[ https://issues.apache.org/jira/browse/DRILL-8458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781231#comment-17781231 ]
ASF GitHub Bot commented on DRILL-8458:
---------------------------------------

handmadecode commented on PR #2838:
URL: https://github.com/apache/drill/pull/2838#issuecomment-1786560246

   PR updated with refactored test code

> Reading Parquet v2 data page with repetition levels larger than column data throws IllegalArgumentException
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8458
>                 URL: https://issues.apache.org/jira/browse/DRILL-8458
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.21.1
>            Reporter: Peter Franzen
>            Assignee: James Turton
>            Priority: Major
>             Fix For: 1.22.0
>
>
> When the size of the repetition level bytes in a Parquet v2 data page is
> larger than the size of the column data bytes,
> {{org.apache.parquet.hadoop.ColumnChunkIncReadStore$ColumnChunkIncPageReader::readPage}}
> throws an {{IllegalArgumentException}}. This is caused by trying to set
> the limit of a ByteBuffer to a value larger than its capacity.
>
> The offending code is at line 226 in {{ColumnChunkIncReadStore.java}}:
>
> {code:java}
> 217 int pageBufOffset = 0;
> 218 ByteBuffer bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 219 BytesInput repLevelBytes = BytesInput.from(
> 220   (ByteBuffer) bb.slice().limit(pageBufOffset + repLevelSize)
> 221 );
> 222 pageBufOffset += repLevelSize;
> 223
> 224 bb = (ByteBuffer) pageBuf.position(pageBufOffset);
> 225 final BytesInput defLevelBytes = BytesInput.from(
> 226   (ByteBuffer) bb.slice().limit(pageBufOffset + defLevelSize)
> 227 );
> 228 pageBufOffset += defLevelSize;
> {code}
>
> The buffer {{pageBuf}} contains the repetition level bytes followed by the
> definition level bytes followed by the column data bytes.
>
> The code at lines 217-221 reads the repetition level bytes, and then updates
> the position of the {{pageBuf}} buffer to the start of the definition level
> bytes (lines 222 and 224).
>
> The code at lines 225-227 reads the definition level bytes. When it creates
> a slice of the {{pageBuf}} buffer containing the definition level bytes, the
> slice's limit is set as if the position were still at the beginning of the
> repetition level bytes (line 226), i.e. as if it had not been updated.
>
> This means that if the capacity of the {{pageBuf}} buffer (which is the size
> of the repetition level bytes + the size of the definition level bytes + the
> size of the column data bytes) is less than (repLevelSize + repLevelSize +
> defLevelSize), the call to {{limit()}} will throw.
>
> The fix is to change line 226 to
> {code:java}
>   (ByteBuffer) bb.slice().limit(defLevelSize){code}
>
> For symmetry, line 220 could also be changed to
> {code:java}
>   (ByteBuffer) bb.slice().limit(repLevelSize){code}
>
> although {{pageBufOffset}} is always 0 there and will not cause the limit to
> exceed the capacity.
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
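
The failure described in the issue can be reproduced with a plain java.nio.ByteBuffer, without Parquet or Drill on the classpath. The sketch below is illustrative only: the class LevelSliceDemo, its two helper methods, and the sizes 4, 2 and 1 are made up for this example and are not taken from ColumnChunkIncReadStore.java. It relies on the documented behaviour that slice() returns a buffer whose position is 0 and whose capacity is the number of bytes remaining in the source buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; not part of Drill or Parquet.
final class LevelSliceDemo {

    // Mirrors the slicing at line 226 of the issue: the limit adds the
    // offset again, although the slice already starts at that offset.
    static ByteBuffer sliceAsReported(ByteBuffer pageBuf, int offset, int size) {
        ByteBuffer bb = (ByteBuffer) pageBuf.position(offset);
        return (ByteBuffer) bb.slice().limit(offset + size);
    }

    // Mirrors the proposed fix: the limit is relative to the slice,
    // so it is just the number of bytes wanted.
    static ByteBuffer sliceAsFixed(ByteBuffer pageBuf, int offset, int size) {
        ByteBuffer bb = (ByteBuffer) pageBuf.position(offset);
        return (ByteBuffer) bb.slice().limit(size);
    }

    public static void main(String[] args) {
        // Assumed sizes, chosen so that repLevelSize > dataSize.
        int repLevelSize = 4;
        int defLevelSize = 2;
        int dataSize = 1;

        // Capacity 7: repetition levels, then definition levels, then data.
        ByteBuffer pageBuf = ByteBuffer.allocate(repLevelSize + defLevelSize + dataSize);

        // The slice taken at offset 4 has capacity 3, but the reported
        // code asks for a limit of 4 + 2 = 6.
        try {
            sliceAsReported(pageBuf, repLevelSize, defLevelSize);
            System.out.println("reported slicing did not throw");
        } catch (IllegalArgumentException e) {
            System.out.println("reported slicing threw IllegalArgumentException");
        }

        // With the fix the limit is 2, which fits in the slice's capacity of 3.
        ByteBuffer defLevels = sliceAsFixed(pageBuf, repLevelSize, defLevelSize);
        System.out.println("fixed slice remaining = " + defLevels.remaining());
    }
}
```

With these sizes the buffer capacity (7) is less than repLevelSize + repLevelSize + defLevelSize (10), so the first call throws and the second returns a 2-byte slice. The casts to ByteBuffer are kept as in the quoted code so that the sketch also compiles on Java 8, where position() and limit() return Buffer.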