ArnavBalyan opened a new issue, #3320:
URL: https://github.com/apache/parquet-java/issues/3320

   ### Describe the enhancement requested
   
   Parquet-go can write extra entries in the column index of a single column 
chunk. This causes parquet-java to fail with 
java.lang.ArrayIndexOutOfBoundsException when it uses column indexes.
    - parquet-java expects the column index and the offset index to have the 
same number of entries (one entry per data page).
    - When the counts do not match, the reader appears to look up a 
nonexistent page (based on the column stats) and fails (stack trace below).
   
   Structure of the indexes that leads to the failure:
   DEBUG: COLUMN1 ColumnIndex=5 stats
     Entry 0: nullPage=false, nullCount=0, min='', max=''
     Entry 1: nullPage=false, nullCount=0, min='xyz', max='abc'
     Entry 2: nullPage=false, nullCount=0, min='xyz', max='abc'
     Entry 3: nullPage=false, nullCount=0, min='xyz', max='abc'
     Entry 4: nullPage=false, nullCount=0, min='xyz', max='abc'
   
   DEBUG: COLUMN1 OffsetIndex=4 pages
     Entry 0: offset=..., compressedSize=..., firstRowIndex=0
     Entry 1: offset=..., compressedSize=..., firstRowIndex=30
     Entry 2: offset=..., compressedSize=..., firstRowIndex=60
     Entry 3: offset=..., compressedSize=..., firstRowIndex=90
   
   (Here the column index has one more entry than the offset index has data 
pages: 5 entries for 4 pages.)
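
To make the failure mode concrete, here is a minimal, self-contained sketch that uses only the counts from the dump above. It is a simplified model and not the parquet-java implementation: `IndexMismatchDemo`, `firstRowIndex`, and `firstFailingPage` are hypothetical names, and the array stands in for whatever backs the offset index lookup.

```java
public class IndexMismatchDemo {

    // firstRowIndex values taken from the OffsetIndex dump: 4 pages.
    static final long[] FIRST_ROW_INDEXES = {0, 30, 60, 90};

    // The ColumnIndex dump has 5 entries, so page indexes 0..4 can be selected.
    static final int COLUMN_INDEX_ENTRIES = 5;

    /** Hypothetical stand-in for an array-backed offset index lookup. */
    static long firstRowIndex(int pageIndex) {
        return FIRST_ROW_INDEXES[pageIndex];
    }

    /** Returns the first page index whose lookup fails, or -1 if every lookup succeeds. */
    static int firstFailingPage() {
        for (int page = 0; page < COLUMN_INDEX_ENTRIES; page++) {
            try {
                firstRowIndex(page);
            } catch (ArrayIndexOutOfBoundsException e) {
                // The column index refers to a page the offset index does not describe.
                return page;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("first failing page: " + firstFailingPage());
    }
}
```

In this model the lookup for page index 4 is the one that throws, since the offset index only describes pages 0 to 3.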
   
   
   There are two potential ways to fix this:
   1. Detect the inconsistent metadata and do not rely on the column index 
when it is unsuitable or incomplete. Log a WARN and continue reading the file 
without the index, which reads the file correctly.
   2. Throw a clear exception describing the missing/unhandled metadata (in 
this case, the mismatch between the number of column index and offset index 
entries), so the user can disable column indexes or fix the file upstream.
   
   
   Stack trace:
   ```
           at org.apache.parquet.internal.column.columnindex.OffsetIndexBuilder$OffsetIndexImpl.getFirstRowIndex(OffsetIndexBuilder.java:66)
           at org.apache.parquet.internal.filter2.columnindex.RowRanges.create(RowRanges.java:144)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.applyPredicate(ColumnIndexFilter.java:189)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:126)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.visit(ColumnIndexFilter.java:57)
           at org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:192)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:87)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter$1.visit(ColumnIndexFilter.java:82)
           at org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:149)
           at org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:82)
           at org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:1219)
           at org.apache.parquet.hadoop.ParquetFileReader.getFilteredRecordCount(ParquetFileReader.java:875)
   ...
   ```
   
   ### Component(s)
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

