shangxinli commented on pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#issuecomment-723645208


   @rdblue I looked at the implementation of ColumnIndex. Here is what we need 
to do to rewrite this filter and use it in Iceberg. 
   1. Rewrite the equivalent of the filters in Parquet's ColumnIndexFilter 
class. 
   2. Rewrite ParquetFileReader#getRowRanges(), which applies the column index 
filters and returns the matching row ranges (RowRanges).  
   3. Use the filtered rowRanges from step 2 to build a PageReadStore like 
ColumnChunkPageReadStore: 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L976
 Unfortunately, ColumnChunkPageReadStore is not a public class, so Iceberg 
cannot use it directly. We would have to either write our own PageReadStore or 
ask Parquet to make it public. 
   4. Use the filtered rowRanges to get the filteredOffsetIndex, which tells us 
which parts need to be read. Then we need SynchronizingColumnReader to keep the 
columns in sync. Unfortunately, SynchronizingColumnReader is not a public class 
either, so we would need to rewrite it in Iceberg.
   5. Wrap steps 2, 3, and 4 into a method that replaces 
readNextFilteredRowGroup() in Parquet. 
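
   To make steps 2 to 4 concrete, here is a rough, self-contained sketch of the 
idea. It is not Parquet's or Iceberg's code: the class RowRangeSketch, its 
method names, and the int min/max arrays are all made up for illustration, and 
it only handles an equality predicate. It computes candidate row ranges per 
column from page min/max stats, intersects them across columns, and then picks 
the pages each column still has to read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Parquet or Iceberg.
final class RowRangeSketch {

  // Step 2 (simplified): rows that may match "col == value", derived from per-page min/max.
  // firstRow[i] is the index of the first row of page i; ranges are inclusive {from, to}.
  static List<long[]> matchingRanges(
      long[] firstRow, long rowCount, int[] mins, int[] maxs, int value) {
    List<long[]> out = new ArrayList<>();
    for (int i = 0; i < firstRow.length; i++) {
      if (value < mins[i] || value > maxs[i]) {
        continue; // page stats prove that no row in this page can match
      }
      long from = firstRow[i];
      long to = (i + 1 < firstRow.length ? firstRow[i + 1] : rowCount) - 1;
      if (!out.isEmpty() && out.get(out.size() - 1)[1] + 1 == from) {
        out.get(out.size() - 1)[1] = to; // merge with the adjacent previous range
      } else {
        out.add(new long[] {from, to});
      }
    }
    return out;
  }

  // AND of two predicates: intersect two sorted, non-overlapping range lists.
  static List<long[]> intersect(List<long[]> a, List<long[]> b) {
    List<long[]> out = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < a.size() && j < b.size()) {
      long from = Math.max(a.get(i)[0], b.get(j)[0]);
      long to = Math.min(a.get(i)[1], b.get(j)[1]);
      if (from <= to) {
        out.add(new long[] {from, to});
      }
      if (a.get(i)[1] < b.get(j)[1]) {
        i++;
      } else {
        j++;
      }
    }
    return out;
  }

  // Steps 3 and 4 (simplified): indexes of the pages of one column that overlap the row ranges.
  static List<Integer> pagesToRead(long[] firstRow, long rowCount, List<long[]> ranges) {
    List<Integer> pages = new ArrayList<>();
    for (int i = 0; i < firstRow.length; i++) {
      long from = firstRow[i];
      long to = (i + 1 < firstRow.length ? firstRow[i + 1] : rowCount) - 1;
      for (long[] r : ranges) {
        if (r[0] <= to && from <= r[1]) {
          pages.add(i);
          break;
        }
      }
    }
    return pages;
  }

  public static void main(String[] args) {
    long rowCount = 100;
    long[] aFirstRow = {0, 40, 80};     // column A has 3 pages
    long[] bFirstRow = {0, 25, 50, 75}; // column B has 4 pages with different boundaries
    List<long[]> a =
        matchingRanges(aFirstRow, rowCount, new int[] {1, 10, 20}, new int[] {9, 19, 29}, 15);
    List<long[]> b =
        matchingRanges(bFirstRow, rowCount, new int[] {0, 5, 8, 6}, new int[] {4, 7, 9, 7}, 7);
    List<long[]> both = intersect(a, b);
    for (long[] r : both) {
      System.out.println("rows " + r[0] + ".." + r[1]);
    }
    System.out.println("A pages: " + pagesToRead(aFirstRow, rowCount, both));
    System.out.println("B pages: " + pagesToRead(bFirstRow, rowCount, both));
  }
}
```

   In this toy example the selected pages of A and B do not start at the same 
row, so a reader has to skip rows inside each page to keep the columns aligned. 
That is roughly the job SynchronizingColumnReader does in step 4.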
   
   The work seems to be more than just rewriting the filter: it also means 
rewriting ColumnChunkPageReadStore, SynchronizingColumnReader, and part of 
ParquetFileReader. This makes me wonder whether we should continue rewriting or 
reuse Parquet's implementation instead. If we choose to reuse, we can either 
limit the filter so that it does not support startsWith, in, and so on, or add 
those to Parquet. Let me know your thoughts.  
   
   
   


