shangxinli commented on pull request #1566: URL: https://github.com/apache/iceberg/pull/1566#issuecomment-723645208
@rdblue I looked at the implementation of ColumnIndex. Here is what we would need to do to rewrite this filter and use it in Iceberg:

1. Rewrite the equivalent filters in Parquet's ColumnIndexFilter class.
2. Rewrite ParquetFileReader#getRowRanges(), which applies the column index filters and produces the row ranges (RowRanges) to read (see the first sketch below).
3. Use the filtered RowRanges from step 2 to build a PageReadStore like ColumnChunkPageReadStore (https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L976). Unfortunately, ColumnChunkPageReadStore is not a public class that Iceberg can use, so we would either have to write our own PageReadStore or ask Parquet to make it public.
4. Use the filtered RowRanges to compute the filtered OffsetIndex, which tells us which parts to read. We then need SynchronizingColumnReader to keep the columns synchronized; unfortunately, SynchronizingColumnReader is not a public class either, so we would have to rewrite it in Iceberg.
5. Wrap steps 2, 3, and 4 into a method that replaces readNextFilteredRowGroup() in Parquet.

The work is more than just rewriting the filter: it also means rewriting ColumnChunkPageReadStore, SynchronizingColumnReader, and parts of ParquetFileReader. That makes me wonder whether we should keep rewriting or reuse Parquet's implementation instead (see the second sketch below). If we choose to reuse, we can either limit the filter so it doesn't support startsWith, in, etc., or add those predicates to Parquet. Let me know your thoughts.
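For context on step 2, here is a rough sketch of what an Iceberg-side getRowRanges() could look like, assuming we are willing to call Parquet's internal ColumnIndexFilter.calculateRowRanges() together with the public readColumnIndex()/readOffsetIndex() methods on ParquetFileReader. The class and helper names are made up for illustration; this is not a proposed implementation:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter;
import org.apache.parquet.internal.filter2.columnindex.ColumnIndexStore;
import org.apache.parquet.internal.filter2.columnindex.RowRanges;

// Hypothetical helper, not part of Iceberg or Parquet.
class RowRangesSketch {

  // Computes the row ranges that survive column-index filtering for one row group,
  // mirroring what ParquetFileReader#getRowRanges() does internally.
  static RowRanges rowRangesFor(ParquetFileReader reader, BlockMetaData rowGroup,
                                FilterPredicate predicate) {
    // Expose this row group's column/offset indexes through the ColumnIndexStore
    // interface that calculateRowRanges() expects.
    ColumnIndexStore store = new ColumnIndexStore() {
      @Override
      public ColumnIndex getColumnIndex(ColumnPath column) {
        try {
          return reader.readColumnIndex(chunkFor(rowGroup, column));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }

      @Override
      public OffsetIndex getOffsetIndex(ColumnPath column) {
        try {
          return reader.readOffsetIndex(chunkFor(rowGroup, column));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      }
    };

    Set<ColumnPath> paths = rowGroup.getColumns().stream()
        .map(ColumnChunkMetaData::getPath)
        .collect(Collectors.toSet());

    // The rewritten filter from step 1 would produce the FilterPredicate passed in here.
    return ColumnIndexFilter.calculateRowRanges(
        FilterCompat.get(predicate), store, paths, rowGroup.getRowCount());
  }

  private static ColumnChunkMetaData chunkFor(BlockMetaData rowGroup, ColumnPath column) {
    return rowGroup.getColumns().stream()
        .filter(chunk -> chunk.getPath().equals(column))
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No column chunk for " + column));
  }
}
```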
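For comparison, the reuse path is much shorter: enable the column index filter through ParquetReadOptions and let readNextFilteredRowGroup() handle steps 2-5 internally. Predicates Parquet cannot express (startsWith, in) would simply not be pushed down to the page level. The file and predicate below are placeholders:

```java
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.InputFile;

class ReuseSketch {

  // Reads a file row group by row group, letting Parquet apply the column index
  // filter internally instead of reimplementing that logic in Iceberg.
  static void readFiltered(InputFile file) throws Exception {
    // Placeholder predicate; in Iceberg this would come from the converted expression.
    FilterPredicate predicate = FilterApi.eq(FilterApi.intColumn("id"), 42);

    ParquetReadOptions options = ParquetReadOptions.builder()
        .withRecordFilter(FilterCompat.get(predicate))
        .useColumnIndexFilter(true)
        .build();

    try (ParquetFileReader reader = ParquetFileReader.open(file, options)) {
      PageReadStore pages;
      while ((pages = reader.readNextFilteredRowGroup()) != null) {
        // The returned PageReadStore is Parquet's internal ColumnChunkPageReadStore;
        // its row count reflects only the rows that survived the column index filter.
        long filteredRows = pages.getRowCount();
        // ... hand pages to the record reader ...
      }
    }
  }
}
```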
