[GitHub] [iceberg] rdblue commented on pull request #1566: Parquet: Support Page Skipping in Iceberg Parquet Reader

GitBox Tue, 03 Nov 2020 11:02:00 -0800


rdblue commented on pull request #1566:
URL: https://github.com/apache/iceberg/pull/1566#issuecomment-721317727



   There are a few reasons why Iceberg reimplemented the filters.
   
   * Faster iteration and more features without needing Parquet releases
   * Parquet's filter API has some problems
     * Evaluation is negated (`canDrop` vs `shouldRead`), which has led to more 
bugs
     * It lacks some predicates that we need good support for, like 
`startsWith`, `in`, `alwaysTrue`, and `alwaysFalse`
     * The API is very difficult to work with
   * Iceberg replaces record materialization, so we would need to run these 
filters from Iceberg code anyway
   * Iceberg had already implemented similar filters, like stats evaluation, so 
it was simple to reuse that code
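
To illustrate the negation point above, here is a minimal, self-contained sketch. Every name in it (`PageStats`, `NegationSketch`, and the method names) is hypothetical and chosen for illustration; none of this is Iceberg or Parquet code, and it is not a claim about which specific bugs occurred. It only shows one way the two framings differ: with a `shouldRead`-style result, `AND` and `OR` compose directly, while with a `canDrop`-style result the operators must be swapped (De Morgan), which is easy to get wrong.

```java
// Hypothetical min/max statistics for one integer column in one page.
// Not an Iceberg or Parquet type.
final class PageStats {
  final int min;
  final int max;

  PageStats(int min, int max) {
    this.min = min;
    this.max = max;
  }
}

// Hypothetical evaluator for predicates of the form `column == value`.
final class NegationSketch {
  private NegationSketch() {}

  // Positive framing: true means the page might contain a matching row.
  static boolean shouldRead(PageStats stats, int value) {
    return stats.min <= value && value <= stats.max;
  }

  // Negated framing: true means the page provably has no matching row.
  static boolean canDrop(PageStats stats, int value) {
    return value < stats.min || value > stats.max;
  }

  // (a == x) AND (b == y): read only if both sides might match.
  static boolean shouldReadAnd(PageStats a, int x, PageStats b, int y) {
    return shouldRead(a, x) && shouldRead(b, y);
  }

  // (a == x) OR (b == y): read if either side might match.
  static boolean shouldReadOr(PageStats a, int x, PageStats b, int y) {
    return shouldRead(a, x) || shouldRead(b, y);
  }

  // Negated AND: drop if EITHER side can be dropped (note the ||).
  static boolean canDropAnd(PageStats a, int x, PageStats b, int y) {
    return canDrop(a, x) || canDrop(b, y);
  }

  // Negated OR: drop only if BOTH sides can be dropped (note the &&).
  static boolean canDropOr(PageStats a, int x, PageStats b, int y) {
    return canDrop(a, x) && canDrop(b, y);
  }
}
```

In this sketch, `canDropAnd` is always the logical negation of `shouldReadAnd`, and likewise for the `OR` pair. The positive form keeps each method's operator the same as the predicate it evaluates.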
   
   To fix some of the issues with the Parquet API, my hope was that eventually 
Parquet would use Iceberg's expression API and filters in place of its own. 
We'd need to refactor a bit to make this happen, but I think it would still be 
a good option. There are several things that I think would be great to 
standardize across some of the storage projects, like the `FileIO` classes and 
the expressions.


