wgtmac commented on issue #37559:
URL: https://github.com/apache/arrow/issues/37559#issuecomment-1710967508

   > I would really like to try to decouple Arrow types from the core parquet 
implementation if possible, but this might be too large of a hurdle. There is 
already a callback at the page level that doesn't Arrow can make use of for 
page page level statistics. Figuring out a similar mechanism for the index 
structure makes sense to me.
   > 
   > This only gets to statistics pruning, as noted by the original request we 
likely need a more advanced API to cover lower level filtering.
   
   I have thought the same thing. But it requires a brand-new expression 
interface that deals with parquet physical and logical types. I have 
implemented a native expression interface to the Apache ORC C++ library, so I 
know how much effort it takes.
   
   In our in-house platform, we have ended up with a solution pretty much 
similar to the implementation in the parquet-mr:
   - Use our internal expression and data type to evaluate predicates on 
parquet page index and then return the results in the form of a selected row 
range list for each row group.
   - Modify the parquet row group reader to accept the selected row range list 
and return records that fall into the ranges. In other words, we can push down 
a list of row range to the reader to avoid unnecessary I/O, decompression and 
decoding of filtered pages.
   
   So my proposal is to use a list of selected row ranges to be the adapter 
between parquet layer and arrow layer to implement the predicate push down:
   - In the parquet-cpp library: Support pushing down a list of row ranges to 
the parquet file reader or row group reader. Then the reader should only return 
records that match the row ranges. Reading values via both arrow and non-arrow 
columnar layout are supported.
   - In the arrow layer: Use the arrow expression library for predicate 
evaluation on parquet page index and return the result in the form of a 
selected row ranges in the file/row group.
   
   The proposal above aligns with the idea presented by the paper. We can use a 
selection bitmap to represent the list of selected row ranges. Then the idea to 
apply BMI to push down selection can be integrated seamlessly.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to