wgtmac commented on issue #37559: URL: https://github.com/apache/arrow/issues/37559#issuecomment-1710967508
> I would really like to try to decouple Arrow types from the core parquet implementation if possible, but this might be too large of a hurdle. There is already a callback at the page level that doesn't Arrow can make use of for page page level statistics. Figuring out a similar mechanism for the index structure makes sense to me. > > This only gets to statistics pruning, as noted by the original request we likely need a more advanced API to cover lower level filtering. I have thought the same thing. But it requires a brand-new expression interface that deals with parquet physical and logical types. I have implemented a native expression interface to the Apache ORC C++ library, so I know how much effort it takes. In our in-house platform, we have ended up with a solution pretty much similar to the implementation in the parquet-mr: - Use our internal expression and data type to evaluate predicates on parquet page index and then return the results in the form of a selected row range list for each row group. - Modify the parquet row group reader to accept the selected row range list and return records that fall into the ranges. In other words, we can push down a list of row range to the reader to avoid unnecessary I/O, decompression and decoding of filtered pages. So my proposal is to use a list of selected row ranges to be the adapter between parquet layer and arrow layer to implement the predicate push down: - In the parquet-cpp library: Support pushing down a list of row ranges to the parquet file reader or row group reader. Then the reader should only return records that match the row ranges. Reading values via both arrow and non-arrow columnar layout are supported. - In the arrow layer: Use the arrow expression library for predicate evaluation on parquet page index and return the result in the form of a selected row ranges in the file/row group. The proposal above aligns with the idea presented by the paper. We can use a selection bitmap to represent the list of selected row ranges. Then the idea to apply BMI to push down selection can be integrated seamlessly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
