emkornfield commented on issue #37559: URL: https://github.com/apache/arrow/issues/37559#issuecomment-1711086509
> I have thought the same thing. But it requires a brand-new expression interface that deals with parquet physical and logical types. I have implemented a native expression interface to the Apache ORC C++ library, so I know how much effort it takes. > > In our in-house platform, we have ended up with a solution pretty much similar to the implementation in the parquet-mr: > > * Use our internal expression and data type to evaluate predicates on parquet page index and then return the results in the form of a selected row range list for each row group. > * Modify the parquet row group reader to accept the selected row range list and return records that fall into the ranges. In other words, we can push down a list of row range to the reader to avoid unnecessary I/O, decompression and decoding of filtered pages. > > So my proposal is to use a list of selected row ranges to be the adapter between parquet layer and arrow layer to implement the predicate push down: > > * In the parquet-cpp library: Support pushing down a list of row ranges to the parquet file reader or row group reader. Then the reader should only return records that match the row ranges. Reading values via both arrow and non-arrow columnar layout are supported. Agreed, I think this should be achievable with the [callback](https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.h#L150) that was added, or at least we can maybe extend it? > * In the arrow layer: Use the arrow expression library for predicate evaluation on parquet page index and return the result in the form of a selected row ranges in the file/row group. Agreed this is a first good step and I think yields the decoupling I was looking for. As long as we are reading arrow data, it makes sense to use the arrow expression library, but my aim was to not intermingle arrow types more directly with the core parquet reader. > > The proposal above aligns with the idea presented by the paper. We can use a selection bitmap to represent the list of selected row ranges. Then the idea to apply BMI to push down selection can be integrated seamlessly. I think the proposal this provides a good first pass. IIUC correctly we probably want to take this further at some point to have a callback that can be presented with encoded (or at least an intermediate form) of data where it makes sense to apply the filter (some forms might be more useful then others) to get back the row set. @mapleFU haven't given thoughts to specifics beyond this. I think the net new item is an interface for getting presented with encoded parquet data. As noted above at a pseudo-code level, we'd probably needs to be something that can be integrated into the decoders directly that will continue to emit values, but also record filtered ranges. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
