westonpace commented on issue #37559: URL: https://github.com/apache/arrow/issues/37559#issuecomment-1737683805
Thanks for the summary document @mapleFU. This all sounds pretty cool to me.

> I think we have decisions to make in multiple dimensions: 1) Filter expression/parsing 2) Filter pushdown 3) Filter evaluation 4) Types of filters to support (equality, range, etc). It would be nice to call these out to make it clear.

I agree, there is a lot to figure out here. However, I do think arrow-cpp's compute module has a lot of the pieces that are needed already, and I think parquet-cpp already depends on the compute module.

If you want to keep depending on arrow-cpp for compute then I think page filtering with statistics should be straightforward (can mostly copy what is in datasets). The rest would be more effort.

It sounds like there is some plan to use selection vectors (which makes a lot of sense) but arrow compute doesn't use selection vectors today. However, there are a lot of really good bitmap utilities in arrow-cpp. I think anyone wanting to implement selection vectors should probably review what's available there first. Also, https://github.com/RoaringBitmap/CRoaring might be an interesting concept to read up on when it comes to selection vectors.

Maybe a good way to get started on this work would be to create some microbenchmarks that do not currently perform well in parquet-cpp?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
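To make the "page filtering with statistics" idea from the comment above a bit more concrete, here is a minimal sketch of min/max page pruning for a range filter. It deliberately does not use the arrow-cpp or parquet-cpp APIs: `PageStats`, `RangeFilter`, and `SelectPages` are hypothetical names invented for illustration, and a plain `std::vector<bool>` stands in for whatever bitmap or selection-vector representation would actually be chosen.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics (not a parquet-cpp type).
struct PageStats {
  int64_t min;  // smallest value stored in the page
  int64_t max;  // largest value stored in the page
};

// Hypothetical inclusive range filter: lo <= value <= hi.
struct RangeFilter {
  int64_t lo;
  int64_t hi;
};

// Returns one flag per page. A flag is true when the page's [min, max]
// range overlaps the filter range, so the page may contain matching rows
// and still has to be read. A flag is false when the statistics alone
// prove that no row in the page can match, so the page can be skipped.
std::vector<bool> SelectPages(const std::vector<PageStats>& pages,
                              const RangeFilter& filter) {
  std::vector<bool> keep;
  keep.reserve(pages.size());
  for (const PageStats& page : pages) {
    keep.push_back(page.max >= filter.lo && page.min <= filter.hi);
  }
  return keep;
}
```

For example, with three pages covering [0, 9], [10, 19], and [20, 29] and a filter of 12 to 15, only the second page is kept. Rows inside a kept page still need to be evaluated individually; the statistics only rule pages out.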
