westonpace commented on issue #37559:
URL: https://github.com/apache/arrow/issues/37559#issuecomment-1737683805

   Thanks for the summary document, @mapleFU.  This all sounds pretty cool to 
me.
   
   > I think we have decisions to make in multiple dimensions: 1) Filter 
expression/parsing 2) Filter pushdown 3) Filter evaluation 4) Types of filters 
to support (equality, range, etc). It would be nice to call these out to make 
it clear.
   
   I agree, there is a lot to figure out here.  However, I do think arrow-cpp's 
compute module already has a lot of the pieces that are needed, and I think 
parquet-cpp already depends on the compute module.  If you want to keep 
depending on arrow-cpp for compute, then I think page filtering with statistics 
should be straightforward (it can mostly copy what the datasets module does).
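
   To make the statistics idea concrete, here is a minimal standalone sketch of 
min/max page pruning for a range predicate.  It does not use any arrow-cpp or 
parquet-cpp API: `PageStats`, `RangePredicate`, `PageMayMatch` and `SelectPages` 
are hypothetical names invented for illustration, and the statistics are 
assumed to be plain int64 min/max values with no nulls.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-page statistics (assumption: int64 column, no nulls).
struct PageStats {
  int64_t min;
  int64_t max;
};

// A closed range predicate: lo <= value <= hi.
struct RangePredicate {
  int64_t lo;
  int64_t hi;
};

// True if the page could contain a matching value.
// False means the page can be skipped without decoding it.
inline bool PageMayMatch(const PageStats& stats, const RangePredicate& pred) {
  return stats.max >= pred.lo && stats.min <= pred.hi;
}

// Returns the indices of the pages that still need to be decoded.
inline std::vector<std::size_t> SelectPages(const std::vector<PageStats>& pages,
                                            const RangePredicate& pred) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    if (PageMayMatch(pages[i], pred)) keep.push_back(i);
  }
  return keep;
}
```

   Note that the statistics can only rule pages out.  A page that survives may 
still contain no matching rows, so the predicate has to be evaluated again on 
the decoded values.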
   
   The rest would be more effort.  It sounds like there is some plan to use 
selection vectors (which makes a lot of sense) but arrow compute doesn't use 
selection vectors today.  However, there are a lot of really good bitmap 
utilities in arrow-cpp.  I think anyone wanting to implement selection vectors 
should probably review what's available there first.
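
   As a rough illustration of how the two representations relate, the sketch 
below converts between a filter bitmap and a selection vector of row indices.  
It is standalone C++ rather than arrow-cpp code: `BitmapToSelection` and 
`SelectionToBitmap` are hypothetical helpers, and the only thing borrowed from 
Arrow is the least-significant-bit-first layout of its bitmaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitmap form: bit i is set when row i passes the filter.
// Bits are numbered least-significant-bit first within each byte.
// Assumes bitmap holds at least (num_rows + 7) / 8 bytes.
inline std::vector<uint32_t> BitmapToSelection(const std::vector<uint8_t>& bitmap,
                                               std::size_t num_rows) {
  std::vector<uint32_t> selection;
  for (std::size_t i = 0; i < num_rows; ++i) {
    if ((bitmap[i / 8] >> (i % 8)) & 1) {
      selection.push_back(static_cast<uint32_t>(i));
    }
  }
  return selection;
}

// Selection-vector form: the indices of the rows that pass the filter.
// Assumes every index in selection is smaller than num_rows.
inline std::vector<uint8_t> SelectionToBitmap(const std::vector<uint32_t>& selection,
                                              std::size_t num_rows) {
  std::vector<uint8_t> bitmap((num_rows + 7) / 8, 0);
  for (uint32_t i : selection) {
    bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
  }
  return bitmap;
}
```

   The trade-off this is meant to show: a bitmap costs one bit per row however 
many rows pass, while a selection vector costs one index per passing row, so 
it is only smaller when the filter is selective.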
   
   Also, https://github.com/RoaringBitmap/CRoaring (a C implementation of 
Roaring compressed bitmaps) might be worth reading up on when it comes to 
selection vectors.
   
   Maybe a good way to get started on this work would be to create some 
microbenchmarks that do not currently perform well in parquet-cpp?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
