goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2848568844
Thanks @2010YOUY01 for the suggestions.

> 1. Use the term `(selection) bitmap` instead of `selection vector` to avoid confusion. I believe `selection vector` commonly refers (in several recent papers) to vectors of valid indices like `[1, 3, 9, ...]`. See https://db.cs.cmu.edu/papers/2021/ngom-damon2021.pdf

Agreed. `selection bitmap` is a better name for the field. We can use `array.as_boolean().values().set_indices()` to get the true selection vector from a column. I did it in my implementation for hash-aggregation: https://github.com/goldmedal/datafusion/pull/4#discussion_r2051711975

> 2. Perhaps after this PR, and before implementing any optimization using the selection bitmap, we should first extend `ExecutionPlan` to include related properties like `handles_filtered_input()` and `output_filtered_batches()`, to indicate whether an operator can process batches with metadata filter columns (bitmap/selection vector). We should also add an optimizer pass to validate that if an operator outputs filtered batches, its downstream operator must be able to handle them.

Providing the API to check the selection bitmap makes sense to me 👍. I'll consider how to do it.
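To make the two points above concrete, here is a minimal std-only Rust sketch. It is not DataFusion code. `selection_vector` is a plain-slice stand-in for what `set_indices()` yields on an Arrow boolean buffer. `OperatorCaps` and `validate_pipeline` are hypothetical names that only borrow the proposed `handles_filtered_input` / `output_filtered_batches` properties, and the sketch assumes a linear pipeline ordered from source to sink rather than a real `ExecutionPlan` tree.

```rust
/// Convert a selection bitmap (one flag per row) into a selection vector
/// (the indices of the rows that are kept).
/// Std-only stand-in for iterating the set bits of an Arrow boolean buffer.
pub fn selection_vector(bitmap: &[bool]) -> Vec<usize> {
    bitmap
        .iter()
        .enumerate()
        .filter_map(|(i, &keep)| if keep { Some(i) } else { None })
        .collect()
}

/// Hypothetical capability flags, named after the properties proposed in
/// the review. These are assumptions for illustration, not an existing API.
#[derive(Debug, Clone, Copy)]
pub struct OperatorCaps {
    pub name: &'static str,
    /// Can this operator consume batches that carry a selection bitmap?
    pub handles_filtered_input: bool,
    /// May this operator emit batches that carry a selection bitmap?
    pub output_filtered_batches: bool,
}

/// Validate a linear pipeline (source first, sink last): every operator
/// that outputs filtered batches must be followed by one that handles them.
pub fn validate_pipeline(ops: &[OperatorCaps]) -> Result<(), String> {
    for pair in ops.windows(2) {
        let (upstream, downstream) = (pair[0], pair[1]);
        if upstream.output_filtered_batches && !downstream.handles_filtered_input {
            return Err(format!(
                "`{}` outputs filtered batches but `{}` cannot handle them",
                upstream.name, downstream.name
            ));
        }
    }
    Ok(())
}
```

For example, `selection_vector(&[false, true, false, true])` returns `vec![1, 3]`. A filter that emits a bitmap followed directly by an operator with `handles_filtered_input: false` makes `validate_pipeline` return an `Err`. A real optimizer pass would walk the plan tree and would also need to decide whether to fail or to insert a step that materializes the filter.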