alamb commented on issue #5944: URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502083042
> One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions. I think this basic idea is called a "selection vector" in the literature -- and as you hint at, it is not quite the same as the null mask as it has different semantics. One approach might be to add another enum type to `ColumnarValue` that had an additional validity mask https://github.com/apache/arrow-datafusion/blob/bbc71692fcd8dd9f3a9686162e59d092b37031f2/datafusion/expr/src/columnar_value.rs#L33 After @tustvold 's recent work in Arrow, I think this would just be a https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html and should be straightforward to use. To really take advantage of a selection vector, however, the underlying compute kernels need to be updated to know how to ignore the selection vectors (and likely only do so when they are sparse) > Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate. While not exactly the same, @yjshen 's has been workking to add filtering to the aggregate input here, which is similar: https://github.com/apache/arrow-datafusion/pull/5868 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
