Kontinuation commented on code in PR #563:
URL: https://github.com/apache/sedona-db/pull/563#discussion_r2757128734
##########
rust/sedona-spatial-join/src/utils/join_utils.rs:
##########
@@ -93,6 +115,161 @@ pub(crate) fn get_final_indices_from_bit_map(
(left_indices, right_indices)
}
+pub(crate) fn adjust_indices_with_visited_info(
+ left_indices: UInt64Array,
+ right_indices: UInt32Array,
+ adjust_range: Range<usize>,
+ join_type: JoinType,
+ preserve_order_for_right: bool,
+ visited_info: Option<(&mut BooleanBufferBuilder, usize)>,
+ produce_unmatched_probe_rows: bool,
+) -> Result<(UInt64Array, UInt32Array)> {
+ let Some((bitmap, offset)) = visited_info else {
+ return adjust_indices_by_join_type(
+ left_indices,
+ right_indices,
+ adjust_range,
+ join_type,
+ preserve_order_for_right,
+ );
+ };
+
+ // Update the bitmap with the current matches first
+ for idx in right_indices.values() {
+ bitmap.set_bit(offset + (*idx as usize), true);
+ }
+
+ match join_type {
+ JoinType::Right | JoinType::Full => {
+ if !produce_unmatched_probe_rows {
+ Ok((left_indices, right_indices))
+ } else {
+ let unmatched_count = adjust_range
+ .clone()
+ .filter(|&i| !bitmap.get_bit(i + offset))
+ .count();
Review Comment:
Unfortunately, Arrow's BooleanBufferBuilder does not provide optimized
methods for iterating over bit ranges. Other join_utils code inherited from
DataFusion does the same thing, so I stuck with BooleanBufferBuilder for the
visited bitset to stay consistent with the rest of the code.
It didn't show up as a performance bottleneck when running outer joins
before, perhaps because the other parts of the join are far more heavyweight
than bitmap traversal.
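For reference, here is a rough sketch of what a word-at-a-time count could
look like if this ever does show up in a profile. It assumes a hypothetical
Vec<u64>-backed bitset (VisitedBits and its methods are made up for
illustration and are not Arrow or DataFusion APIs). count_unset_naive mirrors
the per-bit filter/count in the diff above, and count_unset_words does the
same job with one popcount per 64-bit word.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the visited bitset; NOT Arrow's BooleanBufferBuilder.
struct VisitedBits {
    words: Vec<u64>,
    len: usize,
}

impl VisitedBits {
    fn new(len: usize) -> Self {
        Self { words: vec![0; (len + 63) / 64], len }
    }

    fn set_bit(&mut self, i: usize) {
        assert!(i < self.len);
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    fn get_bit(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Per-bit count of unset bits, equivalent to the filter/count in the diff.
    fn count_unset_naive(&self, range: Range<usize>) -> usize {
        range.filter(|&i| !self.get_bit(i)).count()
    }

    /// Word-at-a-time count of unset bits: mask off the bits outside the
    /// range in the first and last word, then popcount each word.
    fn count_unset_words(&self, range: Range<usize>) -> usize {
        assert!(range.end <= self.len);
        if range.start >= range.end {
            return 0;
        }
        let first = range.start / 64;
        let last = (range.end - 1) / 64;
        let mut set = 0usize;
        for w in first..=last {
            let mut word = self.words[w];
            if w == first {
                // Clear bits below range.start within the first word.
                word &= u64::MAX << (range.start % 64);
            }
            if w == last {
                // Clear bits at or above range.end within the last word.
                let used = range.end - w * 64; // always in 1..=64
                if used < 64 {
                    word &= (1u64 << used) - 1;
                }
            }
            set += word.count_ones() as usize;
        }
        (range.end - range.start) - set
    }
}
```

Both methods should return the same count for any in-bounds range; the
word-based one touches each 64-bit word once instead of testing each bit.
Whether that matters in practice depends on how large the visited range is
relative to the rest of the join work.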