zeroshade commented on code in PR #1053:
URL: https://github.com/apache/iceberg-go/pull/1053#discussion_r3255923256
##########
table/dv/roaring_bitmap.go:
##########
@@ -91,6 +95,37 @@ func (b *RoaringPositionBitmap) Cardinality() int64 {
return c
}
+// Iter yields the set positions in ascending order. Lazy — preferred over
+// ToArray when the caller can consume positions one-at-a-time (e.g. feeding
+// an Arrow builder), since a DV bitmap can hold millions of positions.
+//
+// The bitmap must not be modified while an iterator is running: the closure
+// captures b by pointer and the underlying roaring.Bitmap library does not
+// support concurrent Set/Iterator. Single-goroutine consumption is the only
+// supported pattern.
+func (b *RoaringPositionBitmap) Iter() iter.Seq[uint64] {
Review Comment:
With the shift to using bitmaps directly, do we still need this? Is it
unused? Same for ToArray?
##########
table/arrow_scanner.go:
##########
@@ -193,6 +323,44 @@ func processPositionalDeletes(ctx context.Context, deletes
set[int64]) recProces
}
}
+// filterByDeletionVector returns a pipeline step that drops rows present in
+// the bitmap by building a per-batch Arrow Boolean keep-mask and applying
+// compute.Filter. Preferred over processPositionalDeletes for DV-sourced
+// deletes because RoaringPositionBitmap.Contains is O(1) and Filter with a
+// bit-packed boolean mask is more vectorized than Take with an int64 index
+// array. Tracks absolute row position across batches via a closure-captured
+// counter, identical pattern to processPositionalDeletes.
+func filterByDeletionVector(ctx context.Context, bitmap
*dv.RoaringPositionBitmap) recProcessFn {
+ nextIdx, mem := int64(0), compute.GetAllocator(ctx)
+
+ return func(r arrow.RecordBatch) (arrow.RecordBatch, error) {
+ defer r.Release()
+
+ currentIdx := nextIdx
+ nextIdx += r.NumRows()
+
+ maskBuilder := array.NewBooleanBuilder(mem)
+ defer maskBuilder.Release()
+ maskBuilder.Reserve(int(r.NumRows()))
+ for i := int64(0); i < r.NumRows(); i++ {
Review Comment:
Is there no more efficient way to get a standard bitmap from a roaring
bitmap? Looks like you could call ToDense on the roaring bitmap once to convert
it to a regular bitmap and create an array.Boolean, then just use slices of
that array (`array.NewSlice(bm, currentIdx, currentIdx+r.NumRows())`) to pass
into the Call to filter.
Would simplify this and would reduce the allocations
##########
table/arrow_scanner.go:
##########
@@ -193,6 +323,44 @@ func processPositionalDeletes(ctx context.Context, deletes
set[int64]) recProces
}
}
+// filterByDeletionVector returns a pipeline step that drops rows present in
+// the bitmap by building a per-batch Arrow Boolean keep-mask and applying
+// compute.Filter. Preferred over processPositionalDeletes for DV-sourced
+// deletes because RoaringPositionBitmap.Contains is O(1) and Filter with a
+// bit-packed boolean mask is more vectorized than Take with an int64 index
+// array. Tracks absolute row position across batches via a closure-captured
+// counter, identical pattern to processPositionalDeletes.
+func filterByDeletionVector(ctx context.Context, bitmap
*dv.RoaringPositionBitmap) recProcessFn {
+ nextIdx, mem := int64(0), compute.GetAllocator(ctx)
+
+ return func(r arrow.RecordBatch) (arrow.RecordBatch, error) {
+ defer r.Release()
+
+ currentIdx := nextIdx
+ nextIdx += r.NumRows()
+
+ maskBuilder := array.NewBooleanBuilder(mem)
+ defer maskBuilder.Release()
+ maskBuilder.Reserve(int(r.NumRows()))
+ for i := int64(0); i < r.NumRows(); i++ {
+ // mask[i] = keep row i? → row i of the batch is at
absolute
+ // position currentIdx+i in the source file; keep if
NOT in
+ // the deletion vector.
+ maskBuilder.Append(!bitmap.Contains(uint64(currentIdx +
i)))
+ }
+ mask := maskBuilder.NewBooleanArray()
+ defer mask.Release()
+
+ out, err := compute.Filter(ctx,
compute.NewDatumWithoutOwning(r),
Review Comment:
Can simplify with
https://pkg.go.dev/github.com/apache/arrow-go/[email protected]/arrow/compute#FilterRecordBatch?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]