zeroshade commented on code in PR #1053:
URL: https://github.com/apache/iceberg-go/pull/1053#discussion_r3255923256


##########
table/dv/roaring_bitmap.go:
##########
@@ -91,6 +95,37 @@ func (b *RoaringPositionBitmap) Cardinality() int64 {
        return c
 }
 
+// Iter yields the set positions in ascending order. Lazy — preferred over
+// ToArray when the caller can consume positions one-at-a-time (e.g. feeding
+// an Arrow builder), since a DV bitmap can hold millions of positions.
+//
+// The bitmap must not be modified while an iterator is running: the closure
+// captures b by pointer and the underlying roaring.Bitmap library does not
+// support concurrent Set/Iterator. Single-goroutine consumption is the only
+// supported pattern.
+func (b *RoaringPositionBitmap) Iter() iter.Seq[uint64] {

Review Comment:
   With the shift to using bitmaps directly, do we still need this? Is it 
unused? Same for ToArray?



##########
table/arrow_scanner.go:
##########
@@ -193,6 +323,44 @@ func processPositionalDeletes(ctx context.Context, deletes 
set[int64]) recProces
        }
 }
 
+// filterByDeletionVector returns a pipeline step that drops rows present in
+// the bitmap by building a per-batch Arrow Boolean keep-mask and applying
+// compute.Filter. Preferred over processPositionalDeletes for DV-sourced
+// deletes because RoaringPositionBitmap.Contains is O(1) and Filter with a
+// bit-packed boolean mask is more vectorized than Take with an int64 index
+// array. Tracks absolute row position across batches via a closure-captured
+// counter, identical pattern to processPositionalDeletes.
+func filterByDeletionVector(ctx context.Context, bitmap 
*dv.RoaringPositionBitmap) recProcessFn {
+       nextIdx, mem := int64(0), compute.GetAllocator(ctx)
+
+       return func(r arrow.RecordBatch) (arrow.RecordBatch, error) {
+               defer r.Release()
+
+               currentIdx := nextIdx
+               nextIdx += r.NumRows()
+
+               maskBuilder := array.NewBooleanBuilder(mem)
+               defer maskBuilder.Release()
+               maskBuilder.Reserve(int(r.NumRows()))
+               for i := int64(0); i < r.NumRows(); i++ {

Review Comment:
   Is there no more efficient way to get a standard bitmap from a roaring 
bitmap? Looks like you could call ToDense on the roaring bitmap once to convert 
it to a regular bitmap and create an array.Boolean, then just use slices of 
that array (`array.NewSlice(bm, currentIdx, currentIdx+r.NumRows())`) to pass 
into the Call to filter. 
   
   Would simplify this and would reduce the allocations 



##########
table/arrow_scanner.go:
##########
@@ -193,6 +323,44 @@ func processPositionalDeletes(ctx context.Context, deletes 
set[int64]) recProces
        }
 }
 
+// filterByDeletionVector returns a pipeline step that drops rows present in
+// the bitmap by building a per-batch Arrow Boolean keep-mask and applying
+// compute.Filter. Preferred over processPositionalDeletes for DV-sourced
+// deletes because RoaringPositionBitmap.Contains is O(1) and Filter with a
+// bit-packed boolean mask is more vectorized than Take with an int64 index
+// array. Tracks absolute row position across batches via a closure-captured
+// counter, identical pattern to processPositionalDeletes.
+func filterByDeletionVector(ctx context.Context, bitmap 
*dv.RoaringPositionBitmap) recProcessFn {
+       nextIdx, mem := int64(0), compute.GetAllocator(ctx)
+
+       return func(r arrow.RecordBatch) (arrow.RecordBatch, error) {
+               defer r.Release()
+
+               currentIdx := nextIdx
+               nextIdx += r.NumRows()
+
+               maskBuilder := array.NewBooleanBuilder(mem)
+               defer maskBuilder.Release()
+               maskBuilder.Reserve(int(r.NumRows()))
+               for i := int64(0); i < r.NumRows(); i++ {
+                       // mask[i] = keep row i? → row i of the batch is at 
absolute
+                       // position currentIdx+i in the source file; keep if 
NOT in
+                       // the deletion vector.
+                       maskBuilder.Append(!bitmap.Contains(uint64(currentIdx + 
i)))
+               }
+               mask := maskBuilder.NewBooleanArray()
+               defer mask.Release()
+
+               out, err := compute.Filter(ctx, 
compute.NewDatumWithoutOwning(r),

Review Comment:
   Can simplify with 
https://pkg.go.dev/github.com/apache/arrow-go/[email protected]/arrow/compute#FilterRecordBatch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to