aokolnychyi commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r733183571



##########
File path: core/src/main/java/org/apache/iceberg/deletes/Deletes.java
##########
@@ -107,6 +108,29 @@ public static StructLikeSet 
toEqualitySet(CloseableIterable<StructLike> eqDelete
     }
   }
 
+  public static Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+      CloseableIterable<? extends StructLike> deleteFile) {
+    return toPositionBitMap(dataLocation, ImmutableList.of(deleteFile));
+  }
+
+  public static <T extends StructLike> Roaring64Bitmap 
toPositionBitMap(CharSequence dataLocation,
+      List<CloseableIterable<T>> deleteFiles) {
+    DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
+    List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles, 
deletes ->
+        CloseableIterable.transform(locationFilter.filter(deletes), row -> 
(Long) POSITION_ACCESSOR.get(row)));
+    return toPositionBitMap(CloseableIterable.concat(positions));
+  }
+
+  public static Roaring64Bitmap toPositionBitMap(CloseableIterable<Long> 
posDeletes) {
+    try (CloseableIterable<Long> deletes = posDeletes) {
+      Roaring64Bitmap bitmap = new Roaring64Bitmap();

Review comment:
       Okay, I spent some time reading/thinking about it.
   
   Using a set of positions is a simple solution and would work fine for a 
small number of deletes. If the number of deletes is substantial, we may want 
to consider a bitmap. However, we have to use a bitmap that handles sparse 
datasets. In that case, Java `BitSet` would be perform terribly, most likely 
worse than a hash set of positions. According to what I read about 
`RoaringBitmap`, it does handle sparse datasets much better than Java `BitSet` 
but it can still occupy more memory than a simple hash set if values are indeed 
sparse.
   
   The question is whether we are optimizing too much here. I'd do a quick 
benchmark and see if `RoaringBitmap` needs a lot of memory for sparse datasets 
when we have a couple of deletes at the beginning and end of range. If it is 
not bad, I'd use it all the time as we know it will be beneficial if we have a 
substantial number of deletes.
   
   Druid also uses `RoaringBitmapWriter` to construct a bitmap. It is supposed 
to be more efficient than setting bit by bit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to