flyrain commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r733910082



##########
File path: core/src/main/java/org/apache/iceberg/deletes/Deletes.java
##########
@@ -107,6 +108,29 @@ public static StructLikeSet 
toEqualitySet(CloseableIterable<StructLike> eqDelete
     }
   }
 
+  public static Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+      CloseableIterable<? extends StructLike> deleteFile) {
+    return toPositionBitMap(dataLocation, ImmutableList.of(deleteFile));
+  }
+
+  public static <T extends StructLike> Roaring64Bitmap 
toPositionBitMap(CharSequence dataLocation,
+      List<CloseableIterable<T>> deleteFiles) {
+    DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
+    List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles, 
deletes ->
+        CloseableIterable.transform(locationFilter.filter(deletes), row -> 
(Long) POSITION_ACCESSOR.get(row)));
+    return toPositionBitMap(CloseableIterable.concat(positions));
+  }
+
+  public static Roaring64Bitmap toPositionBitMap(CloseableIterable<Long> 
posDeletes) {
+    try (CloseableIterable<Long> deletes = posDeletes) {
+      Roaring64Bitmap bitmap = new Roaring64Bitmap();

Review comment:
       Here is the benchmark I did. I generated random positions out of 1M 
rows, ranging from 10 positions to 500K positions(50% of the total num of 
rows). I used `SerializedSizeInBytes` to get the memory footprint of the 
bitmap. It's not perfect to show how much memory really used, but a good 
indicator. There'd be some extra index needed, which should be minor. To get 
better comparison, I also present the memory usage by using the `set<Long>`, to 
make it simple, I use the `Cardinality * length of a long(8 Bytes)`, extra 
overhead is needed for a `set<Long>`, but it doesn't affect our comparison. The 
results look promising, especially with bigger cardinalities, but even it is 
less than 1k, like 10 or 100. It is still good, at least not bad, the 
10-cardinality case is still OK considering these numbers are random across 0 
to 1M. In conclusion, we can get much better memory efficient by using the 
RoaringBitmap, and don't have to be limited by the num of deleted rows.
   <img width="1151" alt="Screen Shot 2021-10-21 at 10 03 18 AM" 
src="https://user-images.githubusercontent.com/1322359/138324201-6793b2bb-d6dc-4960-af92-51cf13d50475.png";>
   
   <img width="594" alt="Screen Shot 2021-10-21 at 10 03 43 AM" 
src="https://user-images.githubusercontent.com/1322359/138324261-c47d8935-1e34-45d7-a31b-e0e8069f8337.png";>
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to