flyrain commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r733910082



##########
File path: core/src/main/java/org/apache/iceberg/deletes/Deletes.java
##########
@@ -107,6 +108,29 @@ public static StructLikeSet toEqualitySet(CloseableIterable<StructLike> eqDelete
     }
   }
 
+  public static Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+      CloseableIterable<? extends StructLike> deleteFile) {
+    return toPositionBitMap(dataLocation, ImmutableList.of(deleteFile));
+  }
+
+  public static <T extends StructLike> Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+      List<CloseableIterable<T>> deleteFiles) {
+    DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
+    List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles, deletes ->
+        CloseableIterable.transform(locationFilter.filter(deletes), row -> (Long) POSITION_ACCESSOR.get(row)));
+    return toPositionBitMap(CloseableIterable.concat(positions));
+  }
+
+  public static Roaring64Bitmap toPositionBitMap(CloseableIterable<Long> posDeletes) {
+    try (CloseableIterable<Long> deletes = posDeletes) {
+      Roaring64Bitmap bitmap = new Roaring64Bitmap();

Review comment:
Here is the benchmark I did. I generated random positions out of 1M rows, with cardinalities ranging from 10 positions to 500K positions (50% of the total number of rows). I used `serializedSizeInBytes` to get the memory footprint of the bitmap. It doesn't show exactly how much memory is really used, since the bitmap needs some extra indexing, but that overhead is minor and the number is a good indicator. For comparison, I also list the memory usage of a `Set<Long>`; to keep it simple I estimate it as `cardinality * 8` (the length of a long). A real `Set<Long>` carries extra overhead, but that doesn't affect the comparison. The results look promising, especially at larger cardinalities. Even below 1K, e.g. 10 or 100 positions, the bitmap is still fine, or at least not worse; the 10-cardinality case is still OK considering the positions are random across 0 to 1M. In conclusion, the RoaringBitmap gives us much better memory efficiency, and we no longer have to be limited by the number of deleted rows. A minimal sketch of how the measurement can be reproduced follows the screenshots below.
   <img width="1151" alt="Screen Shot 2021-10-21 at 10 03 18 AM" 
src="https://user-images.githubusercontent.com/1322359/138324201-6793b2bb-d6dc-4960-af92-51cf13d50475.png";>
   
   <img width="594" alt="Screen Shot 2021-10-21 at 10 03 43 AM" 
src="https://user-images.githubusercontent.com/1322359/138324261-c47d8935-1e34-45d7-a31b-e0e8069f8337.png";>
   






