flyrain commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r733910082
##########
File path: core/src/main/java/org/apache/iceberg/deletes/Deletes.java
##########
@@ -107,6 +108,29 @@ public static StructLikeSet
toEqualitySet(CloseableIterable<StructLike> eqDelete
}
}
+ public static Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+ CloseableIterable<? extends StructLike> deleteFile) {
+ return toPositionBitMap(dataLocation, ImmutableList.of(deleteFile));
+ }
+
+ public static <T extends StructLike> Roaring64Bitmap
toPositionBitMap(CharSequence dataLocation,
+ List<CloseableIterable<T>> deleteFiles) {
+ DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
+ List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles,
deletes ->
+ CloseableIterable.transform(locationFilter.filter(deletes), row ->
(Long) POSITION_ACCESSOR.get(row)));
+ return toPositionBitMap(CloseableIterable.concat(positions));
+ }
+
+ public static Roaring64Bitmap toPositionBitMap(CloseableIterable<Long>
posDeletes) {
+ try (CloseableIterable<Long> deletes = posDeletes) {
+ Roaring64Bitmap bitmap = new Roaring64Bitmap();
Review comment:
Here is the benchmark I did. I generated random positions out of 1M
rows, ranging from 10 positions to 500K positions(50% of the total num of
rows). I used `SerializedSizeInBytes` to get the memory footprint of the
bitmap. It's not perfect to show how much memory really used, but a good
indicator. There'd be some extra index needed, which should be minor. To get
better comparison, I also present the memory usage by using the `set<Long>`, to
make it simple, I use the `Cardinality * length of a long(8 Bytes)`, extra
overhead is needed for a `set<Long>`, but it doesn't affect our comparison. The
results look promising, especially with bigger cardinalities, but even it is
less than 1k, like 10 or 100. It is still good, at least not bad, the
10-cardinality case is still OK considering these numbers are random across 0
to 1M. In conclusion, we can get much better memory efficient by using the
RoaringBitmap, and don't have to be limited by the num of deleted rows.
<img width="1151" alt="Screen Shot 2021-10-21 at 10 03 18 AM"
src="https://user-images.githubusercontent.com/1322359/138324201-6793b2bb-d6dc-4960-af92-51cf13d50475.png">
<img width="594" alt="Screen Shot 2021-10-21 at 10 03 43 AM"
src="https://user-images.githubusercontent.com/1322359/138324261-c47d8935-1e34-45d7-a31b-e0e8069f8337.png">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]