aokolnychyi commented on a change in pull request #3287:
URL: https://github.com/apache/iceberg/pull/3287#discussion_r733183571
##########
File path: core/src/main/java/org/apache/iceberg/deletes/Deletes.java
##########
@@ -107,6 +108,29 @@ public static StructLikeSet
toEqualitySet(CloseableIterable<StructLike> eqDelete
}
}
+ public static Roaring64Bitmap toPositionBitMap(CharSequence dataLocation,
+ CloseableIterable<? extends StructLike> deleteFile) {
+ return toPositionBitMap(dataLocation, ImmutableList.of(deleteFile));
+ }
+
+ public static <T extends StructLike> Roaring64Bitmap
toPositionBitMap(CharSequence dataLocation,
+ List<CloseableIterable<T>> deleteFiles) {
+ DataFileFilter<T> locationFilter = new DataFileFilter<>(dataLocation);
+ List<CloseableIterable<Long>> positions = Lists.transform(deleteFiles,
deletes ->
+ CloseableIterable.transform(locationFilter.filter(deletes), row ->
(Long) POSITION_ACCESSOR.get(row)));
+ return toPositionBitMap(CloseableIterable.concat(positions));
+ }
+
+ public static Roaring64Bitmap toPositionBitMap(CloseableIterable<Long>
posDeletes) {
+ try (CloseableIterable<Long> deletes = posDeletes) {
+ Roaring64Bitmap bitmap = new Roaring64Bitmap();
Review comment:
Okay, I spent some time reading/thinking about it.
Using a set of positions is a simple solution and would work fine for a
small number of deletes. If the number of deletes is substantial, we may want
to consider a bitmap. However, we have to use a bitmap that handles sparse
datasets. In that case, Java `BitSet` would be perform terribly, most likely
worse than a hash set of positions. According to what I read about
`RoaringBitmap`, it does handle sparse datasets much better than Java `BitSet`
but it can still occupy more memory than a simple hash set if values are indeed
sparse.
The question is whether we are optimizing too much here. I'd do a quick
benchmark and see if `RoaringBitmap` needs a lot of memory for sparse datasets
when we have a couple of deletes at the beginning and end of range. If it is
not bad, I'd use it all the time as we know it will be beneficial if we have a
substantial number of deletes.
Druid also uses `RoaringBitmapWriter` to construct a bitmap. It is supposed
to be more efficient than setting bit by bit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]