GitHub user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/9661#discussion_r44726575
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -176,15 +179,17 @@ private[spark] object HighlyCompressedMapStatus {
// From a compression standpoint, it shouldn't matter whether we track empty or non-empty
// blocks. From a performance standpoint, we benefit from tracking empty blocks because
// we expect that there will be far fewer of them, so we will perform fewer bitmap insertions.
+ val emptyBlocks = new RoaringBitmap()
+ val nonEmptyBlocks = new RoaringBitmap()
val totalNumBlocks = uncompressedSizes.length
- val emptyBlocks = new BitSet(totalNumBlocks)
while (i < totalNumBlocks) {
var size = uncompressedSizes(i)
if (size > 0) {
numNonEmptyBlocks += 1
+ nonEmptyBlocks.add(i)
totalSize += size
} else {
- emptyBlocks.set(i)
+ emptyBlocks.add(i)
}
--- End diff --
+1
Would this also eliminate the need to track both `emptyBlocks` and
`nonEmptyBlocks`? E.g., after `emptyBlocks.runOptimize`, would storing just
the empty blocks be as good as keeping both? Past this point we only care
about the memory used and the time it takes to call `contains`, since these
bitmaps are never modified afterwards.
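
To make the single-bitmap idea concrete, here is a minimal sketch, not the PR's code. It uses `java.util.BitSet` as a stand-in for `RoaringBitmap` so it runs without extra dependencies, and the names `SingleBitmapSketch`, `trackEmpty`, and `isNonEmpty` are hypothetical. It only illustrates that membership in the non-empty set can be derived from the empty set alone.

```scala
import java.util.BitSet

// Assumption: java.util.BitSet stands in for RoaringBitmap here.
// All names in this object are hypothetical, for illustration only.
object SingleBitmapSketch {

  /** Returns a bitmap whose set bits mark the empty blocks. */
  def trackEmpty(uncompressedSizes: Array[Long]): BitSet = {
    val emptyBlocks = new BitSet(uncompressedSizes.length)
    var i = 0
    while (i < uncompressedSizes.length) {
      // Same split as the diff: size > 0 is non-empty, anything else is empty.
      if (uncompressedSizes(i) <= 0) emptyBlocks.set(i)
      i += 1
    }
    emptyBlocks
  }

  /** A block is non-empty exactly when it is absent from the empty set. */
  def isNonEmpty(emptyBlocks: BitSet, blockId: Int): Boolean =
    !emptyBlocks.get(blockId)
}
```

With this shape, a second bitmap is only worth keeping if it ends up smaller or faster to query than the complement lookup.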
Does it also make sense to call `runOptimize` periodically while these are
being built, to avoid using too much memory? Say the upper end of the size of
these is ~100k entries. The worst case would then be storing 100k shorts, or
~200KB, before we call `runOptimize`. That isn't too bad, so I'm inclined to
keep things simple, but I thought it was worth considering now while we're
looking at this.
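
The back-of-the-envelope figure can be checked with a small sketch. It assumes, as the estimate does, that every entry is held as one uncompressed 16-bit value until `runOptimize` is called; the library's actual container layout may use less. The names `BufferEstimate` and `worstCaseBytes` are hypothetical.

```scala
// Assumption: each entry costs one 16-bit short before optimization.
// Names are hypothetical, for illustration only.
object BufferEstimate {
  val BytesPerShort: Int = 2

  /** Upper bound on bytes held if every entry is stored as a short. */
  def worstCaseBytes(numEntries: Int): Long =
    numEntries.toLong * BytesPerShort
}
```

For the ~100k upper end mentioned above, `BufferEstimate.worstCaseBytes(100000)` gives 200000 bytes, i.e. roughly 200KB.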