Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9661#discussion_r44726575
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -176,15 +179,17 @@ private[spark] object HighlyCompressedMapStatus {
         // From a compression standpoint, it shouldn't matter whether we track empty or non-empty
         // blocks. From a performance standpoint, we benefit from tracking empty blocks because
         // we expect that there will be far fewer of them, so we will perform fewer bitmap insertions.
    +    val emptyBlocks = new RoaringBitmap()
    +    val nonEmptyBlocks = new RoaringBitmap()
         val totalNumBlocks = uncompressedSizes.length
    -    val emptyBlocks = new BitSet(totalNumBlocks)
         while (i < totalNumBlocks) {
           var size = uncompressedSizes(i)
           if (size > 0) {
             numNonEmptyBlocks += 1
    +        nonEmptyBlocks.add(i)
             totalSize += size
           } else {
    -        emptyBlocks.set(i)
    +        emptyBlocks.add(i)
           }
    --- End diff --
    
    +1
    Would this also eliminate the need to track both `emptyBlocks` and
    `nonEmptyBlocks`? E.g., after `emptyBlocks.runOptimize`, isn't storing just
    the empty blocks equally good? After this point we only care about the memory
    used and the time it takes to call `contains`, since these bitmaps are
    immutable once built.
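
A minimal sketch of that single-bitmap idea, assuming the `org.roaringbitmap` library is on the classpath. The object name `EmptyBlocksSketch`, the helper `buildEmptyBlocks`, and the sample data are hypothetical and are not code from this PR:

```scala
import org.roaringbitmap.RoaringBitmap

object EmptyBlocksSketch {
  // Hypothetical helper, not the PR's code: record only the empty blocks.
  def buildEmptyBlocks(uncompressedSizes: Array[Long]): RoaringBitmap = {
    val emptyBlocks = new RoaringBitmap()
    var i = 0
    while (i < uncompressedSizes.length) {
      if (uncompressedSizes(i) == 0L) {
        emptyBlocks.add(i)
      }
      i += 1
    }
    // Switch containers to run-length encoding wherever that is smaller.
    emptyBlocks.runOptimize()
    emptyBlocks
  }

  def main(args: Array[String]): Unit = {
    // Illustrative data: every 1000th block is empty, the rest are not.
    val sizes = Array.tabulate[Long](100000)(i => if (i % 1000 == 0) 0L else 1L)
    val emptyBlocks = buildEmptyBlocks(sizes)
    // A block is non-empty exactly when it is absent from the bitmap.
    println(s"block 0 empty: ${emptyBlocks.contains(0)}")
    println(s"block 1 empty: ${emptyBlocks.contains(1)}")
    println(s"in-memory size estimate: ${emptyBlocks.getSizeInBytes} bytes")
  }
}
```

With only one bitmap, a lookup for a non-empty block is just a negated `contains` call, so whether this is "just as good" comes down to the size reported after `runOptimize` and the lookup cost.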
    
    Does it also make sense to call `runOptimize` periodically as these are
    being built, to avoid using too much memory? Say the upper end of the size of
    these is ~100k entries. The worst case would then be storing 100k shorts, or
    ~200KB, before we call `runOptimize`? That isn't really too bad, so I'm
    inclined to keep things simple, but I thought it was worth thinking about
    this now while we're looking.
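
The back-of-the-envelope figure above can be checked with a few lines of plain Scala. This is only a sketch of the estimate as stated, assuming one 16-bit short per entry before `runOptimize` is called; the object and method names are hypothetical, and the library may store dense ranges more compactly, so treat the result as an upper bound under that assumption:

```scala
object WorstCaseEstimate {
  // Assumed upper end of the number of blocks, taken from the comment above.
  val numBlocks: Int = 100000
  // Assumption: each entry is held as one 16-bit short before runOptimize.
  val bytesPerEntry: Int = 2

  // Upper-bound estimate: every entry stored individually.
  def worstCaseBytes(entries: Int): Long = entries.toLong * bytesPerEntry

  def main(args: Array[String]): Unit = {
    val bytes = worstCaseBytes(numBlocks)
    println(s"$bytes bytes, about ${bytes / 1000} KB") // prints "200000 bytes, about 200 KB"
  }
}
```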


