Github user lemire commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9661#discussion_r44730743
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
    @@ -176,15 +179,17 @@ private[spark] object HighlyCompressedMapStatus {
         // From a compression standpoint, it shouldn't matter whether we track 
empty or non-empty
         // blocks. From a performance standpoint, we benefit from tracking 
empty blocks because
         // we expect that there will be far fewer of them, so we will perform 
fewer bitmap insertions.
    +    val emptyBlocks = new RoaringBitmap()
    +    val nonEmptyBlocks = new RoaringBitmap()
         val totalNumBlocks = uncompressedSizes.length
    -    val emptyBlocks = new BitSet(totalNumBlocks)
         while (i < totalNumBlocks) {
           var size = uncompressedSizes(i)
           if (size > 0) {
             numNonEmptyBlocks += 1
    +        nonEmptyBlocks.add(i)
             totalSize += size
           } else {
    -        emptyBlocks.set(i)
    +        emptyBlocks.add(i)
           }
    --- End diff --
    
    I am not exactly sure why you would keep track of both the empty and 
non-empty blocks. Seems redundant. But maybe I misunderstand.  But the comment 
above is probably right: you'd want to minimize the number of bitmap insertions.
    
    By design, even without ``runOptimize``, ``RoaringBitmap`` should not use 
more than 16 bits per value asymptotically. (Meaning that if you have just a 
few values, they will end up using more than 16 bits, but if you have 
thousands... then they should use 16 bits or less.) So while you cannot be sure 
that a ``RoaringBitmap`` will "compress well" for some meaning of "compress 
well", you can pretty much bound the memory usage with respect to 
totalNumBlocks. So for 100k possible values to be added, that would be 200kB... 
i.e., it would still fit in L2 cache. And that's an upper bound. Asympotically 
you are also sure not to use more memory than a ``BitSet`` with its wordsInUse 
set to the size of the universe, for a huge universe size, so if 100k is  
``totalNumBlocks`` then you neither a ``RoaringBitmap`` or a ``BitSet`` should 
use more than 12kB. 
    
    So my guess is that a single call to ``runOptimize`` at the end would be 
enough... 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to