Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/16989
  
    A few more high-level thoughts about this PR:
    
    - It seems like the benefits here come from three interrelated changes:
      - Improving the accuracy of map output size reporting for large shuffles 
where there is significant skew. This helps the existing `maxBytesInFlight` 
mechanism to avoid OOMs.
      - Taking blocks which are big in absolute terms (e.g. over the 200 MB 
threshold) and not even trying to buffer them in memory.
      - Using the MemoryManager to start forcing requests to disk when we 
detect a memory crunch.
    
    It seems like the third piece (memory manager integration) is the only one 
which might have tricky problems; the other two are straightforward and don't 
impact internal APIs that much. Therefore, what would you say about deferring 
that piece for now and only merging the first two pieces, then tackling the 
memory manager in a followup? My hunch is that the first two improvements give 
us most of the gains at very little complexity cost compared with trying to 
integrate with off heap memory accounting in a new way.
    
    (If you wanted to you could split this into two PRs: one which deals only 
with MapStatus compression accuracy improvements and another which forces 
blocks to disk over a certain fixed threshold).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to