[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876466#comment-15876466
 ] 

Imran Rashid commented on SPARK-19659:
--------------------------------------

[~jinxing6...@126.com]  Thanks for taking this on, I think this is a *really* 
important improvement for Spark -- but it's also changing some very core logic, 
which I think needs to be done with a lot of caution.  Can you please post a 
design doc here for discussion?

While the heuristics you are proposing seem reasonable, I have a number of 
concerns:

* What about when there are > 2k partitions and the individual block sizes are 
unknown (only an average is reported)?  Especially in the case of skew, this is 
a huge problem.  Perhaps we should first tackle that problem, so we have better 
size estimates (with bounded error) in that case (see the sketch after this 
list).
* I think it will need to be configured independently from maxBytesInFlight.
* Would it be possible to have the shuffle fetch memory usage tracked by the 
MemoryManager?  That would be another way to avoid OOM.  Note this is pretty 
tricky, since right now that memory is controlled by Netty.
* What are the performance ramifications of these changes?  What tests have 
been done to understand the effects?
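
To make the first concern concrete, here is a rough, self-contained sketch -- 
not Spark's actual MapStatus classes, all names are made up -- of why 
average-based size reporting above the 2k-partition threshold hides skewed 
blocks:

{code:scala}
// Toy model of block-size reporting: exact sizes below the partition
// threshold, only an average over non-empty blocks above it.
object SizeEstimationSketch {

  /** Exact per-block sizes, as kept for small partition counts. */
  case class ExactSizes(sizes: Array[Long]) {
    def sizeOf(reduceId: Int): Long = sizes(reduceId)
  }

  /** Above the threshold only an average over non-empty blocks survives. */
  case class AveragedSizes(avgSize: Long, nonEmpty: Array[Boolean]) {
    def sizeOf(reduceId: Int): Long = if (nonEmpty(reduceId)) avgSize else 0L
  }

  def main(args: Array[String]): Unit = {
    // 5000 blocks of ~1 KB each, plus one heavily skewed 2 GB block.
    val sizes = Array.fill(5000)(1024L)
    sizes(42) = 2L * 1024 * 1024 * 1024

    val avg = sizes.filter(_ > 0).sum / sizes.count(_ > 0)
    val approx = AveragedSizes(avg, sizes.map(_ > 0))

    println(s"real size of block 42:      ${sizes(42)}")         // ~2 GB
    println(s"estimated size of block 42: ${approx.sizeOf(42)}")  // ~430 KB
  }
}
{code}

Any heuristic that decides "fetch to memory vs. fetch to disk" based on these 
estimates will treat the skewed block as a small one, which is exactly the 
case we most need to catch.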

I still think that streaming shuffle fetches to disk is a good idea, but we 
should think carefully about the right way to control it, and perhaps some of 
these other ideas should be done first.  It's at least worth discussing before 
jumping straight into the implementation.
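
As a strawman for the control question, here is a minimal sketch of the 
fetch-to-disk path -- this is not Spark's ShuffleBlockFetcherIterator, and the 
class names, the threshold value, and the idea of a knob separate from 
maxBytesInFlight are all assumptions for illustration:

{code:scala}
import java.io.{ByteArrayOutputStream, File, FileOutputStream, InputStream, OutputStream}
import java.nio.file.Files

object FetchToDiskSketch {

  sealed trait FetchedBlock
  case class InMemoryBlock(bytes: Array[Byte]) extends FetchedBlock
  case class OnDiskBlock(file: File) extends FetchedBlock

  // Hypothetical knob, configured independently of maxBytesInFlight.
  val fetchToDiskThreshold: Long = 64L * 1024 * 1024

  def fetchBlock(in: InputStream, estimatedSize: Long): FetchedBlock = {
    try {
      if (estimatedSize < fetchToDiskThreshold) {
        // Small block: buffer fully in memory, as the current code path does.
        val buffer = new ByteArrayOutputStream()
        copy(in, buffer)
        InMemoryBlock(buffer.toByteArray)
      } else {
        // Large (possibly skewed) block: stream straight to local disk so the
        // reducer never holds the whole block in memory at once.
        val file = Files.createTempFile("shuffle-fetch-", ".data").toFile
        val out = new FileOutputStream(file)
        try copy(in, out) finally out.close()
        OnDiskBlock(file)
      }
    } finally {
      in.close()
    }
  }

  private def copy(in: InputStream, out: OutputStream): Unit = {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
  }
}
{code}

Note that the decision is driven by estimatedSize, so it is only as good as 
the size information in the map status -- which loops back to the first bullet 
above.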

> Fetch big blocks to disk when shuffle-read
> ------------------------------------------
>
>                 Key: SPARK-19659
>                 URL: https://issues.apache.org/jira/browse/SPARK-19659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.1.0
>            Reporter: jin xing
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can 
> be large in skew situations. If OOM happens during shuffle-read, the job is 
> killed and users are told to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting that parameter and allocating 
> more memory can resolve the OOM, but this approach is not well suited to a 
> production environment, especially a data warehouse.
> With Spark SQL as the data engine in a warehouse, users want a unified 
> parameter (e.g. memory) with less wasted resource (resource that is allocated 
> but not used).
> Skew is not always easy to predict; when it happens, it makes sense to fetch 
> remote blocks to disk for shuffle-read rather than kill the job because of 
> OOM. This approach was mentioned in the discussion on SPARK-3019 by [~sandyr] 
> and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
