[
https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ioannis Deligiannis updated SPARK-13718:
----------------------------------------
Attachment: TestIssue.java
Test Case: Passing a JavaSparkContext to the test method and tuning some parameters
to match the cluster should reproduce the problem.
> Scheduler "creating" straggler node
> ------------------------------------
>
> Key: SPARK-13718
> URL: https://issues.apache.org/jira/browse/SPARK-13718
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 1.3.1
> Environment: Spark 1.3.1
> MapR-FS
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 48 cores & 512 GB RAM / node
> Data replication factor of 3
> Each node has 4 Spark executors configured with 12 cores each and 22GB of RAM.
> Reporter: Ioannis Deligiannis
> Priority: Minor
> Attachments: TestIssue.java, spark_struggler.jpg
>
>
> *Data:*
> * Assume an even distribution of data across the cluster with a replication
> factor of 3.
> * In-memory data are partitioned into 128 chunks (384 cores in total, i.e.
> roughly 3 such requests can be executed concurrently); a sketch of the caching
> setup follows this list.
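> For illustration, a minimal sketch of the caching setup described above; the
> input path and record type are placeholders, not the actual job:
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.storage.StorageLevel;
>
> public class CacheSetupSketch {
>     public static void main(String[] args) {
>         // Kryo serialization for the in-memory cache, as described above
>         SparkConf conf = new SparkConf()
>                 .setAppName("cache-setup-sketch")
>                 .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
>         JavaSparkContext sc = new JavaSparkContext(conf);
>
>         // Placeholder input path; 128 partitions as in this report
>         JavaRDD<String> data = sc.textFile("maprfs:///path/to/input")
>                 .repartition(128)
>                 .persist(StorageLevel.MEMORY_ONLY_SER());
>
>         // Materialise the cache once so later actions read from memory
>         data.count();
>         sc.stop();
>     }
> }
> {code}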
> *Action:*
> * The action is a simple sequence of map/filter/reduce.
> * The action operates upon, and returns, a small subset of the data (following
> a full map over the data).
> * The data are cached once, serialized in memory with Kryo, so calling the
> action should not hit the disk under normal conditions.
> * Network usage for the action is low, as it returns a small number of
> aggregated results and does not require excessive shuffling.
> * Under low or moderate load, each action is expected to complete in less than
> 2 seconds. A sketch of the action's shape follows this list.
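> A sketch of the action's shape only (full map, filter, small aggregated
> result); the real job is not attached here, and 'data' is the cached
> JavaRDD<String> from the caching sketch above:
> {code:java}
> import org.apache.spark.api.java.JavaRDD;
>
> public class ActionSketch {
>     // Full map over the cached data, a filter, and a reduce that returns
>     // a small aggregated result to the driver.
>     static int runAction(JavaRDD<String> data) {
>         return data
>                 .map(line -> line.length())   // full map over every cached partition
>                 .filter(len -> len > 0)       // keep a small subset
>                 .reduce((a, b) -> a + b);     // small aggregate back to the driver
>     }
> }
> {code}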
> *H/W Outlook:*
> When the action is called in high numbers, the cluster CPU initially gets
> close to 100% (which is expected and intended).
> After a while, cluster utilization drops significantly, with only one
> (straggler) node at 100% CPU and with its network fully utilized.
> *Diagnosis:*
> 1. Attached a profiler to the driver and executors to monitor for GC or I/O
> issues; everything is normal under both low and heavy usage.
> 2. Cluster network usage is very low.
> 3. No issues on the Spark UI, except that task locality begins to move from
> LOCAL to ANY.
> *Cause (corrected after finding details in the code):*
> 1. Node 'H' is doing marginally more work than the rest (being a little
> slower and at almost 100% CPU).
> 2. The scheduler hits the default 3000 ms spark.locality.wait and assigns the
> task to other nodes.
> 3. One of the nodes, 'X', that accepted the task will try to access the data
> from the HDD of node 'H'. This adds network I/O to node 'H' and also some
> extra CPU for serving the I/O.
> 4. The time for 'X' to complete increases ~5x as the read goes over the
> network.
> 5. Eventually, every node has a task waiting to fetch that specific partition
> from node 'H', so the cluster is essentially blocked on a single node.
> What I managed to figure out from the code is that this happens because, if an
> RDD is cached, Spark will use BlockManager.getRemote() and will not recompute
> the part of the DAG that produced the RDD; it therefore always hits the node
> that has cached the RDD.
> *Proposed Fix:*
> I have not worked with the Scala & Spark source code enough to propose a code
> fix, but at a high level, when a task hits the 'spark.locality.wait' timeout,
> it could make use of a new configuration, e.g.
> recomputeRddAfterLocalityTimeout, instead of always trying to fetch the cached
> RDD block remotely. This would be very useful if it could also be set manually
> on the RDD.
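> A sketch of how the proposed knob might look from the application side; note
> that recomputeRddAfterLocalityTimeout does *not* exist in Spark, it only
> illustrates the proposal above:
> {code:java}
> import org.apache.spark.SparkConf;
>
> public class ProposedFixSketch {
>     public static SparkConf sketch() {
>         // HYPOTHETICAL: "spark.locality.recomputeRddAfterLocalityTimeout" does
>         // not exist in Spark; it only illustrates the proposal above. Once
>         // spark.locality.wait expires, the task would recompute the partition
>         // locally instead of fetching the cached block from the overloaded node.
>         return new SparkConf()
>                 .set("spark.locality.wait", "3000")  // existing setting (milliseconds in 1.3.x)
>                 .set("spark.locality.recomputeRddAfterLocalityTimeout", "true");  // proposed only
>     }
> }
> {code}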
> *Workaround:*
> Playing with 'spark.locality.wait' values, there is a deterministic value,
> depending on the partitions and configuration, at which the problem ceases to
> exist.
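> For reference, a sketch of the workaround; the value is an example and must be
> tuned per cluster (in 1.3.x the setting is interpreted as milliseconds):
> {code:java}
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaSparkContext;
>
> public class LocalityWaitWorkaround {
>     public static void main(String[] args) {
>         // Adjusting spark.locality.wait changes how long a task waits for a
>         // local slot before falling back to ANY and fetching the cached
>         // partition over the network. The right value depends on the partition
>         // count and cluster configuration, as noted above.
>         SparkConf conf = new SparkConf()
>                 .setAppName("locality-wait-workaround")
>                 .set("spark.locality.wait", "10000");  // example value, tune per cluster
>         JavaSparkContext sc = new JavaSparkContext(conf);
>         // ... run the cached map/filter/reduce action as usual ...
>         sc.stop();
>     }
> }
> {code}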
> *PS1*: I don't have enough Scala skills to follow the issue or propose a fix,
> but I hope this has enough information to make sense.
> *PS2*: Debugging this issue made me realize that there can be a lot of
> use-cases that trigger this behaviour.