[
https://issues.apache.org/jira/browse/SPARK-13718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182901#comment-15182901
]
Ioannis Deligiannis edited comment on SPARK-13718 at 3/7/16 11:48 AM:
----------------------------------------------------------------------
Point taken, though I'd rank it higher than minor since it severely affects
non-batch applications (which, in client-application terms, would be considered
a bug, though not a Spark bug). In any case, do you think this would be better
placed on the mailing list?
was (Author: jid1):
Point taken, though I'd rank it higher than minor since it severely affects
non-batch applications (which, in application terms, would be considered a bug).
In any case, do you think this would be better placed on the mailing list?
> Scheduler "creating" straggler node
> ------------------------------------
>
> Key: SPARK-13718
> URL: https://issues.apache.org/jira/browse/SPARK-13718
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 1.3.1
> Environment: Spark 1.3.1
> MapR-FS
> Single Rack
> Standalone mode scheduling
> 8 node cluster
> 48 cores & 512 GB RAM / node
> Data Replication factor of 3
> Each Node has 4 Spark executors configured with 12 cores each and 22GB of RAM.
> Reporter: Ioannis Deligiannis
> Priority: Minor
>
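> A configuration sketch (illustrative values only, inferred from the environment
> above; not the actual cluster files) of how the 4 x 12-core / 22 GB executor
> layout might look in standalone mode:
> {code}
> # spark-env.sh -- 4 workers per node, each backing one executor
> SPARK_WORKER_INSTANCES=4
> SPARK_WORKER_CORES=12
> SPARK_WORKER_MEMORY=24g
>
> # spark-defaults.conf
> spark.executor.memory   22g
> spark.serializer        org.apache.spark.serializer.KryoSerializer
> {code}
>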
> *Data:*
> * Assume an even distribution of data across the cluster with a replication
> factor of 3.
> * In-memory data are partitioned into 128 chunks (384 cores in total, i.e. 3
> requests can be executed concurrently(-ish)).
> *Action:*
> * Action is a simple sequence of map/filter/reduce (a minimal sketch follows
> this list).
> * The action operates on and returns a small subset of the data (following the
> full map over the data).
> * Data are cached once, serialized in memory (Kryo), so calling the action
> should not hit the disk under normal conditions.
> * Action network usage is low, as it returns a small number of aggregated
> results and does not require excessive shuffling.
> * Under low or moderate load, each action is expected to complete in less
> than 2 seconds.
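> A minimal sketch (hypothetical path, column layout and app name; not the
> actual job) of the cached, Kryo-serialized RDD and the map/filter/reduce
> action described above:
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
>
> val conf = new SparkConf()
>   .setAppName("straggler-repro")  // hypothetical app name
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> val sc = new SparkContext(conf)
>
> // 128 partitions, cached once, serialized in memory.
> val data = sc.textFile("maprfs:///data/events", 128)  // hypothetical path
>   .persist(StorageLevel.MEMORY_ONLY_SER)
>
> // Simple map/filter/reduce returning a small aggregated result.
> val result = data
>   .map(_.split(","))
>   .filter(fields => fields.length > 3)     // hypothetical predicate
>   .map(fields => fields(2).toDouble)       // hypothetical column
>   .reduce(_ + _)
> {code}
>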
> *H/W Outlook*
> When the action is called in high numbers, initially the cluster CPU gets
> close to 100% (which is expected & intended).
> After a while, cluster utilization drops significantly, with only one
> (straggler) node at 100% CPU and with its network fully utilized.
> *Diagnosis:*
> 1. Attached a profiler to the driver and executors to monitor GC and I/O
> issues; everything looks normal under both low and heavy usage.
> 2. Cluster network usage is very low.
> 3. No issues on the Spark UI, except that tasks begin to move from LOCAL to ANY.
> *Cause:*
> 1. Node 'H' is doing marginally more work than the rest (being a little
> slower and at almost 100% CPU)
> 2. The scheduler hits the default 3000 ms spark.locality.wait and assigns the
> task to other nodes (in some cases it first falls back to NODE, which means
> loading from HDD, and then continues down the locality sequence to ANY).
> 3. One of the nodes 'X' that accepted the task will eventually try to access
> the data from node 'H''s HDD. This adds HDD and network I/O to node 'H', plus
> some extra CPU for the I/O.
> 4. 'X''s time to complete increases ~5x, as the task now involves HDD/network I/O.
> 5. Eventually, every node has a task waiting to fetch that specific
> partition from node 'H', so the cluster is effectively blocked on a single node.
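> For reference, the locality fallback is governed by these settings (Spark 1.3.x
> defaults shown, in milliseconds; the per-level keys fall back to
> spark.locality.wait when unset):
> {code}
> spark.locality.wait          3000   # base wait before relaxing locality
> spark.locality.wait.process  3000   # PROCESS_LOCAL -> NODE_LOCAL
> spark.locality.wait.node     3000   # NODE_LOCAL    -> RACK_LOCAL
> spark.locality.wait.rack     3000   # RACK_LOCAL    -> ANY
> {code}
>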
> *Proposed Fix*
> I have not worked with Scala enough to propose a code fix, but at a high
> level, when a task hits the 'spark.locality.wait' timeout, the scheduler
> should give the node accepting the task a 'hint' to use a data-source replica
> that is not on the node that failed to accept the task in the first place
> (see the sketch below).
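> A hypothetical sketch of the hint (pure illustration; not actual Spark
> scheduler code or API):
> {code:scala}
> // When locality times out on timedOutHost, prefer a replica hosted elsewhere.
> def pickReplica(replicaHosts: Seq[String], timedOutHost: String): String =
>   replicaHosts.find(_ != timedOutHost).getOrElse(replicaHosts.head)
>
> // e.g. pickReplica(Seq("nodeH", "nodeB", "nodeC"), "nodeH") == "nodeB"
> {code}
>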
> *Workaround*
> Playing with 'spark.locality.wait' values, there is a deterministic value,
> depending on partitioning and configuration, at which the problem ceases to
> exist (example below).
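> For example (an illustrative value, not a recommendation), raising the wait so
> that a briefly busy node keeps its node-local tasks:
> {code}
> # Value is illustrative; class and jar names are hypothetical.
> spark-submit \
>   --conf spark.locality.wait=10000 \
>   --class com.example.MyJob \
>   my-job.jar
> {code}
>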
> *PS1* : I don't have enough Scala skills to follow the issue or propose a fix,
> but I hope this has enough information to make sense.
> *PS2* : Debugging this issue made me realize that there can be a lot of
> use cases that trigger this behaviour.