[
https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248367#comment-14248367
]
Harry Brundage commented on SPARK-4732:
---------------------------------------
Seems like it would, feel free to mark as duplicate!
> All application progress on the standalone scheduler can be halted by one
> systematically faulty node
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-4732
> URL: https://issues.apache.org/jira/browse/SPARK-4732
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Environment: - Spark Standalone scheduler
> Reporter: Harry Brundage
>
> We've experienced several cluster-wide outages caused by unexpected faults on
> one of our Spark workers when that worker is failing systematically. By
> systematically, I mean that every executor launched by that worker will
> definitely fail for some reason outside Spark's control, like the log
> directory disk being completely out of space, or a permissions error on a
> file that's always read during executor launch. We screw up all the time on
> our team and cause stuff like this to happen, but because of the way the
> standalone scheduler allocates resources, our cluster doesn't recover
> gracefully from these failures.
> When there are more tasks to do than executors, I am pretty sure the
> scheduler just waits for more resource offers and then allocates tasks from
> the queue to those resources. If an executor dies immediately after starting,
> the worker monitor process notices that it's dead. The master then allocates
> that worker's now-free cores/memory to a currently running application that
> is below its spark.cores.max, which in our case I've observed is usually the
> app whose executor just died. A new executor gets spawned on the same worker
> the last one just died on, gets allocated the one task that failed, and the
> whole process fails again for the same systematic reason; lather, rinse,
> repeat. This happens 10 times, or whatever the max task failure count is, and
> then the whole app is deemed a failure by the driver and shut down
> completely.
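> To make that loop concrete, here is a toy sketch of the behaviour as I
> understand it. This is not Spark's actual code, every name in it is made up,
> and the constant just stands in for spark.task.maxFailures:
> {code:scala}
> // Toy model of the retry loop described above; no real Spark classes here.
> object FaultyWorkerLoop {
>   val maxTaskFailures = 4 // stand-in for spark.task.maxFailures
>
>   case class Worker(id: String, systematicallyBroken: Boolean)
>
>   // An executor on a systematically broken worker dies at launch, every time.
>   def launchExecutor(w: Worker): Boolean = !w.systematicallyBroken
>
>   def main(args: Array[String]): Unit = {
>     val badWorker = Worker("worker-7", systematicallyBroken = true)
>     var failures = 0
>     // The master keeps re-offering the freed cores of the same worker to the
>     // same application, so the retried task lands on the broken node again.
>     while (failures < maxTaskFailures && !launchExecutor(badWorker)) {
>       failures += 1
>       println(s"Executor on ${badWorker.id} died at launch ($failures/$maxTaskFailures)")
>     }
>     println("Max task failures reached -- the driver gives up and the whole app is shut down")
>   }
> }
> {code}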
> This happens to every application in the cluster as well. We usually run
> roughly as many cores as we have Hadoop nodes. We also usually have many more
> input splits than we have task slots, which means the locality of the first
> few tasks, which I believe determines where our executors run, is well spread
> out over the cluster, and often covers 90-100% of nodes. This means the
> likelihood of any application getting an executor scheduled on a broken node
> is quite high. After an old application goes through the above-mentioned
> process and dies, the next application to start, or to not yet be at its
> requested max capacity, gets an executor scheduled on the broken node and is
> promptly taken down as well. This happens over and over, to the point where
> none of our Spark jobs are making any progress because of one tiny
> permissions mistake on one node.
> Now, I totally understand this is usually an "error between keyboard and
> screen" kind of situation where it is the responsibility of the people
> deploying Spark to ensure it is deployed correctly. The systematic issues
> we've encountered are almost always of this nature: permissions errors, disks
> filling up, one node not getting a new Spark jar because of a configuration
> error, configurations being out of sync, and so on. That said, disks are
> going to fail or half-fail, fill up, node rot is going to ruin
> configurations, and as Hadoop clusters grow this becomes more and more
> likely, so I think it's reasonable to ask that Spark be resilient to this
> kind of failure and keep on truckin'.
> I think a good simple fix would be to have applications, or the master,
> blacklist workers (not executors) at a failure count lower than the task
> failure count. This would also serve as a belt and suspenders fix for
> SPARK-4498.
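> As a rough illustration of what I mean by blacklisting workers (again, not
> real Spark code, every name below is invented), the master or the application
> could keep something like this and consult it before accepting a resource
> offer:
> {code:scala}
> import scala.collection.mutable
>
> // Sketch of a per-worker blacklist: after a handful of executor failures on a
> // worker, stop offering that worker's resources. The threshold would sit well
> // below the task failure count so applications survive a broken node.
> class WorkerBlacklist(maxExecutorFailuresPerWorker: Int = 2) {
>   private val failuresByWorker = mutable.Map.empty[String, Int].withDefaultValue(0)
>
>   // Called whenever an executor on `workerId` exits abnormally at launch.
>   def recordExecutorFailure(workerId: String): Unit =
>     failuresByWorker(workerId) += 1
>
>   // The scheduler would skip offers from blacklisted workers, so a
>   // systematically broken node stops receiving new executors.
>   def isBlacklisted(workerId: String): Boolean =
>     failuresByWorker(workerId) >= maxExecutorFailuresPerWorker
>
>   // Something like this is what I'd want logged and surfaced in the UI.
>   def blacklistedWorkers: Set[String] =
>     failuresByWorker.collect { case (w, n) if n >= maxExecutorFailuresPerWorker => w }.toSet
> }
> {code}
> Whether this lives in the master or in each application's scheduler is the
> part I'm least sure about; either way the point is that the failure count is
> tracked per worker, not per executor or per task.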
> If the scheduler stopped trying to schedule on nodes that fail a lot, we
> could still make progress. These blacklist events are really important and I
> think would need to be well logged and surfaced in the UI, but I'd rather log
> and carry on than fail hard. I think the tradeoff here is that you risk
> blacklisting every worker as well if there is something systematically wrong
> with communication or whatever else I can't imagine.
> Please let me know if I've misunderstood how the scheduler works, or if you
> need more information or anything like that, and I'll be happy to provide it.