Github user mariahualiu commented on the issue:

    https://github.com/apache/spark/pull/17854
  
    @squito yes, I capped the number of resources in updateResourceRequests so 
that YarnAllocator asks for fewer resources in each iteration. When allocation 
fails in one iteration, the request is added back and YarnAllocator will try to 
allocate the leftover (from the previous iteration) plus the new requests in 
the next iteration, which can result in a lot of allocated containers. The 
second change, as you pointed out, is meant to address this possibility. On 
second thought, it may be a better solution to change AMRMClientImpl::allocate 
so that it does not add all resource requests from ask to askList. 
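
    To illustrate the capping idea, here is a minimal sketch (not the actual change in this PR); the helper name, parameters, and cap value are assumptions:

```scala
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// Hypothetical helper: issue at most `maxPerCycle` new container requests in a
// single allocation cycle instead of the full backlog. The names here are
// illustrative, not the actual YarnAllocator fields.
def requestCappedContainers(
    amClient: AMRMClient[ContainerRequest],
    resource: Resource,
    priority: Priority,
    missing: Int,
    maxPerCycle: Int): Int = {
  val toRequest = math.min(missing, maxPerCycle)
  (0 until toRequest).foreach { _ =>
    // Locality hints omitted for brevity; the real allocator attaches node/rack preferences.
    amClient.addContainerRequest(new ContainerRequest(resource, null, null, priority))
  }
  // The caller keeps (missing - toRequest) as leftover for the next cycle.
  toRequest
}
```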
    
    @tgravescs I tried reducing spark.yarn.containerLauncherMaxThreads but it 
didn't help much. My understanding is that these threads send container launch 
commands to the node managers and return immediately, which is very lightweight 
and can be extremely fast. Launching the container on the NM side is an async 
operation. 
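
    For reference, that thread pool size is controlled by spark.yarn.containerLauncherMaxThreads and is typically passed on spark-submit; the value and class/jar names below are just illustrative:

```
spark-submit \
  --master yarn \
  --conf spark.yarn.containerLauncherMaxThreads=10 \
  --class com.example.MyApp myapp.jar
```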
    


