Ryan Blue created SPARK-13779:
---------------------------------

             Summary: YarnAllocator cancels and resubmits container requests 
with no locality preference
                 Key: SPARK-13779
                 URL: https://issues.apache.org/jira/browse/SPARK-13779
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.6.0
            Reporter: Ryan Blue


SPARK-9817 attempts to improve locality by considering the set of pending 
container requests. Pending requests with a locality preference that is no 
longer needed or no locality preference are cancelled, then resubmitted with 
updated locality preferences.

When running over data in S3, some stages have no locality information so the 
result is that the current logic cancels all pending requests and resubmits 
them (still without locality preferences) on every call to 
[`updateResourceRequests()`|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L273].

I propose the following update to avoid this problem:
1. Cancel any pending requests with stale locality preferences
2. Calculate N new requests, where N is the number of new containers + 
cancelled stale requests + outstanding requests without locality
3. If the number of new requests with a locality preference was larger than the 
available count (new containers + cancelled stale requests), then cancel enough 
requests with no locality preference to be able to submit all of the requests 
with a locality preference.
4. If the number of new requests with a locality preference is smaller than the 
available count then submit all of the locality requests and a request with no 
locality preference for the remaining available count. No pending requests with 
no locality are cancelled.

This strategy only cancels requests with no locality preference if a new 
request can be made that has a locality preference. Cancelling stale locality 
requests happens as it does today. I've tested this on large S3 jobs (50,000+ 
tasks) and it fixes the request thrashing problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to