[
https://issues.apache.org/jira/browse/SPARK-13779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ryan Blue updated SPARK-13779:
------------------------------
Description:
SPARK-9817 attempts to improve locality by considering the set of pending
container requests. Pending requests with a locality preference that is no
longer needed or no locality preference are cancelled, then resubmitted with
updated locality preferences.
When running over data in S3, some stages have no locality information, so the
current logic cancels all pending requests and resubmits them (still without
locality preferences) on every call to
[{{updateResourceRequests()}}|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L273].
I propose the following update to avoid this problem:
# Cancel any pending requests with stale locality preferences
# Calculate N new requests, where N is the number of new containers + cancelled
stale requests + outstanding requests without locality
# If the number of new requests with a locality preference is larger than the
available count (new containers + cancelled stale requests), then cancel enough
requests with no locality preference to be able to submit all of the requests
with a locality preference.
# If the number of new requests with a locality preference is smaller than the
available count, then submit all of the locality requests plus a request with no
locality preference for the remaining available count. No pending requests with
no locality preference are cancelled.
This strategy only cancels requests with no locality preference if a new
request can be made that has a locality preference. Cancelling stale locality
requests happens as it does today. I've tested this on large S3 jobs (50,000+
tasks) and it fixes the request thrashing problem.
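The counting logic in these steps can be sketched as follows. This is a minimal, hypothetical illustration; {{planRequests}} and its parameter names are illustrative, not actual YarnAllocator fields:

```scala
// Sketch of the proposed planning step (hypothetical names, not real
// YarnAllocator fields). Given the candidate counts, decide how many
// any-host requests to cancel and what to submit.
def planRequests(
    newContainers: Int,     // newly needed containers
    staleLocality: Int,     // pending requests cancelled for stale locality
    pendingAnyHost: Int,    // pending requests with no locality preference
    localityPreferred: Int  // new requests that carry a locality preference
): (Int, Int, Int) = {      // (anyHostToCancel, localityToSubmit, anyHostToSubmit)
  // Slots that can be filled without touching any-host requests:
  // new containers plus cancelled stale-locality requests.
  val available = newContainers + staleLocality
  if (localityPreferred > available) {
    // Cancel just enough any-host requests so every locality-preferred
    // request can be submitted.
    val cancelAnyHost = math.min(localityPreferred - available, pendingAnyHost)
    (cancelAnyHost, localityPreferred, 0)
  } else {
    // Submit all locality requests and fill the remainder with any-host
    // requests; existing any-host requests stay pending.
    (0, localityPreferred, available - localityPreferred)
  }
}
```

Note that for a stage with no locality information at all ({{localityPreferred == 0}}), this plan never cancels pending any-host requests, which is what avoids the thrashing.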
was:
SPARK-9817 attempts to improve locality by considering the set of pending
container requests. Pending requests with a locality preference that is no
longer needed or no locality preference are cancelled, then resubmitted with
updated locality preferences.
When running over data in S3, some stages have no locality information so the
result is that the current logic cancels all pending requests and resubmits
them (still without locality preferences) on every call to
[`updateResourceRequests()`|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L273].
I propose the following update to avoid this problem:
# Cancel any pending requests with stale locality preferences
# Calculate N new requests, where N is the number of new containers + cancelled
stale requests + outstanding requests without locality
# If the number of new requests with a locality preference is larger than the
available count (new containers + cancelled stale requests), then cancel enough
requests with no locality preference to be able to submit all of the requests
with a locality preference.
# If the number of new requests with a locality preference is smaller than the
available count then submit all of the locality requests and a request with no
locality preference for the remaining available count. No pending requests with
no locality are cancelled.
This strategy only cancels requests with no locality preference if a new
request can be made that has a locality preference. Cancelling stale locality
requests happens as it does today. I've tested this on large S3 jobs (50,000+
tasks) and it fixes the request thrashing problem.
> YarnAllocator cancels and resubmits container requests with no locality
> preference
> ----------------------------------------------------------------------------------
>
> Key: SPARK-13779
> URL: https://issues.apache.org/jira/browse/SPARK-13779
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.0
> Reporter: Ryan Blue
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)