[
https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated TEZ-2882:
----------------------------------
Attachment: TEZ-2882.4.patch
- Configs - annotate as private, unstable. Also fraction instead of percentage
in the name
-- Fixed.
- int threshold = (int) Math.max(1, numUniqueHosts * hostFailurePercentage); -
should the right argument be a ceil(,).
Same line - The threshold calculation should probably be different for smaller
clusters. 1 node should never be enough. Min limit of 3? . Can (-1) be treated
as a special value for the threshold configuration parameter, to completely
disable this check ?
-- Fixed. Disable check is added to isFetcherHealthy()
- LOG.info("numUniqueHosts=" + numUniqueHosts ... - Requires more context. Does
the input name need to be logged, or is that part of the thread name. Logging
which srcAttempt triggered this log line would be useful as well.
- Similarly for the log line in isFetcherHealthy
-- Fixed. LogContext added which has info on srcAttempt. Thread info is
available in log already which contains the vertex details.
- if(pathToIdentifierMap.size() == numInputs) { <- Does this break for
pipelined shuffle ?
-- Added allEventsReceived() to handle this. shuffleInfoEventsMap holds the
details for pipelined shuffle case. Also in normal case, it needs to be
(pathToIdentifierMap + skippedInputCounter) as it needs to handle empty
partitions case as well.
- I believe this condition is meant to measure pendingHosts instead of
remainingMaps.get() (considering parallel instances of a host are never
scheduled). However, remainingMaps works better because of the way the Fetcher
reports failures. If this is correct - could you please add a comment along
with the condition. Fetcher reporting failures needs to be looked at along with
any modifications to this code.
-- Fixed.
- Same condition - do we need to special case only a single input remaining ?
- healthySinceLastUpdate - This will almost always be true, given the previous
condition.
- Should failedAcrossNodes be outside of the condition "failedShuffle >
remaining * allowed)
-- failedAcrossNodes check has been moved up in the chain. With this, we do not
need to special case for single input remaining. Also, set the
minFailurePerHost limit to 4. Couple of cases here.
--- 1) numInputs = 1 & errors happen. If so, it would be handled by
hasFailedAcrossNodes() and fetching would be declared unhealthy. Along with
this, stall duration is included to kill the consumer.
--- 2) numInputs > 1, but remaining item to be fetched is 1 (e.g 319/320). If
lots of errors were seen, MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT
(maxAllowedFailedFetchFraction) would be triggered and fetching would be
declared unhealthy. Otherise, it would enter
"(failedShufflesSinceLastCompletion >= remainingMaps.get() *
minFailurePerHost)" condition to mark fetching unhealthy. This is expected
behavior when 319/320 are complete. Also, it would have enter this condition
after trying out multiple times. So it is safe to mark fetching as unhealthy.
Corner case: If lots of errors are NOT seen, and if number of errors seen since
last update crosses minFailurePerHost * remaining, "fetching" would be marked
as unhealthy. 4*180 = 720 seconds should be good for this, as AM would have
already restarted the producer after 300 seconds of deadline.
--- 3) numInputs > 1, but remaining items to be fetched is > 1 (e.g 300/320).
Same as previous case, except that it would wait for some more retries to
happen before declaring "fetching" as unhealthy.
--- 4) corner case. numInputs > 1, remaining items to be fetched is > 1 (e.g
300/320); but all items are present in single source. In this case, it would
take a little longer time for "fetching" to be declared as unhealthy. Node
failure checks would fail in this scenario and it is left to the condition to
do the error detection.
- Config param for abortFailureLimit ? (-1 defaults to code, anything else is
an absolute value). Reducing it to 15 only helps for smaller jobs. numInputs/10
will take over otherwise.
-- Added TEZ_RUNTIME_SHUFFLE_SOURCE_ATTEMPT_ABORT_LIMIT. Setting to -1 defaults
to code.
- Nit: "Preconditions.checkArgument(uniqueHosts.size() > 0, "No values in
unique hosts");" - can be after numUniqueHosts
-- Fixed
- Nit: There's a TODO which needs to be removed
-- Fixed
- renamed hasIndividualAttemptExceeded to
"isAbortLimitExceeedFor(InputAttemptIdentifier srcAttempt) ". Added error msg
and removed fillInStackTrace(unrelated to this)/
- Added configs for
TEZ_RUNTIME_SHUFFLE_MAX_ALLOWED_FAILED_FETCH_ATTEMPT_FRACTION,
TEZ_RUNTIME_SHUFFLE_MIN_REQUIRED_PROGRESS_FRACTION,TEZ_RUNTIME_SHUFFLE_MAX_STALL_TIME_FRACTION,
TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST,
TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION,
TEZ_RUNTIME_SHUFFLE_SOURCE_ATTEMPT_ABORT_LIMIT
- renamed shuffleInfoEventsMap to pipelinedShuffleInfoEventsMap
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch,
> TEZ-2882.4.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)