[ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2882:
----------------------------------
    Attachment: TEZ-2882.4.patch

- Configs - annotate as private, unstable. Also use "fraction" instead of 
"percentage" in the name.
-- Fixed.

- int threshold = (int) Math.max(1, numUniqueHosts * hostFailurePercentage); - 
should the right argument be wrapped in a ceil()?
Same line - The threshold calculation should probably be different for smaller 
clusters. 1 node should never be enough; min limit of 3? Can (-1) be treated 
as a special value for the threshold configuration parameter, to completely 
disable this check?
-- Fixed. The disable check is added to isFetcherHealthy().
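For illustration, roughly how that threshold logic could look (identifiers and 
the floor value here are placeholders, not the actual patch code):

{code:java}
// Sketch only: names and defaults are illustrative, not TEZ-2882 patch code.
final class HostFailureThreshold {
  static final int DISABLED = -1;      // -1 in the config disables the check
  static final int MIN_THRESHOLD = 3;  // 1 node should never be enough

  static int compute(int numUniqueHosts, float hostFailureFraction) {
    if (hostFailureFraction < 0) {
      return DISABLED;                 // caller (isFetcherHealthy) skips the check
    }
    // ceil() so fractional products round up, e.g. 10 hosts * 0.25 -> 3
    int threshold = (int) Math.ceil(numUniqueHosts * hostFailureFraction);
    return Math.max(MIN_THRESHOLD, threshold);
  }
}
{code}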

- LOG.info("numUniqueHosts=" + numUniqueHosts ... - Requires more context. Does 
the input name need to be logged, or is that part of the thread name? Logging 
which srcAttempt triggered this log line would be useful as well.
- Similarly for the log line in isFetcherHealthy.
-- Fixed. A LogContext has been added which carries info on the srcAttempt. 
Thread info, which contains the vertex details, is already available in the 
log.
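Roughly the kind of context the log line gains (a sketch, not the patch's 
actual LogContext):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: shows the kind of context (srcAttempt plus host counts) the
// review asks for; the patch's LogContext may carry different fields.
final class FetchLogSketch {
  private static final Logger LOG = LoggerFactory.getLogger(FetchLogSketch.class);

  static void logHealthCheck(int numUniqueHosts, Object srcAttempt) {
    // the thread name already carries the vertex details
    LOG.info("Fetcher health check: numUniqueHosts={}, srcAttempt={}",
        numUniqueHosts, srcAttempt);
  }
}
{code}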

- if(pathToIdentifierMap.size() == numInputs) { <- Does this break for 
pipelined shuffle?
-- Added allEventsReceived() to handle this. shuffleInfoEventsMap holds the 
details for the pipelined shuffle case. Also, in the normal case the count 
needs to be (pathToIdentifierMap + skippedInputCounter), since it has to 
handle the empty-partitions case as well.
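A sketch of that allEventsReceived() idea (field names mirror the discussion 
above, but their shape here is an assumption, not the patch's code):

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: not the patch's ShuffleScheduler code.
final class EventCompletionSketch {
  Map<String, Object> pathToIdentifierMap = new HashMap<>();
  Map<Integer, Object> pipelinedShuffleInfoEventsMap = new HashMap<>();
  AtomicInteger skippedInputCounter = new AtomicInteger();
  int numInputs;

  boolean allEventsReceived() {
    if (!pipelinedShuffleInfoEventsMap.isEmpty()) {
      // pipelined shuffle: one input can produce several spill events, so
      // comparing pathToIdentifierMap.size() against numInputs is wrong here
      return pipelinedShuffleInfoEventsMap.size() == numInputs;
    }
    // normal case: empty partitions are skipped and never enter
    // pathToIdentifierMap, so they have to be added back before comparing
    return pathToIdentifierMap.size() + skippedInputCounter.get() == numInputs;
  }
}
{code}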

- I believe this condition is meant to measure pendingHosts instead of 
remainingMaps.get() (considering parallel instances of a host are never 
scheduled). However, remainingMaps works better because of the way the Fetcher 
reports failures. If this is correct, could you please add a comment along 
with the condition? The Fetcher's failure reporting needs to be looked at 
along with any modifications to this code.
-- Fixed.

- Same condition - do we need to special-case only a single input remaining?
- healthySinceLastUpdate - This will almost always be true, given the previous 
condition.
- Should failedAcrossNodes be outside of the condition "failedShuffle > 
remaining * allowed"?
-- The failedAcrossNodes check has been moved up in the chain. With this, we 
do not need to special-case a single remaining input. Also set the 
minFailurePerHost limit to 4. A couple of cases here (see the sketch after 
this list).
--- 1) numInputs = 1 & errors happen. If so, it would be handled by 
hasFailedAcrossNodes() and fetching would be declared unhealthy. Along with 
this, the stall duration is included to kill the consumer.
--- 2) numInputs > 1, but the remaining item to be fetched is 1 (e.g. 
319/320). If lots of errors were seen, MAX_ALLOWED_FAILED_FETCH_ATTEMPT_PERCENT 
(maxAllowedFailedFetchFraction) would be triggered and fetching would be 
declared unhealthy. Otherwise, it would enter the 
"(failedShufflesSinceLastCompletion >= remainingMaps.get() * 
minFailurePerHost)" condition to mark fetching unhealthy. This is expected 
behavior when 319/320 are complete. Also, it would only enter this condition 
after trying multiple times, so it is safe to mark fetching as unhealthy. 
Corner case: if lots of errors are NOT seen, and the number of errors seen 
since the last update crosses minFailurePerHost * remaining, "fetching" would 
be marked as unhealthy. 4 * 180 = 720 seconds should be good for this, as the 
AM would have already restarted the producer after the 300-second deadline.
--- 3) numInputs > 1, but the remaining items to be fetched are > 1 (e.g. 
300/320). Same as the previous case, except that it would wait for some more 
retries before declaring "fetching" unhealthy.
--- 4) Corner case: numInputs > 1, remaining items to be fetched are > 1 (e.g. 
300/320), but all items are present on a single source. In this case it would 
take a little longer for "fetching" to be declared unhealthy. The node-failure 
checks would not trigger in this scenario, and error detection is left to the 
condition above.
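For reference, a hedged sketch of the check ordering the four cases above walk 
through (all identifiers are illustrative stand-ins, not the patch's 
ShuffleScheduler code):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: shows the ordering of checks, not the actual patch logic.
final class FetcherHealthSketch {
  static final int DISABLED = -1;

  int hostFailureThreshold;            // -1 disables the node-failure check
  float maxAllowedFailedFetchFraction; // overall failure-fraction cutoff
  int minFailurePerHost = 4;           // per the discussion above
  int failedFetches;
  int totalFetches;
  int failedShufflesSinceLastCompletion;
  AtomicInteger remainingMaps = new AtomicInteger();

  boolean hasFailedAcrossNodes() {
    // placeholder: true when distinct failing hosts cross hostFailureThreshold
    return false;
  }

  boolean isFetcherHealthy() {
    if (hostFailureThreshold == DISABLED) {
      return true;                     // check disabled via config (-1)
    }
    // moved up in the chain, so numInputs == 1 (case 1) needs no special case
    if (hasFailedAcrossNodes()) {
      return false;
    }
    // case 2: lots of errors overall, e.g. stuck at 319/320
    if (failedFetches > totalFetches * maxAllowedFailedFetchFraction) {
      return false;
    }
    // cases 2-4: repeated failures since the last completed fetch; with
    // minFailurePerHost = 4 this tolerates roughly 4 * 180 = 720 seconds of
    // retries per remaining input before giving up
    return failedShufflesSinceLastCompletion
        < remainingMaps.get() * minFailurePerHost;
  }
}
{code}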


- Config param for abortFailureLimit? (-1 defaults to the in-code value, 
anything else is an absolute value). Reducing it to 15 only helps smaller 
jobs; numInputs/10 will take over otherwise.
-- Added TEZ_RUNTIME_SHUFFLE_SOURCE_ATTEMPT_ABORT_LIMIT. Setting it to -1 
defaults to the in-code value.
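An illustrative resolution of that config (the floor value of 15 follows the 
comment above; the exact in-code default may differ):

{code:java}
// Sketch only: how a -1 sentinel could fall back to the in-code default.
final class AbortLimitSketch {
  static int resolveAbortFailureLimit(int configuredLimit, int numInputs) {
    if (configuredLimit != -1) {
      return configuredLimit;      // absolute value supplied via config
    }
    // in-code default: numInputs / 10 dominates for large jobs, so lowering
    // only the fixed floor helps small jobs
    return Math.max(15, numInputs / 10);
  }
}
{code}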

- Nit: "Preconditions.checkArgument(uniqueHosts.size() > 0, "No values in 
unique hosts");" - can be after numUniqueHosts
-- Fixed

- Nit: There's a TODO which needs to be removed.
-- Fixed.

- Renamed hasIndividualAttemptExceeded to 
"isAbortLimitExceeedFor(InputAttemptIdentifier srcAttempt)". Added an error 
msg and removed fillInStackTrace (unrelated to this).

- Added configs for TEZ_RUNTIME_SHUFFLE_MAX_ALLOWED_FAILED_FETCH_ATTEMPT_FRACTION, 
TEZ_RUNTIME_SHUFFLE_MIN_REQUIRED_PROGRESS_FRACTION, 
TEZ_RUNTIME_SHUFFLE_MAX_STALL_TIME_FRACTION, 
TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST, 
TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION, 
TEZ_RUNTIME_SHUFFLE_SOURCE_ATTEMPT_ABORT_LIMIT.
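A hedged usage example (assuming these constants land as String keys on 
TezRuntimeConfiguration; every value below except minFailurePerHost = 4 is 
illustrative, not a patch default):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.runtime.library.api.TezRuntimeConfiguration;

// Sketch only: constant names follow the list above; values are guesses.
final class ShuffleFailureTuning {
  static Configuration tuned() {
    Configuration conf = new Configuration();
    conf.setFloat(
        TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MAX_ALLOWED_FAILED_FETCH_ATTEMPT_FRACTION,
        0.5f);                       // illustrative value
    conf.setInt(
        TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST,
        4);                          // matches the limit discussed above
    conf.setInt(
        TezRuntimeConfiguration.TEZ_RUNTIME_SHUFFLE_SOURCE_ATTEMPT_ABORT_LIMIT,
        -1);                         // -1 = keep the in-code default
    return conf;
  }
}
{code}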

- Renamed shuffleInfoEventsMap to pipelinedShuffleInfoEventsMap.

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch, 
> TEZ-2882.4.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)