[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17763701#comment-17763701
 ] 

ASF GitHub Bot commented on MAPREDUCE-7445:
-------------------------------------------

teamconfx opened a new pull request, #6051:
URL: https://github.com/apache/hadoop/pull/6051

   ### Description of PR
   https://issues.apache.org/jira/browse/MAPREDUCE-7445
   This PR adds a check for `mapreduce.reduce.shuffle.maxfetchfailures` being 0, 
so that a user who wants an error report sent every time an error occurs no 
longer triggers a division by zero.
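
A minimal sketch of the kind of guard described above, not the actual patch from this PR. The class and method names are hypothetical, and treating a threshold of 0 as "report on every failure" is an assumption taken from the description:

```java
// Hypothetical sketch; the real change in ShuffleSchedulerImpl may look different.
final class FetchFailureReportSketch {

    /**
     * Returns true when the failure count alone warrants a report.
     * Assumption: a threshold of 0 means "report on every failure".
     */
    static boolean reachedReportingThreshold(int failures,
                                             int maxFetchFailuresBeforeReporting) {
        if (maxFetchFailuresBeforeReporting == 0) {
            // Skip the modulo entirely so no division by zero can occur.
            return true;
        }
        return failures % maxFetchFailuresBeforeReporting == 0;
    }

    public static void main(String[] args) {
        System.out.println(reachedReportingThreshold(3, 0));   // true
        System.out.println(reachedReportingThreshold(3, 10));  // false
        System.out.println(reachedReportingThreshold(20, 10)); // true
    }
}
```

Checking the threshold before the modulo keeps the existing behaviour for any non-zero value and only changes the case that previously threw.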
   
   ### How was this patch tested?
   (1) set `mapreduce.reduce.shuffle.maxfetchfailures=0, 
mapreduce.reduce.shuffle.notify.readerror=false`
   (2) run 
`org.apache.hadoop.mapreduce.task.reduce.TestShuffleScheduler#TestSucceedAndFailedCopyMap`
   The test passes rather than throwing `ArithmeticException`.
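
For context, the `ArithmeticException` comes from Java's integer remainder operator, which throws when the divisor is 0. The standalone sketch below shows that behaviour in isolation; the class and method names are hypothetical and it does not use any Hadoop code:

```java
// Hypothetical standalone demo of integer remainder by zero in Java.
final class ModuloByZeroDemo {

    /** Returns true if evaluating failures % divisor throws ArithmeticException. */
    static boolean remainderThrows(int failures, int divisor) {
        try {
            int unused = failures % divisor;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(remainderThrows(1, 0)); // true
        System.out.println(remainderThrows(1, 5)); // false
    }
}
```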
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?




> ShuffleSchedulerImpl causes ArithmeticException due to improper 
> detailsInterval value checking
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7445
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7445
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 3.3.3
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
>
> h2. What happened
> The parameter {{mapreduce.reduce.shuffle.maxfetchfailures}} is not 
> validated. An invalid value can cause incorrect calculations or crash the 
> system, for example through a division by zero.
> h2. Buggy code
> In {{ShuffleSchedulerImpl.java}}, {{maxFetchFailuresBeforeReporting}} is not 
> validated before it is used directly in method 
> {{checkAndInformMRAppMaster}}. When {{maxFetchFailuresBeforeReporting}} is 
> set to 0, the modulo operation divides by zero and throws an 
> ArithmeticException, crashing the system.
>  
> {noformat}
> private void checkAndInformMRAppMaster(
>      ...
>     if (connectExcpt || (reportReadErrorImmediately && readError)
>         || ((failures % maxFetchFailuresBeforeReporting) == 0) || hostFailed) {
>       ...
>   }{noformat}
> h2. How to reproduce
> (1) set {{{}mapreduce.reduce.shuffle.maxfetchfailures{}}}={{{}0{}}}, 
> {{{}mapreduce.reduce.shuffle.notify.readerror{}}}={{{}false{}}}
> (2) run {{mvn surefire:test 
> -Dtest=org.apache.hadoop.mapreduce.task.reduce.TestShuffleScheduler#TestSucceedAndFailedCopyMap}}
> h2. Stacktrace
> {noformat}
> java.lang.ArithmeticException: / by zero
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkAndInformMRAppMaster(ShuffleSchedulerImpl.java:347)
>     at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:308)
>     at org.apache.hadoop.mapreduce.task.reduce.TestShuffleScheduler.TestSucceedAndFailedCopyMap(TestShuffleScheduler.java:285){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
