[ https://issues.apache.org/jira/browse/MAPREDUCE-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Hsu updated MAPREDUCE-7131:
-----------------------------------
    Status: Open  (was: Patch Available)

> Job History Server has race condition where it moves files from intermediate
> to finished but thinks file is in intermediate
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7131
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7131
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.4
>            Reporter: Anthony Hsu
>            Assignee: Anthony Hsu
>            Priority: Major
>         Attachments: MAPREDUCE-7131.1.patch, MAPREDUCE-7131.2.patch
>
>
> This is the race condition that can occur:
> # during the first *scanIntermediateDirectory()*,
> *HistoryFileInfo.moveToDone()* is scheduled for job j1
> # during the second *scanIntermediateDirectory()*, j1 is found again and put
> in the *fileStatusList* to process
> # *HistoryFileInfo.moveToDone()* is processed in another thread and history
> files are moved to the finished directory
> # the *HistoryFileInfo* for j1 is removed from *jobListCache*
> # the j1 in *fileStatusList* is processed and a new *HistoryFileInfo* for j1
> is created (history, conf, and summary files will point to the intermediate
> user directory, and state will be IN_INTERMEDIATE) and added to the
> *jobListCache*
> # *moveToDone()* is scheduled for this new j1
> # *moveToDone()* fails during *moveToDoneNow()* for the history file because
> the source path in the intermediate directory does not exist
> From this point on, while the new j1 *HistoryFileInfo* is in the
> *jobListCache*, the JobHistoryServer will think the history file is in the
> intermediate directory.
> If a user queries this job in the JobHistoryServer UI, they will get
> {code}
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Could not load
> history file
> <scheme>://<host>:<port>/mr-history/intermediate/<user>/job_1529348381246_27275711-1535123223269-<user>-<jobname>-1535127026668-1-0-SUCCEEDED-<queue>-1535126980787.jhist
> {code}
> Noticed this issue while running 2.7.4, but the race condition seems to still
> exist in trunk.
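
The seven steps in the description can be replayed in a toy, single-threaded model. The sketch below is not Hadoop code and is not taken from the attached patches: the class HistoryRaceSketch, its fields, and the recheckExists flag are hypothetical names chosen for illustration, and the file system is simulated with in-memory sets. It only shows how a stale scan snapshot can put j1 back into the cache as IN_INTERMEDIATE after the files have moved, and how re-checking the source path before creating the new entry would avoid that in this model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy, single-threaded model of the reported race; all names are hypothetical. */
final class HistoryRaceSketch {
    enum State { IN_INTERMEDIATE, IN_DONE }

    // Simulated file system: job ids whose history files sit in each directory.
    private final Set<String> intermediateDir = new HashSet<>();
    private final Set<String> doneDir = new HashSet<>();
    // Stand-in for jobListCache: job id -> state the server believes.
    private final Map<String, State> jobListCache = new HashMap<>();

    /** Snapshot of the intermediate directory (plays the role of fileStatusList). */
    List<String> scan() {
        return new ArrayList<>(intermediateDir);
    }

    /** Adds uncached jobs from a snapshot; optionally re-checks that the file still exists. */
    void process(List<String> fileStatusList, boolean recheckExists) {
        for (String job : fileStatusList) {
            if (jobListCache.containsKey(job)) {
                continue; // already tracked
            }
            if (recheckExists && !intermediateDir.contains(job)) {
                continue; // snapshot is stale: the file has already been moved
            }
            jobListCache.put(job, State.IN_INTERMEDIATE);
        }
    }

    /** Returns false when the source is no longer in the intermediate directory. */
    boolean moveToDone(String job) {
        if (!intermediateDir.remove(job)) {
            return false;
        }
        doneDir.add(job);
        jobListCache.replace(job, State.IN_DONE); // only updates an existing entry
        return true;
    }

    /** Replays steps 1-7 in order; returns true if the cache ends up with a stale entry. */
    static boolean replay(boolean recheckExists) {
        HistoryRaceSketch s = new HistoryRaceSketch();
        s.intermediateDir.add("j1");

        s.process(s.scan(), recheckExists);    // 1: first scan caches j1, move is pending
        List<String> secondScan = s.scan();    // 2: second scan still sees j1
        s.moveToDone("j1");                    // 3: pending move runs
        s.jobListCache.remove("j1");           // 4: j1 is removed from the cache
        s.process(secondScan, recheckExists);  // 5: stale snapshot is processed
        if (s.jobListCache.containsKey("j1")) {
            s.moveToDone("j1");                // 6-7: new move fails, source is gone
        }
        return s.jobListCache.get("j1") == State.IN_INTERMEDIATE
                && !s.intermediateDir.contains("j1");
    }

    public static void main(String[] args) {
        System.out.println("stale entry without re-check: " + replay(false));
        System.out.println("stale entry with re-check:    " + replay(true));
    }
}
```

In this model, replay(false) leaves j1 cached as IN_INTERMEDIATE with no file behind it, which corresponds to the "Could not load history file" error above, while replay(true) leaves the cache without a stale entry. In a real multi-threaded server an existence check alone would still leave a window between the check and the cache insert, so an actual fix would presumably also need to coordinate with the in-flight move.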