Victor Zhang created MAPREDUCE-7233:
---------------------------------------

             Summary: MapReduce Input Path Should Ignore Path Ends With '/*' When Job Submit
                 Key: MAPREDUCE-7233
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7233
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission, performance
    Affects Versions: 2.7.2
            Reporter: Victor Zhang
         Attachments: job submit.jpg

We have a public, shared Hadoop cluster that runs many MR jobs from different departments. I found that job submission is very slow when the job's input path ends with "/*", for example "/my/path/*", while "/my/path" or "/my/path/" works fine.

After reading the code, I think the problem lies in the split calculation.

FileInputFormat#singleThreadedListStatus() first gets an array of FileStatus objects. If the input path ends with "/*", the result contains a FileStatus object for every file and directory under the input path. If the input path does not end with "/*", the result contains only one FileStatus object (the input path itself).

The next step looks up the LocatedFileStatus for each FileStatus object. However, only the directory FileStatus objects go through the LocatedFileStatus lookup (dfs.listPaths(), in batches), so files matched directly by the "/*" pattern remain plain FileStatus objects.

Finally, when the job splits are calculated, for example in FileInputFormat#getSplits(), any FileStatus that is not a LocatedFileStatus object causes a call to fs.getFileBlockLocations() to fetch its block locations. This can lead to a large number of RPC requests when the input path contains many files. CombineFileInputFormat does the same in the constructor of OneFileInfo.

As a result, some jobs take anywhere from a few minutes to hours to submit.

I tried removing the "/*" suffix from the input path before the code that gets the file status, but I have not confirmed whether this causes other problems.
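
For illustration, the workaround mentioned in the report (dropping a trailing "/*" from the input path before the file status lookup) could be sketched as below. This is only a sketch under stated assumptions, not the actual patch: the class name InputPathNormalizer and the method stripTrailingGlob are hypothetical and do not exist in Hadoop, and the code uses plain string handling rather than any Hadoop API. Whether "/my/path/*" and "/my/path" select exactly the same files in every case (for example when the directory contains subdirectories) is not established here.

```java
// Hypothetical helper, not part of Hadoop: normalizes an input path string
// by removing a trailing "/*" so that the directory itself is listed.
final class InputPathNormalizer {

    // The suffix the report identifies as triggering the slow path.
    private static final String GLOB_SUFFIX = "/*";

    private InputPathNormalizer() {
    }

    // Returns the path without a trailing "/*"; other paths are returned unchanged.
    static String stripTrailingGlob(String inputPath) {
        if (inputPath == null || !inputPath.endsWith(GLOB_SUFFIX)) {
            return inputPath;
        }
        String parent = inputPath.substring(0, inputPath.length() - GLOB_SUFFIX.length());
        // "/*" alone means "everything under the root", so keep "/" rather than "".
        return parent.isEmpty() ? "/" : parent;
    }

    public static void main(String[] args) {
        String[] samples = {"/my/path/*", "/my/path", "/my/path/", "/my/*/part", "/*"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + stripTrailingGlob(sample));
        }
    }
}
```

In this sketch only a pattern that ends in exactly "/*" is rewritten; a pattern such as "/my/*/part" is left alone, because the wildcard there is not the last path component.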