Victor Zhang created MAPREDUCE-7233:
---------------------------------------

             Summary: MapReduce Input Path Should Ignore Path Ends With '/*' 
When Job Submit
                 Key: MAPREDUCE-7233
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7233
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: job submission, performance
    Affects Versions: 2.7.2
            Reporter: Victor Zhang
         Attachments: job submit.jpg

We run a public, shared Hadoop cluster that handles a large number of MR jobs 
from different departments.

 

I found that job submission is very slow when the job's input path ends with 
"/*", for example "/my/path/*", whereas "/my/path" and "/my/path/" work fine.

 

After reading the code, I think the problem lies in the split calculation.

 

FileInputFormat#singleThreadedListStatus() first obtains an array of FileStatus 
objects for the input path. If the input path ends with "/*", the array 
contains one FileStatus for every file and directory under the input path. If 
it does not end with "/*", the array contains a single FileStatus object (the 
input path itself).

 

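The next step looks up the LocatedFileStatus for each FileStatus object, but 
only FileStatus objects that are directories are expanded into 
LocatedFileStatus objects (via dfs.listPaths(), in batches). Files matched 
directly by "/*" remain plain FileStatus objects.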

 

Finally, when the job splits are calculated, for example in 
FileInputFormat#getSplits(), any FileStatus that is not a LocatedFileStatus 
has its block locations fetched with fs.getFileBlockLocations(). This can 
generate a large number of RPC requests when the input path contains many 
files. CombineFileInputFormat does the same in the constructor of OneFileInfo.
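
The difference in RPC volume described above can be illustrated with a rough 
toy model. This is only a sketch, not Hadoop code: the class SubmitRpcModel, 
its method names, and the batch size of 1000 entries per listing call are all 
assumptions made for illustration, and the model ignores every RPC other than 
listing calls and per-file block location lookups.

```java
/** Toy model only; none of these names are Hadoop classes or methods. */
final class SubmitRpcModel {
    /** Assumed number of entries returned per batched listing call. */
    static final int BATCH = 1000;

    /** Listing calls needed to enumerate the given number of entries. */
    static long listingRpcs(int entries) {
        return Math.max(1, (entries + BATCH - 1) / BATCH);
    }

    /**
     * Input like "/my/path": the directory is expanded by a batched located
     * listing, so block locations arrive with the listing itself.
     */
    static long rpcsForDirectoryInput(int files) {
        return listingRpcs(files);
    }

    /**
     * Input like "/my/path/*": the files come back as plain statuses, so
     * split calculation issues one block location lookup per file.
     */
    static long rpcsForGlobInput(int files) {
        return listingRpcs(files) + files;
    }
}
```

Under these assumptions, a directory with 100,000 files costs about 100 calls 
when given as "/my/path" and about 100,100 calls when given as "/my/path/*".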

 

As a result, some jobs take minutes or even hours to submit.

 

I tried stripping the "/*" suffix from the input path before the code that 
fetches the file status, but I have not confirmed whether this causes other 
problems.
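
A minimal sketch of the suffix removal, assuming the input path is available 
as a plain string before it is handed to the listing code. The class 
InputPathNormalizer and the method stripTrailingGlob are hypothetical names, 
not part of Hadoop, and this is not the actual patch.

```java
/** Hypothetical helper; not part of Hadoop. */
final class InputPathNormalizer {
    /**
     * Returns the path without a trailing "/*", so that "/my/path/*" is
     * treated like "/my/path". Other paths are returned unchanged.
     * Only the bare root pattern "/*" is special-cased; scheme-qualified
     * roots are not handled here.
     */
    static String stripTrailingGlob(String inputPath) {
        if (inputPath.equals("/*")) {
            return "/"; // avoid producing an empty path for the root
        }
        if (inputPath.endsWith("/*")) {
            return inputPath.substring(0, inputPath.length() - 2);
        }
        return inputPath;
    }
}
```

One possible source of "other problems": the two forms may not be equivalent 
when the directory contains subdirectories. With "/my/path/*" the pattern also 
matches the subdirectories, whose contents are then listed, while with 
"/my/path" the subdirectories are not expanded unless recursive input is 
enabled.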


