[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated MAPREDUCE-2349:
--------------------------------------

    Attachment: MAPREDUCE-2349.5.txt

Thanks for the review. Updated patch attached.

bq. Add the configs to mapred-default.xml as documentation?
Done.
bq. LIST_STATUS_NUM_THREADS_DEFAULT -> DEFAULT_LIST_STATUS_NUM_THREADS
Done.
bq. oldListStatus() -> singleThreadedListStatus()
Done.
bq. Can you add a bit of javadoc to all the new classes and methods in 
LocatedFileStatusFetcher? Also to the main LocatedFileStatusFetcher class 
itself.
Done.
bq. Synchronization needed for ProcessInitialInputPathResult.addError()?
Not required. It's local to the specific instance.
bq. Can you group the callable, result and call-back for each type of operation 
together in two classes?
Moved the Result into the callable. The CallbackHandler is non-static, so 
moving it would require a fair amount of change and additional parameters. 
I have left that as is.
bq. The 'result' variable doesn't need to be a class field of 
ProcessInputDirCallable. Similarly the one in ProcessInitialInputPathCallable.
Made this local to the method.

Also fixed a typo in one of the log messages.

> speed up list[located]status calls from input formats
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-2349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2349
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>            Reporter: Joydeep Sen Sarma
>            Assignee: Siddharth Seth
>         Attachments: MAPREDUCE-2349.1.wip.txt, MAPREDUCE-2349.2.txt, 
> MAPREDUCE-2349.3.txt, MAPREDUCE-2349.4.txt, MAPREDUCE-2349.5.txt
>
>
> When a job has many input paths, the listStatus calls (or the improved 
> listLocatedStatus calls) invoked from the getSplits() method can take a 
> long time. Most of the time is spent waiting for the previous call to 
> complete and then dispatching the next call. 
> This can be sped up greatly by dispatching multiple calls at once (via 
> executors). If the same filesystem client is used, the calls are much 
> better pipelined (since calls are serialized) and don't impose extra burden 
> on the namenode, while at the same time greatly reducing the latency to the 
> client. In a simple test during off-peak hours, this reduced the getSplits() 
> time from about 3s to about 0.5s.
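
For illustration only, the idea in the description (dispatching several listing calls at once through an executor instead of waiting for each one in turn) can be sketched roughly as below. This is not the code from the attached patch and does not use the Hadoop FileSystem API: it lists local directories with java.nio as a stand-in for listStatus, and the names ParallelListing, listOne and listAll are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListing {

    /** Lists the direct children of one directory (local stand-in for a listStatus call). */
    static List<Path> listOne(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.sorted().collect(Collectors.toList());
        }
    }

    /** Submits one listing task per input directory, then gathers results in input order. */
    static List<Path> listAll(List<Path> dirs, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Dispatch every call up front so they can proceed concurrently.
            List<Future<List<Path>>> pending = new ArrayList<>();
            for (Path dir : dirs) {
                pending.add(pool.submit(() -> listOne(dir)));
            }
            // Each result list is local to its task, so no shared state is mutated.
            List<Path> result = new ArrayList<>();
            for (Future<List<Path>> f : pending) {
                result.addAll(f.get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build four temporary input directories with two files each.
        Path root = Files.createTempDirectory("inputs");
        List<Path> dirs = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Path dir = Files.createDirectory(root.resolve("dir" + i));
            Files.createFile(dir.resolve("part-0"));
            Files.createFile(dir.resolve("part-1"));
            dirs.add(dir);
        }
        List<Path> files = listAll(dirs, 4);
        System.out.println(files.size() + " files listed"); // prints "8 files listed"
    }
}
```

With numThreads set to 1 the sketch degenerates to the sequential behaviour, which makes it easy to compare the two modes on the same inputs.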



--
This message was sent by Atlassian JIRA
(v6.2#6252)
