speed up list[located]status calls from input formats
-----------------------------------------------------
Key: MAPREDUCE-2349
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2349
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Reporter: Joydeep Sen Sarma
When a job has many input paths, the listStatus calls (or the improved
listLocatedStatus calls) invoked from the getSplits() method can take a long
time. Most of that time is spent waiting for the previous call to complete
before dispatching the next one.
This can be sped up considerably by dispatching multiple calls at once (via
executors). If the same filesystem client is used, the calls are pipelined much
better (since calls are serialized), so they impose no extra burden on the
namenode while greatly reducing the latency seen by the client.
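
As a rough illustration of dispatching the listing calls through an executor,
here is a minimal sketch. It is not the patch for this issue: the class name
ParallelListing, the method listAll, and the lister parameter are hypothetical
names chosen for the example. The lister stands in for a call such as
listStatus made on one shared filesystem client; a real Hadoop call throws
IOException, which would need to be wrapped to fit this signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: lists many input paths concurrently instead of one by one.
final class ParallelListing {

    // 'lister' is an assumed stand-in for a per-path listing call
    // (for example, listStatus on a single shared filesystem client).
    static <P, S> List<List<S>> listAll(List<P> paths,
                                        Function<P, List<S>> lister,
                                        int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per input path, so calls are issued without waiting
            // for the previous one to return.
            List<Callable<List<S>>> tasks = new ArrayList<>();
            for (P path : paths) {
                tasks.add(() -> lister.apply(path));
            }

            // invokeAll blocks until every task has finished and returns the
            // futures in the same order as the submitted tasks.
            List<List<S>> results = new ArrayList<>();
            for (Future<List<S>> future : pool.invokeAll(tasks)) {
                results.add(future.get()); // rethrows a task failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Results come back in the same order as the input paths, so a caller in
getSplits() could consume them exactly as it would the results of a sequential
loop. The thread count is a tuning knob assumed for this sketch, not a value
taken from the issue.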
In a simple test during off-peak hours, this reduced the getSplits() time
from about 3s to about 0.5s.