[ https://issues.apache.org/jira/browse/HADOOP-4888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas updated HADOOP-4888:
----------------------------------
Attachment: 4888-1.patch
@Zheng: You're right, I shouldn't have said "degraded."
@Steve: Thanks for the ivy settings; I hadn't started to consider that, yet.
The goal of this is identical to HADOOP-1338, really. Reimplementing the
connection pooling in Hadoop could offer some advantages (e.g. more granular
progress reporting), but appropriating all the work done in HttpClient seems
like a clear win until that work is completed.
I tried a similar, still-preliminary patch, but with max connections per host
set to 1 and on a job with different parameters:
mapred.reduce.slowstart.completed.maps=1.0, 38272 maps, 448 reducers, and 32MB
(generated) per map on ~300 nodes. Times are measured from the start of the
reduce (after all maps have finished, so stragglers are not a factor) to the
end of the shuffle (avg / std. dev.):
|| Version || 1 || 2 || 3 || 4 || 5 || avg || avg job ||
| r732838 | 786.89 / 45.55 | 842.596 / 70.69 | 1458.75 / 83.88 | 1140.93 / 44.22 | 1294.67 / 58.87 | 1104.77 | 2479.8 |
| r732838 + patch | 803.261 / 73.36 | 783.243 / 93.34 | 792.106 / 78.94 | 917.153 / 52.91 | 776.756 / 113.56 | 814.50 | 1955.2 |
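For anyone wanting to sanity-check the table, the following is a minimal sketch
that recomputes the "avg" column from the five per-run averages and expresses
the difference as a relative reduction. The class and method names are
hypothetical and not part of the patch; the numbers are copied from the table,
in whatever unit it reports.

```java
import java.util.Locale;

public class ShuffleTimes {
    // Per-run shuffle averages copied from the table (runs 1-5).
    static final double[] BASELINE = {786.89, 842.596, 1458.75, 1140.93, 1294.67};
    static final double[] PATCHED  = {803.261, 783.243, 792.106, 917.153, 776.756};

    // "avg job" column from the table, baseline and patched.
    static final double BASELINE_JOB = 2479.8;
    static final double PATCHED_JOB  = 1955.2;

    // Arithmetic mean of the per-run averages.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Fractional reduction of `after` relative to `before`.
    static double reduction(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        double base = mean(BASELINE);
        double patched = mean(PATCHED);
        System.out.printf(Locale.ROOT, "shuffle: %.2f -> %.2f (%.1f%% lower)%n",
                base, patched, 100 * reduction(base, patched));
        System.out.printf(Locale.ROOT, "job:     %.1f -> %.1f (%.1f%% lower)%n",
                BASELINE_JOB, PATCHED_JOB, 100 * reduction(BASELINE_JOB, PATCHED_JOB));
    }
}
```

On these figures the shuffle average drops by roughly 26% and the "avg job"
figure by roughly 21%. This is a mean of per-run means over five runs each, so
it says nothing about significance.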
Many of the parameters need to be adjusted. In particular, the timeouts are
worth revisiting, as are the number of connections and threads at the server
and client. Whether the HEAD + GET imposes a measurable penalty may also merit
consideration before this can be committed. However, the preceding demonstrates
that a measurable improvement is possible, and that this part of the pipeline
could be mined for performance improvements.
> Use Apache HttpClient for fetching map outputs
> ----------------------------------------------
>
> Key: HADOOP-4888
> URL: https://issues.apache.org/jira/browse/HADOOP-4888
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
> Assignee: Chris Douglas
> Attachments: 4888-0.patch, 4888-1.patch
>
>
> It's worth experimenting with the
> [HttpClient|http://hc.apache.org/httpclient-3.x/] library to speed up the
> shuffle.
--
This message is automatically generated by JIRA.