[ http://issues.apache.org/jira/browse/HADOOP-195?page=all ]
paul sutter updated HADOOP-195:
-------------------------------
Attachment: MapFileSimulator.java
mapfilesimulator-sort2.txt
mapfilesimulator-big.txt
I don't have a 188 node cluster, so I wrote a simulator to test the impact of
file sizes and buffer sizes on performance for a single node in Owen's test.
The program has a simulated map step, which creates the output files using the
configured buffer size, and a copy step, which copies the files.
The idea is to isolate filesystem/disk performance issues from any interaction
with RPC, TCP, switches, etc.
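For readers without the attachment, the two steps described above can be
sketched roughly as follows. This is not the attached MapFileSimulator.java:
the class name MapCopySketch, the methods mapStep and copyStep, the "part-N"
file names and the 100-byte record are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; names and sizes are hypothetical, not taken
// from the attached MapFileSimulator.java.
class MapCopySketch {

    // Simulated map step: write fileCount files of fileSize bytes each,
    // pushing small records through a buffer of bufferSize bytes.
    static long mapStep(Path dir, int fileCount, int fileSize, int bufferSize)
            throws IOException {
        byte[] record = new byte[100]; // assumed record size
        long written = 0;
        for (int i = 0; i < fileCount; i++) {
            Path file = dir.resolve("part-" + i);
            try (OutputStream out = new BufferedOutputStream(
                    Files.newOutputStream(file), bufferSize)) {
                int remaining = fileSize;
                while (remaining > 0) {
                    int n = Math.min(record.length, remaining);
                    out.write(record, 0, n);
                    remaining -= n;
                    written += n;
                }
            }
        }
        return written;
    }

    // Simulated copy step: read each file back and write it to dstDir,
    // using the same buffer size for reads and writes.
    static long copyStep(Path srcDir, Path dstDir, int fileCount, int bufferSize)
            throws IOException {
        byte[] chunk = new byte[bufferSize];
        long copied = 0;
        for (int i = 0; i < fileCount; i++) {
            try (InputStream in = new BufferedInputStream(
                         Files.newInputStream(srcDir.resolve("part-" + i)), bufferSize);
                 OutputStream out = new BufferedOutputStream(
                         Files.newOutputStream(dstDir.resolve("part-" + i)), bufferSize)) {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                    copied += n;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("map-out");
        Path dst = Files.createTempDirectory("copy-out");
        int files = 8, size = 64 * 1024, buffer = 4 * 1024; // toy sizes
        long t0 = System.nanoTime();
        long written = mapStep(src, files, size, buffer);
        long t1 = System.nanoTime();
        long copied = copyStep(src, dst, files, buffer);
        long t2 = System.nanoTime();
        System.out.printf("map:  %d bytes in %d ms%n", written, (t1 - t0) / 1_000_000);
        System.out.printf("copy: %d bytes in %d ms%n", copied, (t2 - t1) / 1_000_000);
    }
}
```

With toy sizes like these everything stays in the page cache, so the timings
only become meaningful once the data written is well beyond the node's RAM.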
Results show a roughly 8X speedup with larger files and buffers (about 7.6X on
the map phase and 8.9X on the copy phase):
Configuration "sort2":
(32MB DFS blocks, 4KB buffer, 320 mappers/node, 356 reducers total, 10GB total
data):
- map phase: 48 minutes and 45 seconds
- copy phase: 70 minutes and 38 seconds
Configuration "big":
(1GB DFS blocks, 1MB buffer, 10 mappers/node, 356 reducers total, 10GB total
data):
- map phase: 6 minutes and 24 seconds
- copy phase: 7 minutes and 56 seconds
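The arithmetic behind the comparison can be checked with a short sketch. The
timings are the ones reported above; the class SpeedupSketch and its helpers
are hypothetical, and the file count assumes one map output file per
(mapper, reducer) pair, which the numbers above suggest but do not state.

```java
// Illustrative arithmetic only; class and method names are hypothetical.
class SpeedupSketch {

    // Convert a "M minutes and S seconds" timing to seconds.
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    // Ratio of the slow run's time to the fast run's time.
    static double speedup(int slowSeconds, int fastSeconds) {
        return (double) slowSeconds / fastSeconds;
    }

    // ASSUMPTION: each mapper writes one output file per reducer.
    static long filesPerNode(int mappersPerNode, int reducers) {
        return (long) mappersPerNode * reducers;
    }

    public static void main(String[] args) {
        int sort2Map = seconds(48, 45), sort2Copy = seconds(70, 38);
        int bigMap = seconds(6, 24), bigCopy = seconds(7, 56);

        System.out.printf("map speedup:   %.1fx%n", speedup(sort2Map, bigMap));
        System.out.printf("copy speedup:  %.1fx%n", speedup(sort2Copy, bigCopy));
        System.out.printf("total speedup: %.1fx%n",
                speedup(sort2Map + sort2Copy, bigMap + bigCopy));

        long totalBytes = 10L * 1024 * 1024 * 1024; // 10GB, taken as binary GB
        long sort2Files = filesPerNode(320, 356);
        long bigFiles = filesPerNode(10, 356);
        System.out.printf("sort2: %d files, avg %d bytes%n",
                sort2Files, totalBytes / sort2Files);
        System.out.printf("big:   %d files, avg %d bytes%n",
                bigFiles, totalBytes / bigFiles);
    }
}
```

Under that assumption, "sort2" writes files on the order of 90KB each while
"big" writes files of about 3MB each.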
That final copy phase was running at only 30MB/sec, so it should be easy to
move that across the network if those 188 nodes were on one big switch.
This is about half the speed of the bare drive, so another 2X improvement is
possible while still fitting within gigabit network limitations.
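The gigabit claim is simple unit conversion, sketched below. The class
LinkBudgetSketch and its methods are hypothetical, and the sketch ignores
protocol overhead, so the real usable bandwidth would be somewhat lower than
the nominal 1000 Mbit/sec assumed here.

```java
// Illustrative arithmetic only; names are hypothetical and protocol
// overhead (TCP/IP headers, HTTP framing) is deliberately ignored.
class LinkBudgetSketch {

    // Convert megabytes per second to megabits per second.
    static double megabitsPerSecond(double megabytesPerSecond) {
        return megabytesPerSecond * 8.0;
    }

    // True if the given per-node rate fits within the nominal link rate.
    static boolean fitsOnLink(double megabytesPerSecond, double linkMegabits) {
        return megabitsPerSecond(megabytesPerSecond) <= linkMegabits;
    }

    public static void main(String[] args) {
        double gigabit = 1000.0; // nominal link rate in Mbit/sec
        double measured = 30.0;  // copy phase rate reported above, MB/sec
        double doubled = 2 * measured;

        System.out.printf("%.0f MB/sec = %.0f Mbit/sec, fits: %b%n",
                measured, megabitsPerSecond(measured), fitsOnLink(measured, gigabit));
        System.out.printf("%.0f MB/sec = %.0f Mbit/sec, fits: %b%n",
                doubled, megabitsPerSecond(doubled), fitsOnLink(doubled, gigabit));
    }
}
```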
Like Owen's sort test, this test generates 10GB of data per node, and the node
I'm using has 4GB of RAM. The program is attached, along with output from a run
using Owen's last configuration and a run with larger files and buffers.
The program also has a configuration called "sort1" that matches the original
configuration, but it would take too long to run, so I didn't run it.
> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
> Key: HADOOP-195
> URL: http://issues.apache.org/jira/browse/HADOOP-195
> Project: Hadoop
> Type: Improvement
> Components: mapred
> Versions: 0.2
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.3
> Attachments: MapFileSimulator.java, data-transfer-chart.pdf,
> mapfilesimulator-big.txt, mapfilesimulator-sort2.txt, netstat.log, netstat.xls
>
> The data transfer of the map output should be transferred via http instead
> of rpc, because rpc is very slow for this application and the timeout
> behavior is suboptimal. (server sends data and client ignores it because it
> took more than 10 seconds to be received.)