[ http://issues.apache.org/jira/browse/HADOOP-195?page=all ]

paul sutter updated HADOOP-195:
-------------------------------

    Attachment: MapFileSimulator.java
                mapfilesimulator-sort2.txt
                mapfilesimulator-big.txt


I don't have a 188-node cluster, so I wrote a simulator to test the impact of 
file sizes and buffer sizes on performance for a single node in Owen's test. 
The program has a simulated map step, which creates the output files using the 
configured buffer size, and a copy step, which copies the files. 
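
For readers without the attachment, the two steps can be sketched roughly as 
below. This is a minimal illustration, not the attached MapFileSimulator.java: 
the class name BufferSizeSketch, the methods writeFiles and copyFiles, the 
"part-N" file naming, and the 100-byte dummy record are all assumptions made 
for the example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a map-then-copy disk test; not the attached program.
final class BufferSizeSketch {

    // Simulated map step: write numFiles files of bytesPerFile bytes each,
    // pushing small records through a buffer of bufferSize bytes.
    // Returns elapsed time in nanoseconds.
    static long writeFiles(Path dir, int numFiles, long bytesPerFile, int bufferSize)
            throws IOException {
        byte[] record = new byte[100]; // assumed dummy record size
        long start = System.nanoTime();
        for (int i = 0; i < numFiles; i++) {
            try (OutputStream out = new BufferedOutputStream(
                    Files.newOutputStream(dir.resolve("part-" + i)), bufferSize)) {
                long written = 0;
                while (written < bytesPerFile) {
                    int n = (int) Math.min(record.length, bytesPerFile - written);
                    out.write(record, 0, n);
                    written += n;
                }
            }
        }
        return System.nanoTime() - start;
    }

    // Simulated copy step: copy each file using reads and writes of
    // bufferSize bytes. Returns elapsed time in nanoseconds.
    static long copyFiles(Path srcDir, Path dstDir, int numFiles, int bufferSize)
            throws IOException {
        byte[] chunk = new byte[bufferSize];
        long start = System.nanoTime();
        for (int i = 0; i < numFiles; i++) {
            try (InputStream in = Files.newInputStream(srcDir.resolve("part-" + i));
                 OutputStream out = Files.newOutputStream(dstDir.resolve("part-" + i))) {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    out.write(chunk, 0, n);
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempDirectory("sim-map");
        Path dst = Files.createTempDirectory("sim-copy");
        // Tiny sizes so the sketch finishes quickly; the real runs used 10GB.
        long mapNanos = writeFiles(src, 8, 64 * 1024, 4 * 1024);
        long copyNanos = copyFiles(src, dst, 8, 4 * 1024);
        System.out.printf("map: %d ms, copy: %d ms%n",
                mapNanos / 1_000_000, copyNanos / 1_000_000);
    }
}
```

With sizes this small the operating system's page cache will hide most disk 
effects, so a meaningful run needs far more data than the node has RAM.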

The idea is to isolate filesystem/disk performance issues from any interaction 
with RPC, TCP, switches, etc.

Results show an 8-10X speedup with larger files and buffers:

Configuration "sort2":
(32MB DFS blocks, 4KB buffer, 320 mappers/node, 356 reducers total, 10GB total 
data): 
- map phase: 48 minutes and 45 seconds
- copy phase: 70 minutes and 38 seconds

Configuration "big":
(1GB DFS blocks, 1MB buffer, 10 mappers/node, 356 reducers total, 10GB total 
data):
- map phase: 6 minutes and 24 seconds
- copy phase: 7 minutes and 56 seconds
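
As a quick check on the speedup figure, the per-phase ratios can be worked out 
directly from the timings above. The class and method names here are made up 
for the illustration. The map phase comes out at about 7.6X and the copy phase 
at about 8.9X, so the low end sits slightly under the quoted 8-10X.

```java
// Hypothetical helper for comparing the "sort2" and "big" timings above.
final class PhaseSpeedup {

    // Convert a "M minutes and S seconds" timing to seconds.
    static int toSeconds(int minutes, int seconds) {
        return minutes * 60 + seconds;
    }

    // Ratio of the slow run's time to the fast run's time.
    static double speedup(int slowSeconds, int fastSeconds) {
        return (double) slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        int sort2Map = toSeconds(48, 45);  // 2925 s
        int sort2Copy = toSeconds(70, 38); // 4238 s
        int bigMap = toSeconds(6, 24);     // 384 s
        int bigCopy = toSeconds(7, 56);    // 476 s

        System.out.printf("map speedup:  %.2fX%n", speedup(sort2Map, bigMap));   // 7.62X
        System.out.printf("copy speedup: %.2fX%n", speedup(sort2Copy, bigCopy)); // 8.90X
    }
}
```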

That final copy phase was only running at 30MB/sec, so it should be easy to 
move that across the network if those 188 nodes were on one big switch. 
That is only about half the speed of the bare drive, so there is room for 
another 2X improvement that would still fit within gigabit network 
limitations.
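
The network headroom can be checked with simple arithmetic. The sketch below 
assumes a raw gigabit link rate of 125MB/sec (1000 Mbit/sec divided by 8) and 
ignores protocol overhead, so real usable bandwidth would be somewhat lower. 
The names are invented for the example.

```java
// Hypothetical back-of-the-envelope check of copy throughput vs. a gigabit link.
final class LinkHeadroom {

    // Assumed raw gigabit rate in MB/sec, ignoring Ethernet/TCP overhead.
    static final double GIGABIT_MB_PER_SEC = 1000.0 / 8.0;

    // Fraction of the raw link consumed by a given throughput.
    static double fractionOfLink(double mbPerSec) {
        return mbPerSec / GIGABIT_MB_PER_SEC;
    }

    public static void main(String[] args) {
        double observed = 30.0;          // MB/sec seen in the copy phase
        double doubled = 2.0 * observed; // after the possible 2X improvement

        System.out.printf("observed: %.0f%% of link%n", 100 * fractionOfLink(observed)); // 24%
        System.out.printf("doubled:  %.0f%% of link%n", 100 * fractionOfLink(doubled));  // 48%
    }
}
```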

Like Owen's sort test, this test generates 10GB of data per node, and the node 
I'm using has 4GB of RAM. The program is attached, along with output from a run 
with Owen's last configuration and a run with larger files and buffers. 

The program also has a configuration called "sort1" that matches the original 
configuration, but it would take too long to run, so I didn't run it.

> transfer map output transfer with http instead of rpc
> -----------------------------------------------------
>
>          Key: HADOOP-195
>          URL: http://issues.apache.org/jira/browse/HADOOP-195
>      Project: Hadoop
>         Type: Improvement

>   Components: mapred
>     Versions: 0.2
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.3
>  Attachments: MapFileSimulator.java, data-transfer-chart.pdf, 
> mapfilesimulator-big.txt, mapfilesimulator-sort2.txt, netstat.log, netstat.xls
>
> The map output should be transferred via http instead of rpc, because rpc 
> is very slow for this application and the timeout behavior is suboptimal. 
> (The server sends data and the client ignores it because it took more than 
> 10 seconds to be received.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
