Sorry for the previous post; I sent it before I had finished. Please ignore it.

Hi all,
I've run some experiments with Hadoop on Amazon EC2.
I would like to share the results; any feedback would be appreciated.

Environment:
-Xen VM (Amazon EC2 instance ami-ee53b687)
-1.7GHz Xeon CPU, 1.75GB of RAM, 160GB of local disk, and 250Mb/s of network
bandwidth (small instance)
-Hadoop 0.17.0
-storage: HDFS
-Test example: wordcount

Experiment 1: (fixed # of instances (8), varying data size (2MB~512MB), # of
maps: 8, # of reduces: 8)
Data Size(MB) | Time(s)
512           | 124
256           | 70
128           | 41
...
8             | 22
4             | 17
2             | 21

The purpose is to find the minimum framework overhead for wordcount.
The result: when the data size is between 2MB and 16MB, the time stays
around 20 seconds.
May I conclude that the minimum framework overhead for wordcount is about 20s?

Experiment 2: (varying # of instances (2~32), varying data size (128MB~2GB),
# of maps: (2-32), # of reduces: (2-32))
Data Size(MB) | Map | Reduce | Time(s)
2048          | 32  | 32     | 140
1024          | 16  | 16     | 120
512           | 8   | 8      | 124
256           | 4   | 4      | 127
128           | 2   | 2      | 119

The purpose is to see whether the time stays similar when each instance is
allocated the same amount of data (64MB per instance in every row above).
The result: when the data size is between 128MB and 1024MB, the time is
around 120 seconds.
The time is 140s when the data size is 2048MB. I think the reason is that
processing more data causes more overhead.

Experiment 3: (varying # of instances (2~16), fixed data size (128MB), # of
maps: (2-16), # of reduces: (2-16))
Data Size(MB) | Map | Reduce | Time(s)
128           | 16  | 16     | 31
128           | 8   | 8      | 41
128           | 4   | 4      | 69
128           | 2   | 2      | 119

The purpose is to see how the time changes when more and more instances are
added for a fixed amount of data.
The result: each time the number of instances doubles, the time drops, but
by less than half.
The framework overhead is always there, even with infinitely many instances.
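
To make the "less than half" observation concrete, here is a small sketch
that computes the speedup for each doubling step in Experiment 3. It assumes
the number of instances equals the Map column of the table; the function name
doubling_speedups is just something I made up for illustration.

```python
# Experiment 3 measurements from the table above (128MB of input).
# Assumption: number of instances == the "Map" column.
exp3_times = {2: 119, 4: 69, 8: 41, 16: 31}  # instances -> seconds


def doubling_speedups(times):
    """Return (before, after, speedup) for each consecutive pair of instance counts."""
    counts = sorted(times)
    return [(a, b, times[a] / times[b]) for a, b in zip(counts, counts[1:])]


if __name__ == "__main__":
    for before, after, speedup in doubling_speedups(exp3_times):
        # Perfect scaling would give 2.00x for every doubling.
        print(f"{before:>2} -> {after:>2} instances: {speedup:.2f}x (ideal 2.00x)")
```

Every step comes out below 2x, and the gain shrinks as the fixed overhead
becomes a larger share of the total time.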

In fact, I ran more experiments, but I'm only posting some of the results.
Interestingly, I found a formula for wordcount that fits my results:
Time(s) ~= 20 + ((DataSize - 8MB) * 1.6 / (# of instances))
I've checked the formula against all my experiment results and almost all
of them match.
Maybe it's a coincidence, or I have something wrong.
Anyway, I just wanted to share my experience; any feedback would be
appreciated.
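
For anyone who wants to check the formula, here is a minimal sketch that
evaluates it against the rows posted above. The constants (20s, 8MB, 1.6)
come straight from the formula; the names predicted_time and MEASUREMENTS are
only illustrative, and I again assume the number of instances equals the Map
column in Experiments 2 and 3.

```python
def predicted_time(data_mb, instances, overhead_s=20.0, base_mb=8.0, s_per_mb=1.6):
    """Time(s) ~= 20 + ((DataSize - 8MB) * 1.6 / (# of instances))."""
    return overhead_s + (data_mb - base_mb) * s_per_mb / instances


# (data size in MB, instances, measured seconds), taken from the tables above.
MEASUREMENTS = [
    (512, 8, 124), (256, 8, 70), (128, 8, 41),   # Experiment 1
    (2048, 32, 140), (1024, 16, 120),            # Experiment 2
    (256, 4, 127), (128, 2, 119),                # Experiment 2
    (128, 16, 31), (128, 4, 69),                 # Experiment 3
]


def residuals(rows):
    """Return (data_mb, instances, measured, predicted, measured - predicted) per row."""
    out = []
    for data_mb, instances, measured in rows:
        pred = predicted_time(data_mb, instances)
        out.append((data_mb, instances, measured, pred, measured - pred))
    return out


if __name__ == "__main__":
    print("Data(MB) | Inst | Measured | Predicted | Diff")
    for data_mb, inst, measured, pred, diff in residuals(MEASUREMENTS):
        print(f"{data_mb:<8} | {inst:<4} | {measured:<8} | {pred:<9.1f} | {diff:+.1f}")
```

On these rows the largest gap is the 2048MB / 32-instance run, which is the
same row that stood out in Experiment 2.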

-- 
Best Regards,
Shawn
