Hi,

Thanks in advance for the help!

I have a performance question about how fast I can expect Hadoop
to scale. I'm running Cloudera's Hadoop distribution, 0.18.3-10.

I have a custom binary input format, which is just Google Protocol Buffer
(protobuf) serialized data:

  669 files, ~30GB total (ranging from 10MB to 100MB each).
  128MB block size.
  10 Hadoop nodes.

I tested the RecordReader for my custom InputFormat, and it showed about
56MB/s (single thread, no Hadoop, reading the test file via a plain
FileInputStream instead of an FSDataInputStream) on hardware similar to
what I have in my cluster.
I then tested some simple Map logic on top of the above and got around
54MB/s; I believe the difference is accounted for by parsing the protobuf
data into Java objects.

Anyway, when I put this logic into a job that has
  - no reduce (.setNumReduceTasks(0);)
  - no emit
  - just protobuf parsing calls (like above)

I get a finish time of 10 min 25 sec, which is about 106.24 MB/s.
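
For reference, the job setup is essentially this (old 0.18 mapred API;
ProtobufInputFormat and ParseOnlyMapper stand in for my real classes):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.NullOutputFormat;

  public class ParseOnlyJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ParseOnlyJob.class);
      conf.setJobName("protobuf-parse-only");

      conf.setInputFormat(ProtobufInputFormat.class);  // my custom InputFormat
      FileInputFormat.setInputPaths(conf, new Path(args[0]));

      conf.setMapperClass(ParseOnlyMapper.class);      // parses, emits nothing
      conf.setNumReduceTasks(0);                       // map-only
      conf.setOutputFormat(NullOutputFormat.class);    // nothing to write out

      JobClient.runJob(conf);
    }
  }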

So my question is: why is the rate only about 2x what I see in the
single-threaded, non-Hadoop test? Shouldn't it be closer to:
  54MB/s x 10 nodes = ~540MB/s, minus some small Hadoop overhead?

Is there any area of my configuration I should look into for tuning?

Is there any way I could get more accurate performance monitoring of my job?

On a side note, I tried the same job after combining the files into
about 11 larger files (still ~30GB total), and actually saw a decrease in
performance (~90MB/s).

Any help is appreciated. Thanks!

Will

Some relevant hadoop-site.xml values:
dfs.replication  3
io.file.buffer.size   65536
dfs.datanode.handler.count  3
mapred.tasktracker.map.tasks.maximum  6
dfs.namenode.handler.count  5
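
Each of those is set as a normal <property> entry in hadoop-site.xml, e.g.:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>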
