Hi, Thanks in advance for the help!
I have a performance question about how fast I can expect Hadoop to scale. I'm running Cloudera's 0.18.3-10 distribution.

My input is a custom binary format, which is just Google Protocol Buffer (protobuf) serialized data: 669 files, ~30GB total size (ranging from 10MB to 100MB each), 128MB block size, 10 Hadoop nodes.

I tested the InputFormat and RecordReader for my format on its own (single thread, no Hadoop, reading the test file through a plain FileInputStream rather than an FSDataInputStream) on hardware similar to my cluster nodes, and it showed about 56MB/s. I then added some simple Map logic on top of that and got around 54MB/s; I believe the difference is accounted for by parsing the protobuf data into Java objects.

When I put that same logic into a job that has
- no reduce (.setNumReduceTasks(0))
- no emit
- just the protobuf parsing calls (as above)

I get a finish time of 10min 25sec, which is about 106.24 MB/s. (A simplified sketch of the job setup is at the end of this mail.)

So my question: why is the rate only about 2x what I see in the single-threaded, non-Hadoop test? Shouldn't it be roughly 54MB/s x 10 (number of nodes), minus a small Hadoop overhead? Is there any area of my configuration I should look into for tuning? Is there any way I could get more accurate performance monitoring of my job?

On a side note, I tried the same job after combining the files into about 11 files (still 30GB total) and actually saw a decrease in performance (~90MB/s).

Any help is appreciated. Thanks!

Will

Some hadoop-site.xml values:
- dfs.replication: 3
- io.file.buffer.size: 65536
- dfs.datanode.handler.count: 3
- mapred.tasktracker.map.tasks.maximum: 6
- dfs.namenode.handler.count: 5
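
In case it helps, here is how those same entries look in standard hadoop-site.xml property syntax (only the values listed above, nothing else from my file):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>3</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>5</value>
  </property>
</configuration>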
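
And here is a simplified sketch of how the parse-only job is set up, using the old 0.18 mapred API. ProtobufInputFormat and MyRecord are placeholders for my custom InputFormat and generated protobuf message class, and I've assumed the RecordReader hands the mapper a BytesWritable payload, so treat this as illustrative rather than my exact code:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class ParseOnlyJob {

  // Mapper that parses each protobuf record and emits nothing.
  public static class ParseMapper extends MapReduceBase
      implements Mapper<LongWritable, BytesWritable, NullWritable, NullWritable> {

    public void map(LongWritable key, BytesWritable value,
                    OutputCollector<NullWritable, NullWritable> output,
                    Reporter reporter) throws IOException {
      // Copy out the valid bytes and parse them into a protobuf object.
      // MyRecord is a placeholder for my generated protobuf message class.
      byte[] data = Arrays.copyOf(value.getBytes(), value.getLength());
      MyRecord.parseFrom(data);
      // No output.collect(...) call -- nothing is emitted.
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(ParseOnlyJob.class);
    conf.setJobName("protobuf-parse-only");

    conf.setInputFormat(ProtobufInputFormat.class); // placeholder for my custom InputFormat
    conf.setOutputFormat(NullOutputFormat.class);   // nothing is written to HDFS
    conf.setMapperClass(ParseMapper.class);
    conf.setNumReduceTasks(0);                      // map-only job, no reduce phase

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}

The real job has the same shape: my custom InputFormat, a map() that only does the protobuf parsing, no emits, and zero reducers.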
