Hi,
I have been exploring the feasibility of using Hadoop/HDFS to analyze
terabyte-scale scientific simulation output datasets. After a set of
initial experiments, I have a number of questions regarding (1)
configuration settings and (2) I/O read performance.
----------------------------------------------------------------------
Unlike typical Hadoop applications, post-simulation analysis usually
processes one file at a time. So I wrote a
WholeFileInputFormat/WholeFileRecordReader that reads an entire file
without parsing the content, as suggested by the Hadoop wiki FAQ.
Specifying this as the input format
(conf.setInputFormat(FrameInputFormat.class)), I constructed a simple
MapReduce program whose sole purpose is to measure how fast Hadoop/HDFS
can read data. Here is the gist of the test program (a simplified sketch
of the input format and driver follows the list):
- The map method does nothing; it returns immediately when called
- No reduce tasks (conf.setNumReduceTasks(0))
- JVMs are reused (conf.setNumTasksToExecutePerJvm(-1))
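
For reference, the input format and driver follow the standard whole-file
pattern from the wiki FAQ. Below is a simplified sketch against the old
(0.19) mapred API -- class names, the input path argument, and error
handling are illustrative, not my exact code:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

// Whole-file input format: one unsplit map task per file; the record
// reader hands the entire file to map() as a single BytesWritable.
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;                      // never split a file across map tasks
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final Configuration conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, Configuration conf) {
      this.split = split;
      this.conf = conf;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) return false;
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey()    { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos()     { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }
  }

  // Map does nothing: the HDFS read has already happened in the reader.
  public static class NoOpMapper extends MapReduceBase
      implements Mapper<NullWritable, BytesWritable, NullWritable, NullWritable> {
    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<NullWritable, NullWritable> out,
                    Reporter reporter) {
      // intentionally empty
    }
  }

  // Driver: whole-file input, no reduces, JVM reuse, no output.
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WholeFileInputFormat.class);
    conf.setJobName("hdfs-read-benchmark");
    conf.setInputFormat(WholeFileInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class);   // nothing to write
    conf.setMapperClass(NoOpMapper.class);
    conf.setNumReduceTasks(0);                      // no reduce phase
    conf.setNumTasksToExecutePerJvm(-1);            // reuse task JVMs
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    JobClient.runJob(conf);
  }
}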
The detailed hardware/software configurations are listed below:
Hardware:
- 103 nodes, each with two 2.33GHz quad-core processors and 16 GB memory
- One GigE link from each node to a 1-GigE switch in its rack (3 racks in
total)
- Each rack switch has four 10-GigE uplinks to a full-bandwidth 10-GigE
backbone switch (second-tier switch)
- Software (md) RAID0 on 4 SATA disks, with a capacity of 500 GB per
node
- Raw RAID0 bulk data transfer rate around 200 MB/s (measured by dd
reading a 4 GB file after dropping the Linux VFS cache)
Software:
- 2.6.26-10smp kernel
- Hadoop 0.19.1
- Three nodes serving as the namenode, secondary namenode, and job
tracker, respectively
- The remaining 100 nodes as slaves, each running both a datanode and a
tasktracker
Relevant hadoop-site.xml settings (a short excerpt of the file follows the
list):
- dfs.namenode.handler.count = 100
- io.file.buffer.size = 67108864
- io.bytes.per.checksum = 67108864
- mapred.task.timeout = 1200000
- mapred.map.max.attempts = 8
- mapred.tasktracker.map.tasks.maximum = 8
- dfs.replication = 3
- topology.script.file.name set to a working rack-awareness Python script
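
For completeness, these are ordinary <property> entries in
hadoop-site.xml; an excerpt (the topology script path is a placeholder):

<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>
</property>
<property>
  <name>io.file.buffer.size</name>
  <value>67108864</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>topology.script.file.name</name>
  <value>/path/to/rack-topology.py</value>  <!-- placeholder path -->
</property>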
Dataset characteristics:
- Four datasets consisting of files of 1 GB, 256 MB, 64 MB, and 2 MB,
respectively
- Each dataset holds 1.6 terabytes of data (that is, 1600 1 GB files,
6400 256 MB files, etc.)
- Datasets populated into HDFS via a parallel C MPI program (linked with
libhdfs.so) running on the 100 slave nodes
- dfs block size set to the file size, since otherwise reading a single
file may require network data transfer (see the sketch after this list)
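
Setting the block size per file is just the blocksize argument to the
create call (in libhdfs this is the blocksize parameter of hdfsOpenFile,
if I remember correctly; in Java it is FileSystem.create). A rough Java
equivalent of what the MPI writer does -- path and payload are
placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWholeBlockFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long fileSize = 256L * 1024 * 1024;                  // e.g. the 256 MB dataset
    Path dst = new Path("/datasets/256MB/file_000000");  // placeholder path

    // block size == file size, so each file occupies exactly one HDFS block
    FSDataOutputStream out = fs.create(
        dst,
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // write buffer
        (short) 3,                                 // replication (dfs.replication)
        fileSize);                                 // per-file block size

    byte[] buf = new byte[4 * 1024 * 1024];        // dummy payload
    for (long written = 0; written < fileSize; written += buf.length) {
      out.write(buf);
    }
    out.close();
    fs.close();
  }
}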
I launched the test MapReduce jobs one after another (so there was no
interference between jobs) and collected the following performance
results:
Dataset              Finished in      Failed/Killed   HDFS bytes read      Rack-local  Launched   Data-local
                                      task attempts   (Map = Total)        map tasks   map tasks  map tasks
1 GB file dataset    16 min 11 s      0/382           2,578,054,119,424    98          1,982      1,873
256 MB file dataset  50 min 9 s       0/397           7,754,295,017,472    156         6,797      6,639
64 MB file dataset   4 h 18 min 21 s  394/251         28,712,795,897,856   153         26,245     26,068
The job for the 2MB file dataset failed to run due to the following
error:
09/03/27 21:39:58 INFO mapred.FileInputFormat: Total input paths to process : 819200
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
        at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
        at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:274)
After this error, the job tracker no longer accepted jobs. I stopped and
restarted the job tracker with a larger heap size (8 GB), but it still
did not accept new jobs.
----------------------------------------------------------------------
Questions:
(1) Why is reading 1 GB files significantly faster than reading the
smaller file sizes, even though reading a 256 MB file is just as much a
bulk transfer?
(2) Why is the reported "HDFS bytes read" significantly higher than the
dataset size (1.6 TB)? It is roughly 1.5x the dataset size for the 1 GB
case and well over 16x for the 64 MB case, while the fraction of
failed/killed task attempts is far too small to account for the extra
bytes read.
(3) Roughly what is the maximum number of input paths the job tracker can
handle? (For scientific simulation output datasets, it is quite common to
have hundreds of thousands to millions of files.)
(4) Even for the 1 GB file dataset, and taking into account the
percentage of data-local map tasks (94.5%), the overall end-to-end read
bandwidth (about 1.6 TB in 16 min 11 s, i.e. roughly 1.69 GB/s) is far
below the potential performance of the hardware (about 200 MB/s of local
RAID0 read bandwidth times 100 slave nodes, i.e. roughly 20 GB/s
aggregate). Which settings should I change (either in the test MapReduce
program or in the config files) to obtain better read performance?
Thank you very much.
Tiankai