Thanks for the comments, Matei. The machines I ran the experiments on have 16 GB of memory each, so I don't see how a 64 MB buffer could be considered huge or bad for memory consumption. In fact, I set it to a much larger value after initial rounds of tests showed abysmal results with the default 64 KB buffer. Also, why is it necessary to compute a checksum for every 512 bytes when only an end-to-end (whole-file) checksum makes sense? I set it to a much larger value to avoid that overhead.
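
For concreteness, the two settings in question are set in my hadoop-site.xml along these lines (values in bytes; this is just an excerpt reflecting the settings listed further below):

  <!-- excerpt from the hadoop-site.xml used in the experiments -->
  <property>
    <name>io.file.buffer.size</name>
    <value>67108864</value>       <!-- 64 MB -->
  </property>
  <property>
    <name>io.bytes.per.checksum</name>
    <value>67108864</value>       <!-- 64 MB, i.e. one checksum per 64 MB chunk -->
  </property>
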
I didn't quite understand what you meant by bad for cache locality. The jobs
were IO bound in the first place; any cache effect was secondary, at least an
order of magnitude smaller. Can you clarify which particular computation
(perhaps within Hadoop) was slowed down by a large io buffer size?

What bothered you is exactly what bothered me and prompted me to ask why the
job tracker reported far more bytes read by the map tasks. I can confirm that
the experiments were set up correctly. In fact, the numbers of map tasks were
reported correctly by the job tracker: 1600 for the 1GB file dataset, 6400 for
the 256MB file dataset, and so forth.

Tiankai

-----Original Message-----
From: Matei Zaharia [mailto:[email protected]]
Sent: Friday, April 03, 2009 11:21 AM
To: [email protected]
Subject: Re: Hadoop/HDFS for scientific simulation output data analysis

Hi Tiankai,

The one strange thing I see in your configuration as described is IO buffer
size and IO bytes per checksum set to 64 MB. This is much higher than the
recommended defaults, which are about 64 KB for buffer size and 1 KB or 512
bytes for checksum. (Actually, I haven't seen anyone change the checksum from
its default of 512 bytes.) Having huge buffers is bad for memory consumption
and cache locality.

The other thing that bothers me is that on your 64 MB data set, you have 28 TB
of HDFS bytes read. This is off from number of map tasks * bytes per map by an
order of magnitude. Are you sure that you've generated the data set correctly
and that it's the only input path given to your job? Does
bin/hadoop dfs -dus <path to dataset> come out as 1.6 TB?

Matei

On Sat, Mar 28, 2009 at 4:10 PM, Tu, Tiankai <[email protected]> wrote:
> Hi,
>
> I have been exploring the feasibility of using Hadoop/HDFS to analyze
> terabyte-scale scientific simulation output datasets. After a set of
> initial experiments, I have a number of questions regarding (1) the
> configuration settings and (2) the IO read performance.
>
> ------------------------------------------------------------------------
> Unlike typical Hadoop applications, post-simulation analysis usually
> processes one file at a time. So I wrote a
> WholeFileInputFormat/WholeFileRecordReader that reads an entire file
> without parsing the content, as suggested by the Hadoop wiki FAQ.
>
> Specifying WholeFileInputFormat as the input file format
> (conf.setInputFormat(FrameInputFormat.class)), I constructed a simple
> MapReduce program whose sole purpose is to measure how fast Hadoop/HDFS
> can read data.
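>
> In sketch form, the whole-file reader looks roughly like the following
> (condensed, not the exact code I ran; class names are illustrative, the
> method signatures follow the 0.19 org.apache.hadoop.mapred API, and most
> error handling is omitted):
>
>   import java.io.IOException;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.*;      // FileSystem, FSDataInputStream, Path
>   import org.apache.hadoop.io.*;      // NullWritable, BytesWritable, IOUtils
>   import org.apache.hadoop.mapred.*;  // old-style (0.19) mapred API
>
>   // Presents each input file as a single unsplittable record whose value
>   // is the raw file content.
>   public class WholeFileInputFormat
>       extends FileInputFormat<NullWritable, BytesWritable> {
>
>     protected boolean isSplitable(FileSystem fs, Path file) {
>       return false;  // never split a file across map tasks
>     }
>
>     public RecordReader<NullWritable, BytesWritable> getRecordReader(
>         InputSplit split, JobConf job, Reporter reporter) throws IOException {
>       return new WholeFileRecordReader((FileSplit) split, job);
>     }
>   }
>
>   // Emits exactly one record per split: the whole file as a BytesWritable.
>   class WholeFileRecordReader
>       implements RecordReader<NullWritable, BytesWritable> {
>
>     private final FileSplit split;
>     private final Configuration conf;
>     private boolean processed = false;
>
>     WholeFileRecordReader(FileSplit split, Configuration conf) {
>       this.split = split;
>       this.conf = conf;
>     }
>
>     public boolean next(NullWritable key, BytesWritable value) throws IOException {
>       if (processed) return false;
>       byte[] contents = new byte[(int) split.getLength()];
>       Path file = split.getPath();
>       FileSystem fs = file.getFileSystem(conf);
>       FSDataInputStream in = fs.open(file);
>       try {
>         IOUtils.readFully(in, contents, 0, contents.length);
>         value.set(contents, 0, contents.length);
>       } finally {
>         IOUtils.closeStream(in);
>       }
>       processed = true;
>       return true;
>     }
>
>     public NullWritable createKey() { return NullWritable.get(); }
>     public BytesWritable createValue() { return new BytesWritable(); }
>     public long getPos() { return processed ? split.getLength() : 0; }
>     public float getProgress() { return processed ? 1.0f : 0.0f; }
>     public void close() { }
>   }
>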
> Here is the gist of the test program:
>
> - The map method does nothing; it returns immediately when called
> - No reduce tasks (conf.setNumReduceTasks(0))
> - JVMs reused (conf.setNumTasksToExecutePerJvm(-1))
>
> The detailed hardware/software configurations are listed below.
>
> Hardware:
> - 103 nodes, each with two 2.33GHz quad-core processors and 16 GB memory
> - 1 GigE connection out of each node, connecting to a 1 GigE switch in
> the rack (3 racks in total)
> - Each rack switch has 4 10-GigE connections to a backbone
> full-bandwidth 10-GigE switch (second-tier switch)
> - Software (md) RAID0 on 4 SATA disks, with a capacity of 500 GB per
> node
> - Raw RAID0 bulk data transfer rate around 200 MB/s (dd of a 4GB file
> after dropping the Linux vfs cache)
>
> Software:
> - 2.6.26-10smp kernel
> - Hadoop 0.19.1
> - Three nodes as namenode, secondary namenode, and job tracker,
> respectively
> - Remaining 100 nodes as slaves, each running as both datanode and
> tasktracker
>
> Relevant hadoop-site.xml settings:
> - dfs.namenode.handler.count = 100
> - io.file.buffer.size = 67108864
> - io.bytes.per.checksum = 67108864
> - mapred.task.timeout = 1200000
> - mapred.map.max.attempts = 8
> - mapred.tasktracker.map.tasks.maximum = 8
> - dfs.replication = 3
> - topology.script.file.name set properly to a correct Python script
>
> Dataset characteristics:
> - Four datasets consisting of files of 1 GB, 256 MB, 64 MB, and 2 MB,
> respectively
> - Each dataset has 1.6 terabytes of data (that is, 1600 1GB files, 6400
> 256MB files, etc.)
> - Datasets populated into HDFS via a parallel C MPI program (linked with
> libhdfs.so) running on the 100 slave nodes
> - dfs block size set to the file size (otherwise, accessing a single
> file may require network data transfer)
>
> I launched the test MapReduce jobs one after another (so there was no
> interference) and collected the following performance results:
>
> Dataset name, Finished in, Failed/Killed task attempts, HDFS bytes read
> (Map=Total), Rack-local map tasks, Launched map tasks, Data-local map
> tasks
>
> 1GB file dataset, 16mins11sec, 0/382, (2,578,054,119,424), 98, 1982,
> 1873
> 256MB file dataset, 50mins9sec, 0/397, (7,754,295,017,472), 156, 6797,
> 6639
> 64MB file dataset, 4hrs18mins21sec, 394/251, (28,712,795,897,856), 153,
> 26245, 26068
>
> The job for the 2MB file dataset failed to run due to the following
> error:
>
> 09/03/27 21:39:58 INFO mapred.FileInputFormat: Total input paths to
> process : 819200
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.Arrays.copyOf(Arrays.java:2786)
>         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
>         at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
>         at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:274)
>
> After running into this error, the job tracker no longer accepted jobs.
> I stopped and restarted the job tracker with a larger heap size (8 GB),
> but it still didn't accept new jobs.
>
> ------------------------------------------------------------------------
> Questions:
>
> (1) Why is reading 1GB files significantly faster than reading smaller
> files, even though reading a 256MB file is just as much of a bulk
> transfer?
>
> (2) Why are the reported HDFS bytes read significantly higher than the
> dataset size (1.6TB)? (The percentage of failed/killed task attempts is
> far too small to account for the extra bytes read.)
>
> (3) What is the maximum number (roughly) of input paths the job tracker
> can handle? (For scientific simulation output datasets, it is quite
> commonplace to have hundreds of thousands to millions of files.)
>
> (4) Even for the 1GB file dataset, considering the percentage of
> data-local map tasks (94.5%), the overall end-to-end read bandwidth
> (1.69 GB/s) is much lower than the potential performance offered by the
> hardware (200 MB/s local RAID0 read bandwidth, multiplied by 100 slave
> nodes). What settings should I change (either in the test MapReduce
> program or in the config files) to obtain better performance?
>
> Thank you very much.
>
> Tiankai
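
P.S. In case the full picture helps, the driver for the read test is essentially the following (a condensed sketch, not the exact code I ran; the NullOutputFormat is a simplification for discarding the empty map output, and it uses the whole-file input format sketched in the quoted message):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.NullOutputFormat;

  public class ReadThroughputTest {

    // The map method does no work, so the measured cost is HDFS reads
    // plus framework overhead.
    public static class NoOpMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, NullWritable, NullWritable> {
      public void map(NullWritable key, BytesWritable value,
                      OutputCollector<NullWritable, NullWritable> output,
                      Reporter reporter) {
        // intentionally empty: return immediately
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(ReadThroughputTest.class);
      conf.setJobName("hdfs-read-throughput");

      conf.setInputFormat(WholeFileInputFormat.class);  // whole-file reader sketched above
      conf.setMapperClass(NoOpMapper.class);
      conf.setNumReduceTasks(0);                        // map-only job
      conf.setNumTasksToExecutePerJvm(-1);              // reuse JVMs across tasks
      conf.setOutputFormat(NullOutputFormat.class);     // discard the (empty) map output
      conf.setOutputKeyClass(NullWritable.class);
      conf.setOutputValueClass(NullWritable.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      JobClient.runJob(conf);
    }
  }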
