Hi all,

I see there are two read methods in DFSInputStream:

    int read(byte buf[], int off, int len)
    int read(long position, byte[] buffer, int offset, int length)
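For reference, here is a minimal sketch of how the two calls differ (the class name and the path /data/file0 are hypothetical, and it assumes an HDFS reachable through the default Configuration): the plain read consumes bytes from the current stream position and advances it, while the positional read takes an explicit file offset and leaves the stream position untouched.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadVariants {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[64 * 1024];
            FSDataInputStream in = fs.open(new Path("/data/file0")); // hypothetical file

            // Streaming read: reads from the current stream position and advances it.
            int n1 = in.read(buf, 0, buf.length);

            // Positional read (pread): reads at the given offset; the stream
            // position is not moved, so a later streaming read is unaffected.
            int n2 = in.read(1024L, buf, 0, buf.length);

            System.out.println("streaming: " + n1 + " bytes, positional: " + n2 + " bytes");
            in.close();
        }
    }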
And I used the following code to test read performance. Before the test I generate some files in the directory DATA_DIR, then run this function for a fixed amount of time and calculate the read throughput. The initFiles() function is borrowed from the patch in https://issues.apache.org/jira/browse/HDFS-236.

My question: I tried the two read methods above and found a huge difference in throughput. The results are attached below. Is there something wrong with my code? I can't believe there can be such a big difference...

In https://issues.apache.org/jira/browse/HDFS-236 I saw the following performance data posted by Raghu Angadi:

    Description of read                       Time for each read in ms
    1000 native reads over block files        09.5
    Random Read 10x500                        10.8
    Random Read without CRC                   10.5
    Random Read with 'seek() and read()'      12.5
    Read with sequential offsets              01.7
    1000 native reads without closing files   07.5

Based on this data, sequential read is about 6x faster than random read, which is reasonable, so my numbers seem unreasonable. Can anybody provide some comments?

Here are my test results (throughput is in MB/s).

With the first read, read(byte buf[], int off, int len):

    test type,read size,read ops,start time,end time,test time,real read time,throughput
    sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09

With the second read, read(long position, byte[] buffer, int offset, int length):

    test type,read size,read ops,start time,end time,test time,real read time,throughput
    sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72

My cluster: 1 name node + 3 data nodes, replication = 3.

And my code (bufferSize, fileList, fs, start, end, sdf and barrier() are members of the test class):

    private void sequenceRead(long time) throws IOException {
        byte[] data = new byte[bufferSize];
        Random rand = new Random();
        initFiles(DATA_DIR);
        long period = time * 1000;
        FSDataInputStream in = null;
        long totalSize = 0;
        long readCount = 0;
        long offset = 0;
        // Start at a random file in the list.
        int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
        if (barrier()) {
            start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < period) {
                if (in == null) {
                    FileInfo file = (FileInfo) fileList.get(index);
                    in = file.fileStream;
                    if (in == null) {
                        in = fs.open(file.filePath);
                        file.fileStream = in;
                    }
                    // Advance to the next file. Note: the original line was
                    // "index = (index ++) % fileList.size();", which never
                    // advances because the assignment overwrites the increment.
                    index = (index + 1) % fileList.size();
                }
                int actualSize = in.read(offset, data, 0, bufferSize);  // positional read
                //int actualSize = in.read(data, 0, bufferSize);        // streaming read
                readCount++;
                if (actualSize > 0) {
                    totalSize += actualSize;
                    offset += actualSize;
                }
                if (actualSize < bufferSize) {
                    // End of file (or short read): switch to the next file.
                    //in.seek(0);
                    in = null;
                    offset = 0;
                }
            }
            //out.close();  // 'out' is not defined or used in this method
            end = System.currentTimeMillis();
            for (FileInfo finfo : fileList) {
                if (finfo.fileStream != null)
                    IOUtils.closeStream(finfo.fileStream);
            }
            System.out.println("test type,read size,read ops,start time,end time,test time,real read time,throughput");
            String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
                    totalSize, readCount,
                    sdf.format(new Date(start)), sdf.format(new Date(end)),
                    time, (end - start) / 1000,
                    (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
            System.out.println(s);
        }
    }
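In case the units look off: the throughput column above is computed as bytes * 1000 / (elapsed ms * 1024 * 1024), i.e. MB/s. A standalone arithmetic check (hypothetical class name, values taken from the first run above):

    public class ThroughputCheck {
        public static void main(String[] args) {
            long totalSize = 64628740096L; // bytes read in the first run
            long elapsedMs = 400000L;      // 400 second test window
            double mbPerSec = (double) (totalSize * 1000)
                    / (double) (elapsedMs * 1024 * 1024);
            System.out.printf("%.2f MB/s%n", mbPerSec); // prints 154.09
        }
    }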
