Martin,
You are comparing different things from what HDFS-236 compared.
The '6x' difference you noted is between pread() with a random offset
and pread() with a sequential offset (otherwise 6x would be too small
a gap between sequential and random access in general).
But what you are doing is comparing two different read APIs, both
doing sequential I/O. IOW:
HDFS-236: the 6x is between:
----------------------------
a) read(random_offset, buf);
b) read(sequential_offset, buf);
(b) measures the overhead of HDFS's "positional read",
which could be avoided.
Your program:
-------------
a) read(buf);                    // "true" sequential read, if you will
b) read(sequential_offset, buf); // same as (b) above
So you are not doing any actual random I/O.
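To make the comparison concrete, here is a small local-file analogy using plain java.io/java.nio (not the HDFS API itself; DFSInputStream adds its own per-call setup on top of this). With sequential offsets both patterns touch the same bytes; the only difference is who tracks the offset:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Random;

public class ReadPatterns {
    public static void main(String[] args) throws IOException {
        // Sample data written to a temp file so the example is self-contained.
        Path p = Files.createTempFile("readpatterns", ".dat");
        byte[] expected = new byte[8192];
        new Random(42).nextBytes(expected);
        Files.write(p, expected);

        // (a) "true" sequential read: the stream tracks the offset itself.
        byte[] seq = new byte[expected.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(p))) {
            in.readFully(seq);
        }

        // (b) positional read at sequential offsets: the caller supplies the
        // offset on every call, like read(position, buf, off, len) in HDFS.
        byte[] pos = new byte[expected.length];
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            int chunk = 1024;
            for (long off = 0; off < expected.length; off += chunk) {
                ByteBuffer bb = ByteBuffer.wrap(pos, (int) off, chunk);
                while (bb.hasRemaining()) {
                    // array index == file offset here, so bb.position() works
                    if (ch.read(bb, bb.position()) < 0) break;
                }
            }
        }

        // Same bytes either way; only the offset bookkeeping differs.
        System.out.println(Arrays.equals(expected, seq) && Arrays.equals(expected, pos));
        Files.delete(p);
    }
}
```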
Does this sound right?
Raghu.
Martin Mituzas wrote:
Hi all,
I see there are two read() methods in DFSInputStream:
int read(byte buf[], int off, int len)
int read(long position, byte[] buffer, int offset, int length)
And I use the following code to test read performance.
Before the test I generate some files in the directory DATA_DIR, then I run
this function for some time and calculate the read throughput.
The initFiles() function is borrowed from the patch at
https://issues.apache.org/jira/browse/HDFS-236.
My question is that I tried the two read methods above and found a huge
difference in throughput. The results are attached below. Is there something
wrong with my code? I can't believe there can be such a big difference...
And in https://issues.apache.org/jira/browse/HDFS-236, I saw the following
performance data posted by Raghu Angadi:
Description of read                        Time for each read (ms)
1000 native reads over block files          9.5
Random Read 10x500                         10.8
Random Read without CRC                    10.5
Random Read with 'seek() and read()'       12.5
Read with sequential offsets                1.7
1000 native reads without closing files     7.5
So based on this data, sequential read is about 6x faster than random read,
which is reasonable, whereas my data seems unreasonable. Can anybody provide
some comments?
Here is my test result.

With the first read, read(buf, off, len):
test type,read size,read ops,start time,end time,test time,real read time,throughput
sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09

With the second read, read(position, buffer, offset, length):
test type,read size,read ops,start time,end time,test time,real read time,throughput
sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72
My cluster: 1 name node + 3 data nodes, replication = 3.
And my code:
private void sequenceRead(long time) throws IOException {
    byte[] data = new byte[bufferSize];
    Random rand = new Random();
    initFiles(DATA_DIR);
    long period = time * 1000;
    FSDataInputStream in = null;
    long totalSize = 0;
    long readCount = 0;
    long offset = 0;
    int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
    if (barrier()) {
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < period) {
            if (in == null) {
                FileInfo file = (FileInfo) fileList.get(index);
                in = file.fileStream;
                if (in == null) {
                    in = fs.open(file.filePath);
                    file.fileStream = in;
                }
                // advance round-robin; note that "index = (index++) % size"
                // is a no-op in Java: the assignment overwrites the increment
                index = (index + 1) % fileList.size();
            }
            // positional read (read() returns int, not long)
            int actualSize = in.read(offset, data, 0, bufferSize);
            //int actualSize = in.read(data, 0, bufferSize); // stateful read
            readCount++;
            if (actualSize > 0) {
                totalSize += actualSize;
                offset += actualSize;
            }
            if (actualSize < bufferSize) { // short read or EOF: next file
                //in.seek(0);
                in = null;
                offset = 0;
            }
        }
        out.close();
        end = System.currentTimeMillis();
        for (FileInfo finfo : fileList) {
            if (finfo.fileStream != null)
                IOUtils.closeStream(finfo.fileStream);
        }
        System.out.println("test type,read size,read ops,start time,"
                + "end time,test time,real read time,throughput");
        String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
                totalSize,
                readCount,
                sdf.format(new Date(start)),
                sdf.format(new Date(end)),
                time,
                (end - start) / 1000,
                (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
        System.out.println(s);
    }
}
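For what it's worth, the throughput computation can also be isolated and sanity-checked against a local in-memory stream, independent of HDFS. This is only a sketch; the measure() helper and its MB/s unit are my own naming, mirroring the bytes * 1000 / (elapsedMillis * 1024 * 1024) formula in the benchmark above:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ThroughputSketch {
    // Reads the stream to exhaustion in bufferSize chunks and returns
    // throughput in MB/s (same formula as the benchmark, using nanoTime).
    static double measure(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        long start = System.nanoTime();
        int n;
        while ((n = in.read(buf, 0, bufferSize)) > 0) {
            total += n;
        }
        long elapsed = Math.max(1, System.nanoTime() - start); // avoid /0
        return (total * 1e9) / (elapsed * 1024.0 * 1024.0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MiB of sample input
        double mbps = measure(new ByteArrayInputStream(data), 64 * 1024);
        System.out.println(mbps > 0.0);
    }
}
```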