Hi all,

I see there are two read methods in DFSInputStream:

    int read(byte buf[], int off, int len)
    int read(long position, byte[] buffer, int offset, int length)
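For reference, here is a minimal sketch of how the two calls differ (the class name and the path /data/file0 are hypothetical, and it assumes an HDFS reachable through the default Configuration): the plain read consumes bytes from the current stream position and advances it, while the positional read takes an explicit file offset and leaves the stream position untouched.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadVariants {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            byte[] buf = new byte[64 * 1024];
            FSDataInputStream in = fs.open(new Path("/data/file0")); // hypothetical file

            // Streaming read: reads from the current stream position and advances it.
            int n1 = in.read(buf, 0, buf.length);

            // Positional read (pread): reads at the given offset; the stream
            // position is not moved, so a later streaming read is unaffected.
            int n2 = in.read(1024L, buf, 0, buf.length);

            System.out.println("streaming: " + n1 + " bytes, positional: " + n2 + " bytes");
            in.close();
        }
    }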
And I used the following code to test read performance. Before the test I generate some files in the directory DATA_DIR, then run this function for a fixed amount of time and calculate the read throughput. The initFiles() function is borrowed from the patch in https://issues.apache.org/jira/browse/HDFS-236.

My question: I tried the two read methods above and found a huge difference in throughput. The results are attached below. Is there something wrong with my code? I can't believe there can be such a big difference...

In https://issues.apache.org/jira/browse/HDFS-236 I saw the following performance data posted by Raghu Angadi:

    Description of read                       Time for each read in ms
    1000 native reads over block files        09.5
    Random Read 10x500                        10.8
    Random Read without CRC                   10.5
    Random Read with 'seek() and read()'      12.5
    Read with sequential offsets              01.7
    1000 native reads without closing files   07.5

Based on this data, sequential read is about 6x faster than random read, which is reasonable, so my numbers seem unreasonable. Can anybody provide some comments?

Here are my test results (throughput is in MB/s).

With the first read, read(byte buf[], int off, int len):

    test type,read size,read ops,start time,end time,test time,real read time,throughput
    sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09

With the second read, read(long position, byte[] buffer, int offset, int length):

    test type,read size,read ops,start time,end time,test time,real read time,throughput
    sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72

My cluster: 1 name node + 3 data nodes, replication = 3.

And my code (bufferSize, fileList, fs, start, end, sdf and barrier() are members of the test class):

    private void sequenceRead(long time) throws IOException {
        byte[] data = new byte[bufferSize];
        Random rand = new Random();
        initFiles(DATA_DIR);
        long period = time * 1000;
        FSDataInputStream in = null;
        long totalSize = 0;
        long readCount = 0;
        long offset = 0;
        // Start at a random file in the list.
        int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
        if (barrier()) {
            start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < period) {
                if (in == null) {
                    FileInfo file = (FileInfo) fileList.get(index);
                    in = file.fileStream;
                    if (in == null) {
                        in = fs.open(file.filePath);
                        file.fileStream = in;
                    }
                    // Advance to the next file. Note: the original line was
                    // "index = (index ++) % fileList.size();", which never
                    // advances because the assignment overwrites the increment.
                    index = (index + 1) % fileList.size();
                }
                int actualSize = in.read(offset, data, 0, bufferSize);  // positional read
                //int actualSize = in.read(data, 0, bufferSize);        // streaming read
                readCount++;
                if (actualSize > 0) {
                    totalSize += actualSize;
                    offset += actualSize;
                }
                if (actualSize < bufferSize) {
                    // End of file (or short read): switch to the next file.
                    //in.seek(0);
                    in = null;
                    offset = 0;
                }
            }
            //out.close();  // 'out' is not defined or used in this method
            end = System.currentTimeMillis();
            for (FileInfo finfo : fileList) {
                if (finfo.fileStream != null)
                    IOUtils.closeStream(finfo.fileStream);
            }
            System.out.println("test type,read size,read ops,start time,end time,test time,real read time,throughput");
            String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
                    totalSize, readCount,
                    sdf.format(new Date(start)), sdf.format(new Date(end)),
                    time, (end - start) / 1000,
                    (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
            System.out.println(s);
        }
    }
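In case the units look off: the throughput column above is computed as bytes * 1000 / (elapsed ms * 1024 * 1024), i.e. MB/s. A standalone arithmetic check (hypothetical class name, values taken from the first run above):

    public class ThroughputCheck {
        public static void main(String[] args) {
            long totalSize = 64628740096L; // bytes read in the first run
            long elapsedMs = 400000L;      // 400 second test window
            double mbPerSec = (double) (totalSize * 1000)
                    / (double) (elapsedMs * 1024 * 1024);
            System.out.printf("%.2f MB/s%n", mbPerSec); // prints 154.09
        }
    }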
