Still waiting for a response... Thanks in advance.
Martin Mituzas wrote:
>
> hi, all
>
> I see there are two read methods in DFSInputStream:
>
>     int read(byte buf[], int off, int len)
>     int read(long position, byte[] buffer, int offset, int length)
>
> I used the following code to test the read performance. Before the test
> I generate some files in the directory DATA_DIR, then I run this function
> for some time and calculate the read throughput. The initFiles() function
> is borrowed from the patch https://issues.apache.org/jira/browse/HDFS-236.
>
> My question: I tried the two read methods above (see the commented lines)
> and found that the throughputs differ hugely. The results are attached
> below. Is there something wrong with my code? I can't believe there can
> be such a big difference...
>
> In https://issues.apache.org/jira/browse/HDFS-236 I also saw the
> following performance data posted by Raghu Angadi:
>
>     Description of read                        Time for each read in ms
>     1000 native reads over block files         09.5
>     Random Read 10x500                         10.8
>     Random Read without CRC                    10.5
>     Random Read with 'seek() and read()'       12.5
>     Read with sequential offsets               01.7
>     1000 native reads without closing files    07.5
>
> Based on this data, sequential read is about 6x faster than random read,
> which is reasonable, while my numbers seem unreasonable. Can anybody
> provide some comments?
>
> Here is my test result.
>
> With the first read method:
>
>     test type,read size,read ops,start time,end time,test time,real read time,throughput
>     sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09
>
> With the second read method:
>
>     test type,read size,read ops,start time,end time,test time,real read time,throughput
>     sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72
>
> My cluster: 1 name node + 3 data nodes, replication = 3.
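For context, the two signatures behave differently: read(buf, off, len) is
the ordinary stateful InputStream read that consumes from and advances the
stream's current position, while read(position, buffer, offset, length) is
the positional "pread" from PositionedReadable, which reads at an explicit
offset and does not move the stream's position. A minimal standalone sketch
of the two calls (my own illustration, not Martin's benchmark; the path and
buffer size are made-up placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadPathDemo {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/testfile");   // hypothetical test file
            byte[] buf = new byte[64 * 1024];

            FSDataInputStream in = fs.open(file);

            // Stateful read: consumes from the stream's current position
            // and advances it.
            int n = in.read(buf, 0, buf.length);

            // Positional read (pread): reads at an explicit offset and
            // leaves the stream's own position untouched.
            int m = in.read(0L, buf, 0, buf.length);

            in.close();
            System.out.println("stateful: " + n + " bytes, pread: " + m + " bytes");
        }
    }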
> And my code:
>
>     private void sequenceRead(long time) throws IOException {
>         byte[] data = new byte[bufferSize];
>         Random rand = new Random();
>         initFiles(DATA_DIR);
>         long period = time * 1000;
>         FSDataInputStream in = null;
>         long totalSize = 0;
>         long readCount = 0;
>         long offset = 0;
>         int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
>         if (barrier()) {
>             start = System.currentTimeMillis();
>             while (System.currentTimeMillis() - start < period) {
>                 if (in == null) {
>                     FileInfo file = (FileInfo) fileList.get(index);
>                     in = file.fileStream;
>                     if (in == null) {
>                         in = fs.open(file.filePath);
>                         file.fileStream = in;
>                     }
>                     // was "index = (index ++) % fileList.size();", which
>                     // never advances index; fixed to move to the next file
>                     index = (index + 1) % fileList.size();
>                 }
>                 long actualSize = in.read(offset, data, 0, bufferSize);
>                 //long actualSize = in.read(data, 0, bufferSize);
>                 readCount++;
>
>                 if (actualSize > 0) {
>                     totalSize += actualSize;
>                     offset += actualSize;
>                 }
>                 if (actualSize < bufferSize) {
>                     //in.seek(0);
>                     in = null;
>                     offset = 0;
>                 }
>             }
>             end = System.currentTimeMillis();
>
>             for (FileInfo finfo : fileList) {
>                 if (finfo.fileStream != null)
>                     IOUtils.closeStream(finfo.fileStream);
>             }
>             System.out.println("test type,read size,read ops,start time,end time,test time,real read time,throughput");
>             String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
>                     totalSize,
>                     readCount,
>                     sdf.format(new Date(start)),
>                     sdf.format(new Date(end)),
>                     time,
>                     (end - start) / 1000,
>                     (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
>             System.out.println(s);
>         }
>     }
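To make the comparison reproducible without the FileInfo/fileList
scaffolding, here is a self-contained sketch of the same experiment over a
single file. This is my own sketch under stated assumptions: the path is a
placeholder, there is no warm-up pass, and both passes hit the same file, so
OS-level caching will favor whichever pass runs second.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PreadVsRead {
        private static final int BUF_SIZE = 64 * 1024;

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/testfile");   // hypothetical test file
            byte[] buf = new byte[BUF_SIZE];

            // Pass 1: stateful reads until EOF.
            long t0 = System.currentTimeMillis();
            FSDataInputStream in = fs.open(file);
            long statefulBytes = 0;
            int n;
            while ((n = in.read(buf, 0, BUF_SIZE)) > 0) {
                statefulBytes += n;
            }
            in.close();
            long statefulMs = System.currentTimeMillis() - t0;

            // Pass 2: positional reads (pread) at sequential offsets until EOF.
            t0 = System.currentTimeMillis();
            in = fs.open(file);
            long offset = 0;
            while ((n = in.read(offset, buf, 0, BUF_SIZE)) > 0) {
                offset += n;
            }
            in.close();
            long preadMs = System.currentTimeMillis() - t0;

            System.out.printf("stateful: %d bytes in %d ms; pread: %d bytes in %d ms%n",
                    statefulBytes, statefulMs, offset, preadMs);
        }
    }

Running each pass over a fresh set of files (as Martin's initFiles() setup
does) would avoid the caching bias between the two passes.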
