Martin,
You are comparing different things from what HDFS-236 compared.
The '6x' difference you noted is between pread() with a random offset
and pread() with a sequential offset (otherwise 6x would be too small
a gap between sequential and random access in general).
But what you are doing is comparing two different read APIs, both
doing sequential I/O. IOW:
HDFS-236: the 6x is between:
----------------------------
a) read(random_offset, buf);
b) read(sequential_offset, buf);
(b) measures the overhead of HDFS's "positional read",
which could be avoided.
Your program:
-------------
a) read(buf);                    // "true" sequential read, if you will
b) read(sequential_offset, buf); // same as (b) above
So you are not doing any actual random I/O.
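To make the comparison concrete, here is a small local-file analogy using plain java.io/java.nio (not the HDFS API itself; DFSInputStream adds its own per-call setup on top of this). With sequential offsets both patterns touch the same bytes; the only difference is who tracks the offset:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.Random;

public class ReadPatterns {
    public static void main(String[] args) throws IOException {
        // Sample data written to a temp file so the example is self-contained.
        Path p = Files.createTempFile("readpatterns", ".dat");
        byte[] expected = new byte[8192];
        new Random(42).nextBytes(expected);
        Files.write(p, expected);

        // (a) "true" sequential read: the stream tracks the offset itself.
        byte[] seq = new byte[expected.length];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(p))) {
            in.readFully(seq);
        }

        // (b) positional read at sequential offsets: the caller supplies the
        // offset on every call, like read(position, buf, off, len) in HDFS.
        byte[] pos = new byte[expected.length];
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            int chunk = 1024;
            for (long off = 0; off < expected.length; off += chunk) {
                ByteBuffer bb = ByteBuffer.wrap(pos, (int) off, chunk);
                while (bb.hasRemaining()) {
                    // array index == file offset here, so bb.position() works
                    if (ch.read(bb, bb.position()) < 0) break;
                }
            }
        }

        // Same bytes either way; only the offset bookkeeping differs.
        System.out.println(Arrays.equals(expected, seq) && Arrays.equals(expected, pos));
        Files.delete(p);
    }
}
```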
Does this sound right?
Raghu.
Martin Mituzas wrote:
Hi all,
I see there are two read() methods in DFSInputStream:
int read(byte buf[], int off, int len)
int read(long position, byte[] buffer, int offset, int length)
And I use the following code to test read performance.
Before the test I generate some files in the directory DATA_DIR, then I run
this function for some time and calculate the read throughput.
The initFiles() function is borrowed from the patch at
https://issues.apache.org/jira/browse/HDFS-236.
My question is that I tried the two read methods above and found a huge
difference in throughput. The results are attached below. Is there something
wrong with my code? I can't believe there can be such a big difference...
And in https://issues.apache.org/jira/browse/HDFS-236, I saw the following
performance data posted by Raghu Angadi:
Description of read                        Time for each read (ms)
1000 native reads over block files          9.5
Random Read 10x500                         10.8
Random Read without CRC                    10.5
Random Read with 'seek() and read()'       12.5
Read with sequential offsets                1.7
1000 native reads without closing files     7.5
So based on this data, sequential read is about 6x faster than random read,
which is reasonable, whereas my data seems unreasonable. Can anybody provide
some comments?
Here is my test result.

With the first read, read(buf, off, len):
test type,read size,read ops,start time,end time,test time,real read time,throughput
sequence read,64628740096,15778506,[2009-07-20 14:47:01 704],[2009-07-20 14:53:41 704],400,400,154.09

With the second read, read(position, buffer, offset, length):
test type,read size,read ops,start time,end time,test time,real read time,throughput
sequence read,2400047104,585949,[2009-07-20 14:59:50 328],[2009-07-20 15:06:30 328],400,400,5.72
My cluster: 1 name node + 3 data nodes, replication = 3.
And my code:
private void sequenceRead(long time) throws IOException {
    byte[] data = new byte[bufferSize];
    Random rand = new Random();
    initFiles(DATA_DIR);
    long period = time * 1000;
    FSDataInputStream in = null;
    long totalSize = 0;
    long readCount = 0;
    long offset = 0;
    int index = (rand.nextInt() & Integer.MAX_VALUE) % fileList.size();
    if (barrier()) {
        start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start < period) {
            if (in == null) {
                FileInfo file = (FileInfo) fileList.get(index);
                in = file.fileStream;
                if (in == null) {
                    in = fs.open(file.filePath);
                    file.fileStream = in;
                }
                // advance round-robin; note that "index = (index++) % size"
                // is a no-op in Java: the assignment overwrites the increment
                index = (index + 1) % fileList.size();
            }
            // positional read (read() returns int, not long)
            int actualSize = in.read(offset, data, 0, bufferSize);
            //int actualSize = in.read(data, 0, bufferSize); // stateful read
            readCount++;
            if (actualSize > 0) {
                totalSize += actualSize;
                offset += actualSize;
            }
            if (actualSize < bufferSize) { // short read or EOF: next file
                //in.seek(0);
                in = null;
                offset = 0;
            }
        }
        out.close();
        end = System.currentTimeMillis();
        for (FileInfo finfo : fileList) {
            if (finfo.fileStream != null)
                IOUtils.closeStream(finfo.fileStream);
        }
        System.out.println("test type,read size,read ops,start time,"
                + "end time,test time,real read time,throughput");
        String s = String.format("sequence read,%d,%d,[%s],[%s],%d,%d,%.2f",
                totalSize,
                readCount,
                sdf.format(new Date(start)),
                sdf.format(new Date(end)),
                time,
                (end - start) / 1000,
                (double) (totalSize * 1000) / (double) ((end - start) * 1024 * 1024));
        System.out.println(s);
    }
}
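For what it's worth, the throughput computation can also be isolated and sanity-checked against a local in-memory stream, independent of HDFS. This is only a sketch; the measure() helper and its MB/s unit are my own naming, mirroring the bytes * 1000 / (elapsedMillis * 1024 * 1024) formula in the benchmark above:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ThroughputSketch {
    // Reads the stream to exhaustion in bufferSize chunks and returns
    // throughput in MB/s (same formula as the benchmark, using nanoTime).
    static double measure(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        long start = System.nanoTime();
        int n;
        while ((n = in.read(buf, 0, bufferSize)) > 0) {
            total += n;
        }
        long elapsed = Math.max(1, System.nanoTime() - start); // avoid /0
        return (total * 1e9) / (elapsed * 1024.0 * 1024.0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MiB of sample input
        double mbps = measure(new ByteArrayInputStream(data), 64 * 1024);
        System.out.println(mbps > 0.0);
    }
}
```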