Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread Raghu Angadi


This seems to be the case. I don't think there is any specific reason 
not to read across the block boundary...


Even if HDFS does read across the blocks, it is still not a good idea to 
ignore the JavaDoc for read(): a single call is always allowed to return 
fewer bytes than requested. If you want all the bytes read, you should 
use a while loop or one of the readFully() variants. For example, if you 
later change your code by wrapping a BufferedInputStream around 
'in', you would still get partial reads even if HDFS reads all the data.
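
Something like this, for instance (an untested sketch; 'in' is your 
input stream and 'buffer' is the array you want filled):

  int off = 0;
  while (off < buffer.length) {
    // read() may legally return fewer bytes than requested, so keep going
    int n = in.read(buffer, off, buffer.length - off);
    if (n == -1) {
      break; // end of stream before the buffer was filled
    }
    off += n;
  }

Or simply call in.readFully(buffer) -- FSDataInputStream inherits it from 
DataInputStream, and it throws EOFException if the stream ends early.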


Raghu.

forbbs forbbs wrote:

The Hadoop version is 0.19.0.
My file is larger than 64MB, and the block size is 64MB.

The output of the code below is '10'. Can I read across the block
boundary, or should I use 'while (left..){}' style code?

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockBoundaryRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream fin = fs.open(new Path(args[0]));

    // Seek to 10 bytes before the first 64MB block boundary.
    fin.seek(64 * 1024 * 1024 - 10);
    byte[] buffer = new byte[32 * 1024];
    int len = fin.read(buffer);   // returns 10: the read stops at the boundary
    //int len = fin.read(buffer, 0, 128);
    System.out.println(len);

    fin.close();
  }
}




Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread Matei Zaharia
This kind of partial read is often used by the OS to return control to your
application as soon as possible when reading more data would block, in case
you can begin computing on the partial data. In some applications that's not
useful, but when you can start computing on partial data, it allows the OS to
overlap IO with your computation, improving throughput. I think
FSDataInputStream returns at the block boundary for the same reason:
presumably the next block has to be fetched from a (possibly different)
datanode, so the client hands back what it already has rather than waiting.
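
For example (an untested sketch; handleChunk is a hypothetical per-chunk
handler), you can compute on each chunk as read() returns it instead of
waiting for a full buffer:

  byte[] buffer = new byte[32 * 1024];
  int n;
  while ((n = in.read(buffer)) != -1) {
    // work on the n bytes we already have while the next read's IO proceeds
    handleChunk(buffer, 0, n); // hypothetical: your per-chunk computation
  }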

On Sun, Jun 28, 2009 at 11:16 AM, Raghu Angadi rang...@yahoo-inc.com wrote:


 This seems to be the case. I don't think there is any specific reason not
 to read across the block boundary...

 Even if HDFS does read across the blocks, it is still not a good idea to
 ignore the JavaDoc for read(): a single call is always allowed to return
 fewer bytes than requested. If you want all the bytes read, you should use
 a while loop or one of the readFully() variants. For example, if you later
 change your code by wrapping a BufferedInputStream around 'in', you would
 still get partial reads even if HDFS reads all the data.

 Raghu.


 forbbs forbbs wrote:

 The Hadoop version is 0.19.0.
 My file is larger than 64MB, and the block size is 64MB.

 The output of the code below is '10'. Can I read across the block
 boundary, or should I use 'while (left..){}' style code?

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream fin = fs.open(new Path(args[0]));

    fin.seek(64 * 1024 * 1024 - 10);
    byte[] buffer = new byte[32 * 1024];
    int len = fin.read(buffer);
    //int len = fin.read(buffer, 0, 128);
    System.out.println(len);

    fin.close();
  }





Re: FSDataInputStream.read(byte[]) only reads to a block boundary?

2009-06-28 Thread M. C. Srivas
On Sun, Jun 28, 2009 at 3:01 PM, Matei Zaharia ma...@cloudera.com wrote:

 This kind of partial read is often used by the OS to return control to your
 application as soon as possible when reading more data would block, in case
 you can begin computing on the partial data. In some applications that's not
 useful, but when you can start computing on partial data, it allows the OS
 to overlap IO with your computation, improving throughput. I think
 FSDataInputStream returns at the block boundary for the same reason:
 presumably the next block has to be fetched from a (possibly different)
 datanode, so the client hands back what it already has rather than waiting.


It is very unusual, nay, unexpected to the point of bizarre, for the OS to
do this on a regular file; it is typically only seen on network fds.




 On Sun, Jun 28, 2009 at 11:16 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

 
  This seems to be the case. I don't think there is any specific reason not
  to read across the block boundary...

  Even if HDFS does read across the blocks, it is still not a good idea to
  ignore the JavaDoc for read(): a single call is always allowed to return
  fewer bytes than requested. If you want all the bytes read, you should
  use a while loop or one of the readFully() variants. For example, if you
  later change your code by wrapping a BufferedInputStream around 'in', you
  would still get partial reads even if HDFS reads all the data.

  Raghu.
 
 
  forbbs forbbs wrote:
 
  The Hadoop version is 0.19.0.
  My file is larger than 64MB, and the block size is 64MB.

  The output of the code below is '10'. Can I read across the block
  boundary, or should I use 'while (left..){}' style code?
 
   public static void main(String[] args) throws IOException {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     FSDataInputStream fin = fs.open(new Path(args[0]));

     fin.seek(64 * 1024 * 1024 - 10);
     byte[] buffer = new byte[32 * 1024];
     int len = fin.read(buffer);
     //int len = fin.read(buffer, 0, 128);
     System.out.println(len);

     fin.close();
   }