Yes, FSDataInputStream allows random access. There are two ways to read x
bytes at a position p:
1) in.seek(p); in.read(buf, 0, x);
2) in.read(p, buf, 0, x);
These two have slightly different semantics: the second (a positioned
read) does not move the stream's current offset. It is the preferred
form and is easier for HDFS to optimize further.
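A minimal sketch of the two styles described above. The path, position,
and read size are hypothetical, chosen only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/example.dat"); // hypothetical file
        long p = 1 << 20;                         // position to read from
        int x = 4096;                             // number of bytes to read
        byte[] buf = new byte[x];

        FSDataInputStream in = fs.open(path);

        // 1) seek + read: moves the stream's position, then reads from it.
        in.seek(p);
        int n1 = in.read(buf, 0, x);

        // 2) positioned read: reads at p without changing the stream's
        //    current position, so it is safe to use concurrently.
        int n2 = in.read(p, buf, 0, x);

        in.close();
        fs.close();
    }
}
```

In both forms read() may return fewer than x bytes; when exactly x bytes
are required, readFully(p, buf, 0, x) keeps reading until the buffer is
filled.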
Random access performance should be pretty good with HDFS, and it is
getting more users and thus more importance; HBase is one of them. Just
yesterday I attached a benchmark comparing random access against the
native filesystem to https://issues.apache.org/jira/browse/HDFS-236 .
As of now, the overhead is on average about 2 ms on top of the 9-10 ms a
native read takes. There are a few fairly simple fixes that could reduce
this gap.
I think getFileStatus() is the way to find the length, though a call
might have been added to FSDataInputStream recently; I am not sure.
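A sketch of the getFileStatus() approach, again with a hypothetical
path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileLengthSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/example.dat"); // hypothetical file

        // getFileStatus() returns metadata for the path; getLen() is
        // the file length in bytes.
        FileStatus status = fs.getFileStatus(path);
        long length = status.getLen();
        System.out.println("length = " + length + " bytes");

        fs.close();
    }
}
```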
Raghu.
tsuraan wrote:
All the documentation for HDFS says that it's for large streaming
jobs, but I couldn't find an explicit answer to this, so I'll try
asking here. How is HDFS's random seek performance within an
FSDataInputStream? I use lucene with a lot of indices (potentially
thousands), so I was thinking of putting them into HDFS and
reimplementing my search as a Hadoop map-reduce. I've noticed that
lucene tends to do a bit of random seeking when searching though; I
don't believe that it guarantees that all seeks be to increasing file
positions either.
Would HDFS be a bad fit for an access pattern that involves seeks to
random positions within a stream?
Also, is getFileStatus the typical way of getting the length of a file
in HDFS, or is there some method on FSDataInputStream that I'm not
seeing?
Please cc: me on any reply; I'm not on the hadoop list. Thanks!