Yes, FSDataInputStream allows random access. There are two ways to read x
bytes at a position p:
1) in.seek(p); in.read(buf, 0, x);
2) in.read(p, buf, 0, x);
These two have slightly different semantics: the second (a positioned
read) does not move the stream's current offset. It is the preferred
form and is easier for HDFS to optimize further.
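A minimal sketch of the two styles described above. The path, position,
and read size are hypothetical, chosen only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/example.dat"); // hypothetical file
        long p = 1 << 20;                         // position to read from
        int x = 4096;                             // number of bytes to read
        byte[] buf = new byte[x];

        FSDataInputStream in = fs.open(path);

        // 1) seek + read: moves the stream's position, then reads from it.
        in.seek(p);
        int n1 = in.read(buf, 0, x);

        // 2) positioned read: reads at p without changing the stream's
        //    current position, so it is safe to use concurrently.
        int n2 = in.read(p, buf, 0, x);

        in.close();
        fs.close();
    }
}
```

In both forms read() may return fewer than x bytes; when exactly x bytes
are required, readFully(p, buf, 0, x) keeps reading until the buffer is
filled.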
Random access performance should be pretty good with HDFS, and it is
getting more users and thus more importance; HBase is one of them. Just
yesterday I attached a benchmark comparing random access against the
native filesystem to https://issues.apache.org/jira/browse/HDFS-236 .
As of now, the overhead is on average about 2 ms on top of the 9-10 ms a
native read takes. There are a few fairly simple fixes that could reduce
this gap.
I think getFileStatus() is the way to find the length, though a call
might have been added to FSDataInputStream recently; I am not sure.
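A sketch of the getFileStatus() approach, again with a hypothetical
path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileLengthSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/example.dat"); // hypothetical file

        // getFileStatus() returns metadata for the path; getLen() is
        // the file length in bytes.
        FileStatus status = fs.getFileStatus(path);
        long length = status.getLen();
        System.out.println("length = " + length + " bytes");

        fs.close();
    }
}
```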
Raghu.
tsuraan wrote:
All the documentation for HDFS says that it's for large streaming
jobs, but I couldn't find an explicit answer to this, so I'll try
asking here. How is HDFS's random seek performance within an
FSDataInputStream? I use lucene with a lot of indices (potentially
thousands), so I was thinking of putting them into HDFS and
reimplementing my search as a Hadoop map-reduce. I've noticed that
lucene tends to do a bit of random seeking when searching though; I
don't believe that it guarantees that all seeks be to increasing file
positions either.
Would HDFS be a bad fit for an access pattern that involves seeks to
random positions within a stream?
Also, is getFileStatus the typical way of getting the length of a file
in HDFS, or is there some method on FSDataInputStream that I'm not
seeing?
Please cc: me on any reply; I'm not on the hadoop list. Thanks!