All,
While this is an interesting topic for debate, I think it's a moot point.
A lot of DBAs (Especially Informix DBAs) don't agree with Linus. (I'm referring
to an earlier post in this thread that referenced a quote from Linus T.) Direct
I/O is a good thing. But if Linus is removing it from Linux...
But with respect to Hadoop... disk i/o shouldn't be a major topic. I mean if it
were, then why isn't anyone pushing the use of SSDs? Or if they are too
expensive for your budget, why not SAS drives that spin at 15K?
Ok, those points are rhetorical. The simple solution is that if you're i/o
bound, you add more nodes with more disk to further distribute the load, right?
Also, I may be wrong, but do all OS(s) that one can run Hadoop, handle Direct
I/O? And handle it in a common way? So won't you end up having machine/OS
specific classes?
IMHO there are other features that don't yet exist in Hadoop/HBase that will
yield a better ROI.
Ok, so I may be way off base, so I'll shut up now... ;-P
-Mike
-----Original Message-----
From: Christopher Smith [mailto:cbsm...@gmail.com]
Sent: Tuesday, January 04, 2011 8:56 AM
To: common-dev@hadoop.apache.org
Subject: Re: Hadoop use direct I/O in Linux?
On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman<bbock...@cse.unl.edu>wrote:
The I/O pattern isn't truly random. To convert from physicist terms to CS
terms, the application is iterating through the rows of a column-oriented
store, reading out somewhere between 1 and 10% of the columns. The twist is
that the columns are compressed, meaning the size of a set of rows on disk
is variable.
We're getting pretty far off topic here, but this is an interesting problem.
It *sounds* to me like a "compressed bitmap index" problem, possibly with
bloom filters for joins (basically what HBase/Cassandra/Hypertable get in
to, or in a less distributed case: MonetDB). Is that on the money?
This prevents any sort of OS page cache stride detection from helping -
the OS sees everything as random.
It seems though like if you organized the data a certain way, the OS page
cache could help.
However, the application also has an index of where each row is located,
meaning if it knows the active set of columns, it can predict the reads the
client will perform and do a read-ahead.
Yes, this is the kind of advantage where O_DIRECT might help, although I'd
hope in this kind of circumstance the OS buffer cache would mostly give up
anyway and just give as much of the available RAM as possible to the app. In
that case memory mapped files with a thread doing a bit of read ahead would
seem like not that much slower than using O_DIRECT.
That said, I have to wonder how often this problem devolves in to a straight
forward column scan. I mean, with a 1-10% hit rate, you need SSD seek times
for it to make sense to seek to specific records vs. just scanning through
the whole column, or to put it another way: "disk is the new tape". ;-)
Some days, it does feel like "building a better TCP using UDP". However,
we got a 3x performance improvement by building it (and multiplying by
10-15k cores for just our LHC experiment, that's real money!), so it's a
particular monstrosity we are stuck with.
It sure sounds like a problem better suited to C++ than Java though. What
benefits do you yield from doing all this with a JVM?