On Tue, Jan 4, 2011 at 12:58 PM, Da Zheng <zhen...@cs.jhu.edu> wrote:

> The most important reason for me to use direct I/O is that the Atom
> processor is too weak. If I write a simple program that writes data to the
> disk, the CPU is at almost 100% but the disk hasn't reached its maximum
> bandwidth. When I write data to an SSD, the difference is even larger. Even
> when the program has saturated both cores of the CPU, it cannot reach even
> half of the SSD's maximum bandwidth.
>
>

The issue here is most likely checksumming.  Hadoop computes a CRC32 for
every 512 bytes it reads or writes.  On most processors the CPU can easily
keep up and still saturate the pipe, but your Atom is probably falling
behind.  There's a config somewhere to disable checksums; I'd suggest trying
that.  Checksums are much more expensive CPU-wise than simply funneling bytes
on the client side (and the wire protocol, which interleaves checksum data
with file data, is the reason it would be really difficult to rewrite the
client to use direct I/O).
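
Something like the sketch below is roughly what I mean -- untested, and the
exact config knob may differ by version (io.bytes.per.checksum, default 512,
is the checksum granularity if I remember right), but
FileSystem.setVerifyChecksum() should at least show whether the CRC work is
what's eating the Atom:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough benchmark: read a file with client-side checksum verification
// turned off and see whether the disk/SSD can now be kept busy.
public class NoChecksumReadTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.setVerifyChecksum(false);   // skip CRC32 verification on the read path
    byte[] buf = new byte[1024 * 1024];
    long total = 0;
    long start = System.nanoTime();
    FSDataInputStream in = fs.open(new Path(args[0]));
    try {
      int n;
      while ((n = in.read(buf)) != -1) {
        total += n;                // pure byte funneling, no per-512-byte CRC work
      }
    } finally {
      in.close();
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%d bytes in %.2f s (%.1f MB/s)%n",
        total, secs, total / secs / 1e6);
  }
}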





> I don't know how much benefit direct I/O can bring to a normal processor
> such as a Xeon, but I have a feeling I have to use direct I/O in order to
> get good performance on Atom processors.
>
> Best,
> Da
>
>
> On 01/04/2011 10:12 AM, Segel, Mike wrote:
>
>> All,
>> While this is an interesting topic for debate, I think it's a moot point.
>> A lot of DBAs (especially Informix DBAs) don't agree with Linus. (I'm
>> referring to an earlier post in this thread that referenced a quote from
>> Linus T.) Direct I/O is a good thing. But if Linus is removing it from
>> Linux...
>>
>> But with respect to Hadoop... disk I/O shouldn't be a major topic. I mean,
>> if it were, then why isn't anyone pushing the use of SSDs? Or, if they are
>> too expensive for your budget, why not SAS drives that spin at 15K?
>> Ok, those points are rhetorical. The simple solution is that if you're
>> I/O-bound, you add more nodes with more disks to further distribute the
>> load, right?
>>
>> Also, I may be wrong, but do all of the OSes on which one can run Hadoop
>> handle direct I/O? And handle it in a common way? So won't you end up with
>> machine/OS-specific classes?
>>
>> IMHO there are other features that don't yet exist in Hadoop/HBase that
>> will yield a better ROI.
>>
>> Ok, so I may be way off base, so I'll shut up now... ;-P
>>
>> -Mike
>>
>>
>>
>> -----Original Message-----
>> From: Christopher Smith [mailto:cbsm...@gmail.com]
>> Sent: Tuesday, January 04, 2011 8:56 AM
>> To: common-dev@hadoop.apache.org
>> Subject: Re: Hadoop use direct I/O in Linux?
>>
>> On Mon, Jan 3, 2011 at 7:15 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>>
>>> The I/O pattern isn't truly random.  To convert from physicist terms to CS
>>> terms, the application is iterating through the rows of a column-oriented
>>> store, reading out somewhere between 1 and 10% of the columns.  The twist
>>> is that the columns are compressed, meaning the size of a set of rows on
>>> disk is variable.
>>>
>> We're getting pretty far off topic here, but this is an interesting
>> problem. It *sounds* to me like a "compressed bitmap index" problem,
>> possibly with bloom filters for joins (basically what
>> HBase/Cassandra/Hypertable get into, or in a less distributed case:
>> MonetDB). Is that on the money?
>>
>>
>>> This prevents any sort of OS page cache stride detection from helping -
>>> the OS sees everything as random.
>>>
>> It seems, though, that if you organized the data a certain way, the OS page
>> cache could help.
>>
>>
>>> However, the application also has an index of where each row is located,
>>> meaning if it knows the active set of columns, it can predict the reads
>>> the client will perform and do a read-ahead.
>>>
>> Yes, this is the kind of advantage where O_DIRECT might help, although I'd
>> hope that in this kind of circumstance the OS buffer cache would mostly
>> give up anyway and just give as much of the available RAM as possible to
>> the app. In that case, memory-mapped files with a thread doing a bit of
>> read-ahead would seem not much slower than using O_DIRECT.
>>
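
(For what it's worth, a minimal sketch of that mmap-plus-read-ahead idea --
purely illustrative: a plain local file, a hypothetical list of predicted
offsets standing in for the application's row index, and a background thread
that touches pages so the kernel faults them in before the reader arrives.)

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.List;

public class MmapReadAhead {
  private static final int CHUNK = 64 * 1024;   // assumed bytes read per row group
  private static final int PAGE = 4096;         // typical Linux page size

  // 'predictedOffsets' is a stand-in for the application's row index.
  static long scan(String path, final List<Long> predictedOffsets) throws Exception {
    RandomAccessFile raf = new RandomAccessFile(path, "r");
    try {
      FileChannel ch = raf.getChannel();
      // Assume for simplicity the file fits in a single mapping (< 2 GB).
      final MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

      // Read-ahead thread: work on a duplicate view and touch one byte per
      // page so the kernel pages the data in ahead of the reader.
      final ByteBuffer ahead = map.duplicate();
      Thread readAhead = new Thread(new Runnable() {
        public void run() {
          for (long off : predictedOffsets) {
            int end = (int) Math.min(off + CHUNK, ahead.capacity());
            for (int p = (int) off; p < end; p += PAGE) {
              ahead.get(p);                      // side effect only: fault the page in
            }
          }
        }
      });
      readAhead.setDaemon(true);
      readAhead.start();

      // Foreground reader consumes the same regions (here it just sums bytes).
      long sum = 0;
      for (long off : predictedOffsets) {
        int end = (int) Math.min(off + CHUNK, map.capacity());
        for (int p = (int) off; p < end; p++) {
          sum += map.get(p);
        }
      }
      return sum;
    } finally {
      raf.close();
    }
  }
}
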
>> That said, I have to wonder how often this problem devolves into a
>> straightforward column scan. I mean, with a 1-10% hit rate, you need SSD
>> seek times for it to make sense to seek to specific records vs. just
>> scanning through the whole column, or to put it another way: "disk is the
>> new tape". ;-)
>>
>>
>>> Some days, it does feel like "building a better TCP using UDP".  However,
>>> we got a 3x performance improvement by building it (and multiplying by
>>> 10-15k cores for just our LHC experiment, that's real money!), so it's a
>>> particular monstrosity we are stuck with.
>>>
>>
>> It sure sounds like a problem better suited to C++ than Java, though. What
>> benefits do you gain from doing all this with a JVM?
>>
>>
>
