On Mon, Jan 3, 2011 at 5:05 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

>  On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
> > On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman <bbock...@cse.unl.edu>
> > wrote:
> >
> >> It's not immediately clear to me the size of the benefit versus the costs.
> >> Two cases where one normally thinks about direct I/O are:
> >> 1) The usage scenario is a cache anti-pattern.  This will be true for some
> >> Hadoop use cases (MapReduce), not true for some others.
> >> - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
> >> 2) The application manages its own cache.  Not applicable.
> >> Atom processors, which you mention below, will just exacerbate (1) due to
> >> the small cache size.
> >>
> >
> > Actually, assuming you thrash the cache anyway, having a smaller cache can
> > often be a good thing. ;-)
>
> Assuming no other thread wants to use that poor cache you are thrashing ;)


Even then: a small cache can be cleared up more quickly. As in all cases, it
very much depends on circumstance, but much like O_DIRECT, if you are
blowing the cache anyway, there is little at stake.

> >> All-in-all, doing this specialization such that you don't hurt the general
> >> case is going to be tough.
> >
> > For the Hadoop case, the advantages of O_DIRECT would seem comparatively
> > small next to using O_APPEND and/or MMAP (yes, I realize this is not quite
> > the same as what you are proposing, but it seems close enough for most
> > cases). Your best case for a win is when you have reasonably random access
> > to a file, and then something else that would benefit from more of the page
> > cache.
>
> Actually, our particular site would greatly benefit from O_DIRECT - we have
> non-MapReduce clients with a highly non-repetitive, random read I/O pattern
> with an actively managed application-level read-ahead (note: because we're
> almost guaranteed to wait for a disk seek - 2PB of SSDs are a touch pricey -
> the latency overheads of Java are not actually too important).  The OS page
> cache is mostly useless for us as the working set size is on the order of a
> few hundred TB.
>

Sounds like a lot of fun! Even in a circumstance like the one you describe,
unless the I/O pattern has structure that application-level insight can
exploit, the page cache will often do a better job of managing the memory,
both in terms of caching and read-ahead (it becomes a lot like "building a
better TCP using UDP": possible, but not really worth the effort). If you can
pull off zero-copy I/O, O_DIRECT can be a huge win, but Java makes that very,
very difficult and horribly painful to manage.
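
To make the pain concrete, here is roughly what O_DIRECT demands of the
caller on Linux (plain C sketch; the 4096-byte alignment is just an
assumption, the real requirement depends on the device and filesystem):

/* A minimal sketch of what O_DIRECT demands: the buffer, the file offset,
 * and the transfer size all have to be aligned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* 4096 is an assumed alignment; check the device/filesystem in practice. */
    const size_t align = 4096, len = 1 << 20;
    void *buf;
    if (posix_memalign(&buf, align, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    /* Bypasses the page cache entirely; a misaligned request fails with EINVAL. */
    ssize_t n = pread(fd, buf, len, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes without touching the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}

Doing the equivalent from Java means JNI plus hand-managed aligned buffers,
which is where the "horribly painful" part comes in.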


> However, I wouldn't actively clamor for O_DIRECT support, but could
> probably do wonders with an HDFS equivalent to fadvise.  I really don't want
> to get into the business of managing buffering in my application code any
> more than we already do.


Yes, I think a few simple tweaks to HDFS could help tremendously,
particularly for MapReduce-style jobs.
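
For reference, the local-filesystem version of those hints looks something
like the following (a posix_fadvise sketch; an HDFS equivalent would
presumably carry similar per-stream hints, which is exactly what doesn't
exist today):

/* posix_fadvise lets the application describe its access pattern without
 * taking over buffering itself. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Access pattern is random: the kernel shrinks read-ahead accordingly. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* ... the reads happen here ... */

    /* One-shot pass done: drop the pages so they don't crowd out other users. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}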


> PS - if there are bored folks wanting to do something beneficial to
> high-performance HDFS, I'd note that currently it is tough to get >1Gbps
> performance from a single Hadoop client transferring multiple files.
>  However, HP Labs had a clever approach:
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf .  I'd love to see
> a generic, easy-to-use API to do this.
>
>
Interesting. We haven't tried to push the envelope, but we have achieved >1Gbps...
I can't recall if we ever got over 2Gbps, though...
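
FWIW, one straightforward way to drive several transfers concurrently from a
single client is just to fan the reads out over threads via libhdfs. That
gets you part of the way, but it is not the pipelined scheme from the HP
report; a rough sketch (the namenode host/port and file paths are
placeholders):

/* One libhdfs connection, one reader thread per file. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <hdfs.h>

#define BUF_SIZE (4 * 1024 * 1024)

struct job { hdfsFS fs; const char *path; };

static void *reader(void *arg)
{
    struct job *j = arg;
    hdfsFile f = hdfsOpenFile(j->fs, j->path, O_RDONLY, 0, 0, 0);
    if (!f) { fprintf(stderr, "open failed: %s\n", j->path); return NULL; }

    char *buf = malloc(BUF_SIZE);
    long long total = 0;
    tSize n;
    while ((n = hdfsRead(j->fs, f, buf, BUF_SIZE)) > 0)
        total += n;

    printf("%s: %lld bytes\n", j->path, total);
    free(buf);
    hdfsCloseFile(j->fs, f);
    return NULL;
}

int main(void)
{
    const char *paths[] = { "/data/file1", "/data/file2", "/data/file3" };
    enum { NFILES = 3 };

    hdfsFS fs = hdfsConnect("namenode", 8020);  /* placeholder host/port */
    if (!fs) { fprintf(stderr, "hdfsConnect failed\n"); return 1; }

    pthread_t tid[NFILES];
    struct job jobs[NFILES];
    for (int i = 0; i < NFILES; i++) {
        jobs[i] = (struct job){ fs, paths[i] };
        pthread_create(&tid[i], NULL, reader, &jobs[i]);
    }
    for (int i = 0; i < NFILES; i++)
        pthread_join(tid[i], NULL);

    hdfsDisconnect(fs);
    return 0;
}

A generic API that hides this (and does the smarter pipelining) is the part
I'd love to see too.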

-- 
Chris
