On Mon, Jan 3, 2011 at 5:05 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> On Jan 3, 2011, at 5:17 PM, Christopher Smith wrote:
>
> > On Mon, Jan 3, 2011 at 11:40 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
> >
> >> It's not immediately clear to me the size of the benefit versus the costs.
> >> Two cases where one normally thinks about direct I/O are:
> >> 1) The usage scenario is a cache anti-pattern. This will be true for some
> >> Hadoop use cases (MapReduce), not true for some others.
> >>   - http://www.jeffshafer.com/publications/papers/shafer_ispass10.pdf
> >> 2) The application manages its own cache. Not applicable.
> >> Atom processors, which you mention below, will just exacerbate (1) due to
> >> the small cache size.
> >
> > Actually, assuming you thrash the cache anyway, having a smaller cache can
> > often be a good thing. ;-)
>
> Assuming no other thread wants to use that poor cache you are thrashing ;)

Even then: a small cache can be cleared up more quickly. As in all cases, it
very much depends on the circumstances, but much like O_DIRECT, if you are
blowing the cache anyway, there is little at stake.

> >> All-in-all, doing this specialization such that you don't hurt the
> >> general case is going to be tough.
> >
> > For the Hadoop case, the advantages of O_DIRECT would seem petty compared
> > to using O_APPEND and/or MMAP (yes, I realize this is not quite the same
> > as what you are proposing, but it seems close enough for most cases).
> > Your best case for a win is when you have reasonably random access to a
> > file, and then something else that would benefit from more love.
>
> Actually, our particular site would greatly benefit from O_DIRECT - we have
> non-MapReduce clients with a highly non-repetitive, random read I/O pattern
> with an actively managed application-level read-ahead (note: because we're
> almost guaranteed to wait for a disk seek - 2PB of SSDs are a touch pricey -
> the latency overheads of Java are not actually too important). The OS page
> cache is mostly useless for us as the working set size is on the order of a
> few hundred TB.

Sounds like a lot of fun! Even in a circumstance like the one you describe,
unless the I/O pattern isn't truly random and some application-level insight
provides a unique advantage, the page cache will often do a better job of
managing the memory, both in terms of caching and read-ahead (it becomes a
lot like "building a better TCP using UDP": possible, but not really worth
the effort). If you can pull off zero-copy I/O, then O_DIRECT can be a huge
win, but Java makes that very, very difficult and horribly painful to manage.

> However, I wouldn't actively clamor for O_DIRECT support, but could
> probably do wonders with an HDFS equivalent to fadvise. I really don't want
> to get into the business of managing buffering in my application code any
> more than we already do.

Yes, I think a few minor, simple tweaks to HDFS could help tremendously,
particularly for Map/Reduce style jobs.

> PS - if there are bored folks wanting to do something beneficial to
> high-performance HDFS, I'd note that currently it is tough to get >1Gbps
> performance from a single Hadoop client transferring multiple files.
> However, HP Labs had a clever approach:
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf . I'd love to see
> a generic, easy-to-use API to do this.

Interesting. We haven't tried to push the envelope, but we have achieved >1Gbps...
I can't recall if we ever got over 2Gbps, though...

--
Chris
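
To make the zero-copy point above concrete, here is a minimal Java sketch using
FileChannel.transferTo (the JDK's sendfile-style path). The file path and
destination address are placeholders. Note that transferTo avoids the copy into
user space but still flows through the page cache, which is part of why
combining zero-copy with an O_DIRECT-style cache bypass is so awkward from Java.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Placeholder source file and destination; adjust for your setup.
        try (FileChannel src = FileChannel.open(Paths.get("/data/blk_12345"),
                                                StandardOpenOption.READ);
             SocketChannel dst = SocketChannel.open(
                     new InetSocketAddress("sink.example.com", 9000))) {
            long pos = 0;
            long size = src.size();
            // transferTo() lets the kernel move data from the page cache to
            // the socket (sendfile on Linux) without a copy into user space.
            // It still goes *through* the page cache, though, so it does not
            // give the cache-bypass behaviour of O_DIRECT.
            while (pos < size) {
                pos += src.transferTo(pos, size - pos, dst);
            }
        }
    }
}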
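
For the fadvise idea, a rough sketch of what the underlying posix_fadvise(2)
hint looks like when issued from Java. This assumes 64-bit Linux, the JNA
library on the classpath, and a JDK whose FileDescriptor keeps the raw
descriptor in a private int field named "fd" (a non-portable hack); it
illustrates the system call itself and is not an existing HDFS or Hadoop API.

import com.sun.jna.Native;

import java.io.FileDescriptor;
import java.io.FileInputStream;
import java.lang.reflect.Field;

// Illustrative only: binds posix_fadvise(2) from libc via JNA direct mapping
// and tells the kernel a byte range will not be re-read, so its pages can be
// dropped from the cache. Assumes 64-bit Linux and JNA on the classpath.
public final class FadviseSketch {
    static {
        Native.register("c"); // resolve the native method below against libc
    }

    private static final int POSIX_FADV_DONTNEED = 4; // Linux value

    private static native int posix_fadvise(int fd, long offset, long len, int advice);

    // Hint that [offset, offset+len) of an already-read file can leave the cache.
    public static void dropCache(FileInputStream in, long offset, long len) throws Exception {
        // Non-portable hack: pull the raw fd out of java.io.FileDescriptor.
        Field f = FileDescriptor.class.getDeclaredField("fd");
        f.setAccessible(true);
        int fd = f.getInt(in.getFD());
        posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }
}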
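
On the >1Gbps-from-one-client point, one generic (if blunt) approach is simply
to fan the per-file reads out across threads so several datanode connections
are active at once. The sketch below uses only the stock FileSystem and
FSDataInputStream API; the paths and thread count are placeholders, and this is
a plain fan-out, not the striping scheme from the HP Labs report.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch: read several HDFS files concurrently from one client so more than
// one datanode/TCP stream is active at a time.
public class ParallelFetch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        List<Path> files = Arrays.asList(new Path("/data/part-00000"),
                                         new Path("/data/part-00001"),
                                         new Path("/data/part-00002"));

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (Path p : files) {
            pool.submit(() -> {
                byte[] buf = new byte[1 << 20]; // 1 MB read buffer
                long total = 0;
                try (FSDataInputStream in = fs.open(p)) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        total += n; // consume or forward the bytes here
                    }
                    System.out.println(p + ": " + total + " bytes");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}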