On Tue, Nov 24, 2009 at 10:35 AM, Raghu Angadi <[email protected]> wrote:
> Sequential read is the simplest case and it is pretty hard to improve upon
> the current raw performance (HDFS client does take more CPU than one might
> expect, Todd implemented an improvement for CPU consumed).
>
> Just to reiterate what Todd said, there is an implicit read ahead for
> sequential reads with TCP buffers and kernel read ahead on Datanodes.
>

The one place where explicit readahead may help us is in working around the
fact that Linux's readahead implementation does very poorly at detecting
sequential access when you have multiple parallel sequential readers on the
same block device. This is often the case with Hadoop, and the default I/O
schedulers do a pretty bad job of it. Explicitly doing your own readahead
allows the scheduler to do a better job of avoiding seeks, and you can
overlap CPU and IO much better. I think this would benefit the various
Mergers in particular.

-Todd

> If you extend the read ahead buffer to be more of a buffer cache for the
> block, it could have big impact for some read access patterns (e.g. binary
> search).
>
> Raghu.
>
> On Mon, Nov 23, 2009 at 11:23 PM, Martin Mituzas <[email protected]> wrote:
>
> > I read the code and find the call
> > DFSInputStream.read(buf, off, len)
> > will cause the DataNode read len bytes (or less if encountering the end
> > of block), why does not hdfs read ahead to improve performance for
> > sequential read?
> > --
> > View this message in context:
> > http://old.nabble.com/why-does-not-hdfs-read-ahead---tp26491449p26491449.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
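
To make the "explicit readahead" idea in this thread concrete, here is a minimal sketch in plain Java. It is not HDFS code and not a proposed patch: ReadAheadInputStream, chunkSize, and maxChunks are hypothetical names, and the sketch only assumes a generic java.io.InputStream as the source. A background thread pulls large chunks from the source into a bounded queue, so the source sees fewer, larger sequential reads and the fetch overlaps with whatever CPU work the consumer does between read() calls.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch: prefetches chunks from a stream on a background thread. */
class ReadAheadInputStream extends InputStream {
    private static final byte[] EOF = new byte[0];   // end-of-stream marker, compared by identity

    private final InputStream in;
    private final BlockingQueue<byte[]> chunks;      // bounded, so read-ahead depth is capped
    private final Thread fetcher;
    private volatile IOException failure;            // set by the fetcher before it enqueues EOF
    private byte[] current = new byte[0];            // chunk currently being consumed
    private int pos = 0;
    private boolean done = false;

    ReadAheadInputStream(InputStream in, int chunkSize, int maxChunks) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.in = in;
        this.chunks = new ArrayBlockingQueue<>(maxChunks);
        this.fetcher = new Thread(() -> pump(chunkSize), "read-ahead");
        this.fetcher.setDaemon(true);
        this.fetcher.start();
    }

    /** Fetcher loop: issue large reads and hand the chunks to the consumer. */
    private void pump(int chunkSize) {
        try {
            try {
                byte[] buf = new byte[chunkSize];
                int n;
                while ((n = in.read(buf)) != -1) {
                    if (n > 0) chunks.put(Arrays.copyOf(buf, n)); // blocks while the queue is full
                }
            } catch (IOException e) {
                failure = e;                          // surfaced to the consumer at EOF
            }
            chunks.put(EOF);
        } catch (InterruptedException e) {
            // close() interrupted the fetcher; nothing more to deliver
        }
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xff;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) return 0;
        while (pos == current.length) {               // current chunk exhausted
            if (done) return -1;
            try {
                current = chunks.take();              // waits only if the fetcher has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted waiting for read-ahead", e);
            }
            pos = 0;
            if (current == EOF) {
                done = true;
                if (failure != null) throw failure;
                return -1;
            }
        }
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, b, off, n);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        fetcher.interrupt();
        in.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20];
        try (InputStream s = new ReadAheadInputStream(new ByteArrayInputStream(data), 64 * 1024, 4)) {
            byte[] buf = new byte[4096];
            long total = 0;
            int n;
            while ((n = s.read(buf, 0, buf.length)) != -1) total += n;
            System.out.println("read " + total + " bytes");
        }
    }
}
```

The bounded queue is the main design choice here: chunkSize times maxChunks caps how far ahead the fetcher runs, which in turn caps memory per open stream. Whether this helps in practice depends on the workload. The sketch only shows the buffering and overlap mechanics, not any measured benefit.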
