Thanks Steve. This is exactly what I was looking for. Unfortunately, I don't see any example code for the implementation.
On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran <[email protected]> wrote:

> On 06/07/11 11:08, Rita wrote:
>
>> I have many large files ranging from 2GB to 800GB, and I use hadoop fs -cat
>> a lot to pipe to various programs.
>>
>> I was wondering if it's possible to prefetch the data for clients with more
>> bandwidth. Most of my clients have 10Gb interfaces and the datanodes are 1Gb.
>>
>> I was thinking: prefetch x blocks (even though it will cost extra memory)
>> while reading block y. After block y is read, read the prefetched blocks
>> and then throw them away.
>>
>> It should be used like this:
>>
>> export PREFETCH_BLOCKS=2 # default would be 1
>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>>
>> Any thoughts?
>
> Look at Russ Perry's work on doing very fast fetches from an HDFS
> filestore:
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>
> Here the DFS client got some extra data on where every copy of every block
> was, and the client decided which machine to fetch it from. This made the
> best use of the entire cluster, by keeping each datanode busy.
>
> -steve

--
--- Get your facts first, then you can distort them as you please.--
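For anyone else looking for example code: below is a minimal, untested Java sketch of the -pcat idea, using only the public FileSystem API. There is no real -pcat subcommand, so this is written as a standalone tool; the class name PrefetchCat and all details are illustrative assumptions, not an existing Hadoop feature. One background thread reads block-sized chunks ahead of the consumer, bounded by a queue of PREFETCH_BLOCKS buffers, while the consumer drains finished buffers to stdout.

// Sketch only: a pipelined "pcat" where a reader thread stays up to
// PREFETCH_BLOCKS block-sized buffers ahead of stdout. Names here are
// hypothetical; this is not part of hadoop fs.
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchCat {
    // Sentinel marking end of stream.
    private static final byte[] EOF = new byte[0];

    public static void main(String[] args) throws Exception {
        Path src = new Path(args[0]);
        int prefetch = Math.max(1, Integer.parseInt(
                System.getenv().getOrDefault("PREFETCH_BLOCKS", "1")));

        Configuration conf = new Configuration();
        FileSystem fs = src.getFileSystem(conf);
        // Assumes the file's block size fits in an int (true for the
        // usual 64-256MB block sizes).
        long blockSize = fs.getFileStatus(src).getBlockSize();

        // Bounded queue: at most 'prefetch' block-sized buffers in memory.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(prefetch);

        Thread reader = new Thread(() -> {
            try (FSDataInputStream in = fs.open(src)) {
                while (true) {
                    byte[] buf = new byte[(int) blockSize];
                    int off = 0, n;
                    // Fill one block-sized buffer (short reads are normal).
                    while (off < buf.length
                            && (n = in.read(buf, off, buf.length - off)) > 0) {
                        off += n;
                    }
                    if (off == 0) break;              // end of file
                    if (off < buf.length) {           // final partial block
                        byte[] tail = new byte[off];
                        System.arraycopy(buf, 0, tail, 0, off);
                        queue.put(tail);
                        break;
                    }
                    queue.put(buf);                   // blocks when queue is full
                }
                queue.put(EOF);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Consumer: write each buffer to stdout while the reader runs ahead.
        OutputStream out = System.out;
        for (byte[] buf = queue.take(); buf != EOF; buf = queue.take()) {
            out.write(buf);
        }
        out.flush();
        reader.join();
    }
}

Illustrative usage (the jar name is made up):

export PREFETCH_BLOCKS=2
hadoop jar prefetch-cat.jar PrefetchCat hdfs://namenode/path/to/file | program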

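And on Steve's pointer: the per-replica location data the HP paper exploits is reachable from the public API via getFileBlockLocations, which lists every replica of every block. A short sketch follows (the class name BlockMap is made up; the actual fetch scheduling, i.e. deciding which host serves which block, is the part the paper covers and is omitted here).

// Sketch of the raw material for the HPL-2009-345 approach: enumerate
// every replica of every block so a client could spread its reads
// across datanodes. Only shows where the location data comes from.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
    public static void main(String[] args) throws Exception {
        Path src = new Path(args[0]);
        FileSystem fs = src.getFileSystem(new Configuration());
        FileStatus stat = fs.getFileStatus(src);

        // One BlockLocation per block, each listing every replica's host.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(stat, 0, stat.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d len=%d hosts=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
        }
    }
}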