Thanks again Steve. I will try to implement it with thrift.
On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran <[email protected]> wrote:

> On 07/07/11 08:22, Rita wrote:
>
>> Thanks Steve. This is exactly what I was looking for. Unfortunately, I
>> don't see any example code for the implementation.
>
> No. I think I have access to Russ's source somewhere, but there'd be
> paperwork in getting it released. Russ said it wasn't too hard to do; he
> just had to patch the DFS client to offer up the entire list of block
> locations to the client, and let the client program make the decision. If
> you discussed this on the hdfs-dev list (via a JIRA), you may be able to
> get a patch for this accepted, though you have to do the code and tests
> yourself.
>
>> On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran <[email protected]> wrote:
>>
>>> On 06/07/11 11:08, Rita wrote:
>>>
>>>> I have many large files ranging from 2 GB to 800 GB, and I use
>>>> hadoop fs -cat a lot to pipe to various programs.
>>>>
>>>> I was wondering if it's possible to prefetch the data for clients with
>>>> more bandwidth. Most of my clients have a 10 Gb interface and the
>>>> datanodes are 1 Gb.
>>>>
>>>> I was thinking: prefetch x blocks (even though it will cost extra
>>>> memory) while reading block y. After block y is read, read the
>>>> prefetched block and then throw it away.
>>>>
>>>> It should be used like this:
>>>>
>>>> export PREFETCH_BLOCKS=2 # default would be 1
>>>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>>>>
>>>> Any thoughts?
>>>
>>> Look at Russ Perry's work on doing very fast fetches from an HDFS
>>> filestore:
>>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>>>
>>> Here the DFS client got some extra data on where every copy of every
>>> block was, and the client decided which machine to fetch it from. This
>>> made the best use of the entire cluster, by keeping each datanode busy.
>>>
>>> -steve

--
--- Get your facts first, then you can distort them as you please. --
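The read-ahead idea quoted above (fetch the next x blocks in the background while the consumer processes block y) can be sketched in plain Java as a wrapper around any InputStream. This is only an illustrative sketch, not the actual DFSClient API; the class name, chunk size, and `prefetchBlocks` parameter are all hypothetical stand-ins for the proposed PREFETCH_BLOCKS setting.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a background thread reads chunks ahead of the
// consumer and hands them over a bounded queue. The queue capacity plays
// the role of PREFETCH_BLOCKS: the fetcher blocks once it is that many
// chunks ahead, bounding the extra memory used.
public class PrefetchReader {
    private static final byte[] EOF = new byte[0]; // end-of-stream sentinel

    private final BlockingQueue<byte[]> queue;

    public PrefetchReader(InputStream in, int chunkSize, int prefetchBlocks) {
        this.queue = new ArrayBlockingQueue<>(prefetchBlocks);
        Thread fetcher = new Thread(() -> {
            try {
                while (true) {
                    byte[] buf = new byte[chunkSize];
                    int n = in.read(buf);
                    if (n < 0) break;
                    byte[] chunk = new byte[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk); // blocks when the read-ahead window is full
                }
            } catch (IOException | InterruptedException ignored) {
                // sketch only: a real client would surface the error
            } finally {
                try { queue.put(EOF); } catch (InterruptedException ignored) {}
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
    }

    /** Returns the next chunk, or null at end of stream. */
    public byte[] nextChunk() throws InterruptedException {
        byte[] chunk = queue.take();
        return chunk == EOF ? null : chunk;
    }

    public static void main(String[] args) throws Exception {
        // Demo on an in-memory stream: chunks of 4 bytes, 2 chunks read ahead.
        byte[] data = "0123456789abcdef".getBytes();
        PrefetchReader r = new PrefetchReader(new ByteArrayInputStream(data), 4, 2);
        StringBuilder out = new StringBuilder();
        byte[] c;
        while ((c = r.nextChunk()) != null) out.append(new String(c));
        System.out.println(out); // reassembles the original stream in order
    }
}
```

A real -pcat would wrap the HDFS stream the same way, with chunkSize set to the file's block size; the decision of *which* datanode serves each prefetched block is the part Russ's patch addressed.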
