Thanks again Steve. I will try to implement it with thrift.
On Thu, Jul 7, 2011 at 5:35 AM, Steve Loughran <[email protected]> wrote:

> On 07/07/11 08:22, Rita wrote:
>
>> Thanks Steve. This is exactly what I was looking for. Unfortunately, I
>> don't see any example code for the implementation.
>
> No. I think I have access to Russ's source somewhere, but there'd be
> paperwork in getting it released. Russ said it wasn't too hard to do; he
> just had to patch the DFS client to offer up the entire list of block
> locations to the client, and let the client program make the decision. If
> you discussed this on the hdfs-dev list (via a JIRA), you may be able to
> get a patch for this accepted, though you have to do the code and tests
> yourself.
>
>> On Wed, Jul 6, 2011 at 7:35 AM, Steve Loughran <[email protected]> wrote:
>>
>>> On 06/07/11 11:08, Rita wrote:
>>>
>>>> I have many large files ranging from 2 GB to 800 GB, and I use
>>>> hadoop fs -cat a lot to pipe to various programs.
>>>>
>>>> I was wondering if it's possible to prefetch the data for clients with
>>>> more bandwidth. Most of my clients have a 10 Gb interface and the
>>>> datanodes are 1 Gb.
>>>>
>>>> I was thinking: prefetch x blocks (even though it will cost extra
>>>> memory) while reading block y. After block y is read, read the
>>>> prefetched block and then throw it away.
>>>>
>>>> It should be used like this:
>>>>
>>>> export PREFETCH_BLOCKS=2 # default would be 1
>>>> hadoop fs -pcat hdfs://namenode/verylarge file | program
>>>>
>>>> Any thoughts?
>>>
>>> Look at Russ Perry's work on doing very fast fetches from an HDFS
>>> filestore:
>>> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.pdf
>>>
>>> Here the DFS client got some extra data on where every copy of every
>>> block was, and the client decided which machine to fetch it from. This
>>> made the best use of the entire cluster, by keeping each datanode busy.
>>>
>>> -steve

--
--- Get your facts first, then you can distort them as you please. --
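The read-ahead idea quoted above (fetch the next x blocks in the background while the consumer processes block y) can be sketched in plain Java as a wrapper around any InputStream. This is only an illustrative sketch, not the actual DFSClient API; the class name, chunk size, and `prefetchBlocks` parameter are all hypothetical stand-ins for the proposed PREFETCH_BLOCKS setting.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a background thread reads chunks ahead of the
// consumer and hands them over a bounded queue. The queue capacity plays
// the role of PREFETCH_BLOCKS: the fetcher blocks once it is that many
// chunks ahead, bounding the extra memory used.
public class PrefetchReader {
    private static final byte[] EOF = new byte[0]; // end-of-stream sentinel

    private final BlockingQueue<byte[]> queue;

    public PrefetchReader(InputStream in, int chunkSize, int prefetchBlocks) {
        this.queue = new ArrayBlockingQueue<>(prefetchBlocks);
        Thread fetcher = new Thread(() -> {
            try {
                while (true) {
                    byte[] buf = new byte[chunkSize];
                    int n = in.read(buf);
                    if (n < 0) break;
                    byte[] chunk = new byte[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk); // blocks when the read-ahead window is full
                }
            } catch (IOException | InterruptedException ignored) {
                // sketch only: a real client would surface the error
            } finally {
                try { queue.put(EOF); } catch (InterruptedException ignored) {}
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
    }

    /** Returns the next chunk, or null at end of stream. */
    public byte[] nextChunk() throws InterruptedException {
        byte[] chunk = queue.take();
        return chunk == EOF ? null : chunk;
    }

    public static void main(String[] args) throws Exception {
        // Demo on an in-memory stream: chunks of 4 bytes, 2 chunks read ahead.
        byte[] data = "0123456789abcdef".getBytes();
        PrefetchReader r = new PrefetchReader(new ByteArrayInputStream(data), 4, 2);
        StringBuilder out = new StringBuilder();
        byte[] c;
        while ((c = r.nextChunk()) != null) out.append(new String(c));
        System.out.println(out); // reassembles the original stream in order
    }
}
```

A real -pcat would wrap the HDFS stream the same way, with chunkSize set to the file's block size; the decision of *which* datanode serves each prefetched block is the part Russ's patch addressed.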
