I presume you meant that the act of 'mounting' itself is not bad - but letting 
the entire cluster start reading from a hapless filer is :-)
 
I have actually found it very useful to upload files through map-reduce. We have 
periodic jobs that in effect tail NFS files and copy the data into HDFS. Because 
the copy tasks land on more or less random nodes, the data ends up uniformly 
distributed across the cluster. And because we run periodically, we usually don't 
need more than a task or two copying in parallel.
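 
Roughly, the per-task copy boils down to something like the sketch below. The class and variable names are made up for illustration (this is not our actual code); it just resumes from a saved offset on the NFS side and streams the new bytes into HDFS:

    import java.io.RandomAccessFile;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NfsTailCopier {
      // Copy everything past lastOffset from an NFS-mounted file into a new
      // HDFS file, and return the new offset to record for the next run.
      public static long copyTail(String nfsPath, long lastOffset,
                                  Path hdfsTarget, Configuration conf)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        RandomAccessFile in = new RandomAccessFile(nfsPath, "r");
        FSDataOutputStream out = fs.create(hdfsTarget);
        try {
          in.seek(lastOffset);                 // resume where the previous run stopped
          byte[] buf = new byte[64 * 1024];
          int n;
          while ((n = in.read(buf)) > 0) {
            out.write(buf, 0, n);              // stream the new bytes into HDFS
          }
          return in.getFilePointer();          // offset to persist for the next run
        } finally {
          out.close();
          in.close();
        }
      }
    }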
 
The nice thing is that if we do ever fall behind (network glitches, filer 
overload, whatever), the code automatically increases the number of readers to 
catch up, within a bound on the number of concurrent readers - something I would 
have a lot more trouble doing outside of Hadoop.
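 
The scaling logic itself is tiny - something along these lines, where backlogBytes, bytesPerTaskPerRun and maxReaders are illustrative names rather than our real configuration knobs:

    public class ReaderScaler {
      // Pick the number of parallel readers from the current backlog,
      // clamped between 1 and a configured ceiling on concurrent readers.
      public static int pickNumReaders(long backlogBytes,
                                       long bytesPerTaskPerRun,
                                       int maxReaders) {
        int wanted = (int) Math.ceil((double) backlogBytes / bytesPerTaskPerRun);
        return Math.max(1, Math.min(wanted, maxReaders));
      }
    }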
 
The low-hanging fruit we can contribute back is a set of improvements to distcp 
(wildcards, parallel transfer of large text files). The larger setup is 
interesting too - almost a self-adjusting parallel rsync - but it probably needs 
more generalization before it is useful to a wider audience.
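 
For instance, the wildcard piece is mostly just glob expansion on the source filesystem before the copy job is set up. A rough sketch (made-up class name, and it assumes a Hadoop version that provides FileSystem.globStatus):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GlobExpander {
      // Expand a glob like "/logs/2007-12-*/access.log" into the concrete
      // source paths to hand to the copy job.
      public static List<Path> expand(String globPattern, Configuration conf)
          throws Exception {
        Path glob = new Path(globPattern);
        FileSystem fs = glob.getFileSystem(conf);     // source filesystem
        List<Path> result = new ArrayList<Path>();
        FileStatus[] matches = fs.globStatus(glob);   // expand the wildcard
        if (matches != null) {
          for (FileStatus stat : matches) {
            result.add(stat.getPath());
          }
        }
        return result;
      }
    }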

________________________________

From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Thu 12/20/2007 7:12 PM
To: [email protected]
Subject: Re: DFS Block Allocation




Distcp is a map-reduce program where the maps read the files.  This means
that all of your tasknodes have to be able to read the files in question.

Many times it is easier to have a writer push the files at the cluster,
especially if you are reading data from a conventional unix file system.  It
would be a VERY bad idea to mount an NFS file system on an entire cluster.
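 
(For illustration only: the "push" approach Ted describes amounts to running the HDFS client on the machine that already holds the data, either with "bin/hadoop dfs -copyFromLocal" from the command line or programmatically. The paths below are placeholders.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PushToHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up the cluster's config files
        FileSystem fs = FileSystem.get(conf);
        // Push a local file straight into HDFS from the writer's machine.
        fs.copyFromLocalFile(new Path("/local/data/part-0001"),
                             new Path("/user/data/incoming/part-0001"));
      }
    }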


On 12/20/07 7:06 PM, "Rui Shi" <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I am confused a bit. What is the difference if I use "hadoop distcp" to upload
> files? I assume "hadoop distcp" uses multiple trackers to upload files in
> parallel.
>
> Thanks,
>
> Rui
>
> ----- Original Message ----
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 20, 2007 6:01:50 PM
> Subject: Re: DFS Block Allocation
>
>
>
>
>
> On 12/20/07 5:52 PM, "C G" <[EMAIL PROTECTED]> wrote:
>
>>   Ted, when you say "copy in the distro" do you need to include the
>> configuration files from the running grid?  You don't need to actually
>> start HDFS on this node do you?
>
> You are correct.  You only need the config files (and the hadoop script
> helps make things easier).
>
>>   If I'm following this approach correctly, I would want to have an
>> "xfer server" whose job it is to essentially run dfs -copyFromLocal on
>> all inbound-to-HDFS data. Once I'm certain that my data has copied
>> correctly, I can delete the local files on the xfer server.
>
> Yes.
>
>>   This is great news, as my current system wastes a lot of time copying
>> data from data acquisition servers to the master node. If I can copy to
>> HDFS directly from my acquisition servers then I am a happy guy....
>
> You are a happy guy.
>
> If your acquisition systems can see all of your datanodes.
>


