Distcp is a map-reduce program where the maps read the files. This means that all of your task nodes have to be able to read the files in question.
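For illustration, a distcp run looks roughly like the following (the host names, port, and paths here are made up, not taken from this thread). Each map task copies a slice of the file list, which is why every task node has to be able to reach the source:

    hadoop distcp hdfs://namenode-a:9000/user/feeds hdfs://namenode-b:9000/user/feeds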
Many times it is easier to have a writer push the files to the cluster, especially if you are reading data from a conventional unix file system. It would be a VERY bad idea to mount an NFS file system on an entire cluster.

On 12/20/07 7:06 PM, "Rui Shi" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am confused a bit. What is the difference if I use "hadoop distcp" to
> upload files? I assume "hadoop distcp" uses multiple trackers to upload
> files in parallel.
>
> Thanks,
>
> Rui
>
> ----- Original Message ----
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, December 20, 2007 6:01:50 PM
> Subject: Re: DFS Block Allocation
>
> On 12/20/07 5:52 PM, "C G" <[EMAIL PROTECTED]> wrote:
>
>> Ted, when you say "copy in the distro" do you need to include the
>> configuration files from the running grid? You don't need to actually
>> start HDFS on this node do you?
>
> You are correct. You only need the config files (and the hadoop script
> helps make things easier).
>
>> If I'm following this approach correctly, I would want to have an "xfer
>> server" whose job it is to essentially run dfs -copyFromLocal on all
>> inbound-to-HDFS data. Once I'm certain that my data has copied correctly,
>> I can delete the local files on the xfer server.
>
> Yes.
>
>> This is great news, as my current system wastes a lot of time copying
>> data from data acquisition servers to the master node. If I can copy to
>> HDFS directly from my acquisition servers then I am a happy guy....
>
> You are a happy guy.
>
> If your acquisition systems can see all of your datanodes.
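As a rough sketch of the push approach described above (and in the quoted thread), assuming the xfer/acquisition box has the hadoop distribution and the cluster's config files but runs no daemons of its own, and with made-up paths:

    # run on the xfer server; the client writes blocks directly to the datanodes
    hadoop dfs -copyFromLocal /local/staging/2007-12-20 /user/feeds/2007-12-20

    # sanity-check that the copy landed, then clean up the local staging area
    hadoop dfs -ls /user/feeds/2007-12-20
    rm -r /local/staging/2007-12-20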
