Thanks for the quick responses! Between these posts (distcp, dfs -cp and dfs -put) I should be able to figure it out.

----- Original Message ----
From: Ted Dunning <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 31, 2007 5:48:54 PM
Subject: Re: multiple file -put in dfs



This only handles the problem of putting lots of files.  It doesn't deal
with putting files in parallel (at once).

This is a ticklish problem since, even on a relatively small cluster, dfs can
absorb data faster than most source storage can read it.  That means that you
can swamp things pretty easily.

When I have files on a single source machine, I just spawn multiple -put's
on sub-directories until I have sufficiently saturated the read speed of the
source.  If all of the cluster members have access to a universal file
system, then you can use the (undocumented) pdist command, but I don't like
that as much.
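
For illustration, a rough sketch of that from the shell; the /data source
directory, the /imported target and the cap of four concurrent puts are all
made-up placeholders:

    # launch one -put per source sub-directory, a few at a time, so the
    # source disk stays busy without forking hundreds of processes
    for dir in /data/*/ ; do
        bin/hadoop dfs -put "$dir" /imported/"$(basename "$dir")" &
        # simple throttle: never more than 4 puts running at once
        while [ "$(jobs -r | wc -l)" -ge 4 ]; do sleep 1; done
    done
    wait    # let the last puts finish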

You also have to watch out if you start writing from a host in your cluster,
or else you will wind up with odd imbalances in file storage (the writing
node keeps a copy of every block it writes).  In my case, the source of the
data is actually outside of the cluster and I get pretty good balancing.

If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again.  In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication.  This does tend to substantially increase
the number of files with excess replication, but that corrects itself pretty
quickly.
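
As a sketch, with the replication factors, the batch of paths and the
60-second wait all placeholders to tune for your own cluster:

    # FILES is one batch of dfs paths to rebalance -- 10-100 at a time,
    # gathered however is convenient (e.g. from hadoop dfs -lsr output)
    FILES="/myfiles/dir1/file1 /myfiles/dir1/file2 /myfiles/dir2/file1"

    # raise replication (6 is arbitrary), give the cluster time to write
    # the extra copies, then drop back to the normal factor (3 here)
    for f in $FILES; do bin/hadoop dfs -setrep 6 "$f"; done
    sleep 60
    for f in $FILES; do bin/hadoop dfs -setrep 3 "$f"; done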


On 10/31/07 1:53 PM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

> hadoop dfs -put will take a directory. If it won't work recursively,
> then you can probably bang out a bash script that will handle it using
> find(1) and xargs(1).
> 
> -- Aaron
> 
> Chris Fellows wrote:
>> Hello!
>> 
>> Quick simple question, hopefully someone out there could answer.
>> 
>> Does the hadoop dfs support putting multiple files at once?
>> 
>> The documentation says -put only works on one file. What's the best way to
>> import multiple files in multiple directories (i.e. dir1/file1 dir1/file2
>> dir2/file1 dir2/file2 etc)?
>> 
>> End goal would be to do something like:
>> 
>>     bin/hadoop dfs -put /dir*/file* /myfiles
>> 
>> And a follow-up: bin/hadoop dfs -lsr /myfiles
>> would list:
>> 
>> /myfiles/dir1/file1
>> /myfiles/dir1/file2
>> /myfiles/dir2/file1
>> /myfiles/dir2/file2
>> 
>> Thanks again for any input!!!
>> 
>> - chris
>> 
>> 
>>   
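
A minimal sketch of the find/xargs script Aaron suggests above, assuming the
local files sit under a made-up /local/source and should end up under
/myfiles with the layout Chris describes:

    # recreate the local directory tree in dfs, then push each file into
    # its matching directory (one -put per file)
    cd /local/source
    find . -mindepth 1 -type d | sed 's|^\./||' | \
        xargs -I {} bin/hadoop dfs -mkdir /myfiles/{}
    find . -type f | sed 's|^\./||' | \
        xargs -I {} bin/hadoop dfs -put {} /myfiles/{}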



