Thanks for the response. I ended up using distcp, which I felt worked well and was quite straightforward. But as the source machine was part of the cluster, I did end up with a fairly high imbalance. Ted noted several ways of balancing the cluster using replication. Are there any plans to introduce automatic balancing, so that during idle time the namenode can balance out its nodes?
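For anyone else who hits the same imbalance, here is roughly how I read Ted's replication trick. Treat it as a sketch only: the file list is just the example paths from earlier in the thread, and the replication factor of 6, the batch size, and the 60-second wait are my own guesses rather than anything Ted prescribed.

  # hypothetical batch of files already in dfs; substitute your own paths
  FILES="/myfiles/dir1/file1 /myfiles/dir1/file2 /myfiles/dir2/file1 /myfiles/dir2/file2"

  for f in $FILES; do
      bin/hadoop dfs -setrep 6 $f     # raise replication well above the default of 3
  done
  sleep 60                            # give the datanodes time to spread the extra block copies
  for f in $FILES; do
      bin/hadoop dfs -setrep 3 $f     # drop back to the normal replication factor
  done

Pipelining several such batches (10-100 files at a time, as Ted describes) is what keeps the rebalancing throughput up.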
>You also have to watch out if you start writing from a host in your cluster
>else you will wind up with odd imbalances in file storage. In my case, the
>source of the data is actually outside of the cluster and I get pretty good
>balancing.

>If you do wind up with bad balancing, the best option I have seen is to
>increase the replication on individual files for 30-60 seconds and then
>decrease it again. In order to get sufficient throughput for the
>rebalancing, I pipeline lots of these changes so that I have 10-100 files at
>a time with higher replication. This does tend to substantially increase
>the number of files with excess replication, but that corrects itself pretty
>quickly.


----- Original Message ----
From: Ted Dunning <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 31, 2007 5:48:54 PM
Subject: Re: multiple file -put in dfs

This only handles the problem of putting lots of files. It doesn't deal
with putting files in parallel (at once).

This is a ticklish problem since even on a relatively small cluster, dfs
has a higher read speed than most storage can read. That means that you
can swamp things pretty easily.

When I have files on a single source machine, I just spawn multiple -put's
on sub-directories until I have sufficiently saturated the read speed of
the source.

If all of the cluster members have access to a universal file system, then
you can use the (undocumented) pdist command, but I don't like that as much.

You also have to watch out if you start writing from a host in your cluster
else you will wind up with odd imbalances in file storage. In my case, the
source of the data is actually outside of the cluster and I get pretty good
balancing.

If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again. In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication. This does tend to substantially increase
the number of files with excess replication, but that corrects itself pretty
quickly.


On 10/31/07 1:53 PM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

> hadoop dfs -put will take a directory. If it won't work recursively,
> then you can probably bang out a bash script that will handle it using
> find(1) and xargs(1).
>
> -- Aaron
>
> Chris Fellows wrote:
>> Hello!
>>
>> Quick simple question, hopefully someone out there could answer.
>>
>> Does the hadoop dfs support putting multiple files at once?
>>
>> The documentation says -put only works on one file. What's the best way to
>> import multiple files in multiple directories (i.e. dir1/file1 dir1/file2
>> dir2/file1 dir2/file2 etc)?
>>
>> End goal would be to do something like:
>>
>> bin/hadoop dfs -put /dir*/file* /myfiles
>>
>> And a follow-up: bin/hadoop dfs -lsr /myfiles
>> would list:
>>
>> /myfiles/dir1/file1
>> /myfiles/dir1/file2
>> /myfiles/dir2/file1
>> /myfiles/dir2/file2
>>
>> Thanks again for any input!!!
>>
>> - chris
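(Adding this below the quote for the archives: it is how I read Ted's suggestion about spawning multiple -put's on sub-directories. I went with distcp instead, so I have not run this myself; the directory names are just the ones from my original example and the degree of parallelism is arbitrary.)

  # one background -put per source sub-directory; add more until the read
  # speed of the source machine is saturated
  for d in /dir1 /dir2; do
      bin/hadoop dfs -put $d /myfiles &
  done
  wait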
