Thanks for the response. I ended up using distcp, which I felt worked well and was quite straightforward. But as the source machine was part of the cluster, I did end up with a fairly high imbalance. Ted noted several ways of balancing the cluster using replication. Are there any plans to introduce automatic balancing, so that during idle time the namenode can balance out its nodes?
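For anyone else who hits the same imbalance, here is roughly how I read Ted's replication trick. Treat it as a sketch only: the file list is just the example paths from earlier in the thread, and the replication factor of 6, the batch size, and the 60-second wait are my own guesses rather than anything Ted prescribed.

  # hypothetical batch of files already in dfs; substitute your own paths
  FILES="/myfiles/dir1/file1 /myfiles/dir1/file2 /myfiles/dir2/file1 /myfiles/dir2/file2"

  for f in $FILES; do
      bin/hadoop dfs -setrep 6 $f     # raise replication well above the default of 3
  done
  sleep 60                            # give the datanodes time to spread the extra block copies
  for f in $FILES; do
      bin/hadoop dfs -setrep 3 $f     # drop back to the normal replication factor
  done

Pipelining several such batches (10-100 files at a time, as Ted describes) is what keeps the rebalancing throughput up.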
>You also have to watch out if you start writing from a host in your cluster
>else you will wind up with odd imbalances in file storage. In my case, the
>source of the data is actually outside of the cluster and I get pretty good
>balancing.

>If you do wind up with bad balancing, the best option I have seen is to
>increase the replication on individual files for 30-60 seconds and then
>decrease it again. In order to get sufficient throughput for the
>rebalancing, I pipeline lots of these changes so that I have 10-100 files at
>a time with higher replication. This does tend to substantially increase
>the number of files with excess replication, but that corrects itself pretty
>quickly.


----- Original Message ----
From: Ted Dunning <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, October 31, 2007 5:48:54 PM
Subject: Re: multiple file -put in dfs

This only handles the problem of putting lots of files. It doesn't deal
with putting files in parallel (at once).

This is a ticklish problem since even on a relatively small cluster, dfs
has a higher read speed than most storage can read. That means that you
can swamp things pretty easily.

When I have files on a single source machine, I just spawn multiple -put's
on sub-directories until I have sufficiently saturated the read speed of
the source.

If all of the cluster members have access to a universal file system, then
you can use the (undocumented) pdist command, but I don't like that as much.

You also have to watch out if you start writing from a host in your cluster
else you will wind up with odd imbalances in file storage. In my case, the
source of the data is actually outside of the cluster and I get pretty good
balancing.

If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again. In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication. This does tend to substantially increase
the number of files with excess replication, but that corrects itself pretty
quickly.


On 10/31/07 1:53 PM, "Aaron Kimball" <[EMAIL PROTECTED]> wrote:

> hadoop dfs -put will take a directory. If it won't work recursively,
> then you can probably bang out a bash script that will handle it using
> find(1) and xargs(1).
>
> -- Aaron
>
> Chris Fellows wrote:
>> Hello!
>>
>> Quick simple question, hopefully someone out there could answer.
>>
>> Does the hadoop dfs support putting multiple files at once?
>>
>> The documentation says -put only works on one file. What's the best way to
>> import multiple files in multiple directories (i.e. dir1/file1 dir1/file2
>> dir2/file1 dir2/file2 etc)?
>>
>> End goal would be to do something like:
>>
>> bin/hadoop dfs -put /dir*/file* /myfiles
>>
>> And a follow-up: bin/hadoop dfs -lsr /myfiles
>> would list:
>>
>> /myfiles/dir1/file1
>> /myfiles/dir1/file2
>> /myfiles/dir2/file1
>> /myfiles/dir2/file2
>>
>> Thanks again for any input!!!
>>
>> - chris
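(Adding this below the quote for the archives: it is how I read Ted's suggestion about spawning multiple -put's on sub-directories. I went with distcp instead, so I have not run this myself; the directory names are just the ones from my original example and the degree of parallelism is arbitrary.)

  # one background -put per source sub-directory; add more until the read
  # speed of the source machine is saturated
  for d in /dir1 /dir2; do
      bin/hadoop dfs -put $d /myfiles &
  done
  wait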
