Mike,
First, check what you set your block size to in Hadoop. By default it's 64 MB. With large files, you may want to bump that up to 128 MB per block. A 2 GB file at 128 MB per block gives you roughly 16 blocks, i.e. roughly 16 map tasks. I'd use:

  hadoop fs -copyFromLocal <local file name> <hdfs file name>

(OK, I'm going from memory on the Hadoop command, but you can always run "hadoop fs -help" to see the commands.)

Also check what you set for your replication factor. Usually it's 3. Then your 2 GB file will take roughly 6 GB of HDFS space, and the blocks should be balanced across all of the nodes, roughly 2 blocks per machine on your 8 nodes (about 6 block replicas each once replication is counted).

HTH
-Mike

> Date: Sat, 10 Apr 2010 14:03:02 -0400
> Subject: copying file into hdfs
> From: [email protected]
> To: [email protected]
>
> Hi,
>
> I'm Mike,
> I am a new user of Hadoop. Currently, I have a cluster of 8 machines and a
> file of size 2 GB.
> When I load it into HDFS using the command
>   hadoop dfs -put /a.dat /data
> it actually loads it on all data nodes. dfsadmin -report shows HDFS usage of
> 16 GB, and it is taking 2 hours to load that data file.
>
> With 1 node, my MapReduce operation on this data took 150 seconds.
>
> So when I use my MapReduce operation on this cluster, it is taking 220
> seconds for the same file.
>
> Can someone please tell me how to distribute this file over 8 nodes, so
> that each of them will have roughly 300 MB of file chunks, and the MapReduce
> operation that I have written will work in parallel? Isn't a Hadoop cluster
> supposed to be working in parallel?
>
> Best.
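P.S. The back-of-the-envelope math above can be sketched in the shell. This is just arithmetic under assumed settings (a 2 GB file, 128 MB blocks, replication factor 3, 8 nodes); plug in whatever your cluster actually uses:

```shell
#!/bin/sh
# Rough HDFS sizing math -- assumed values, adjust to your config.
FILE_MB=2048     # 2 GB file
BLOCK_MB=128     # block size bumped to 128 MB
REPL=3           # replication factor (Hadoop default)
NODES=8          # machines in the cluster

BLOCKS=$((FILE_MB / BLOCK_MB))                 # unique blocks = map tasks
TOTAL_MB=$((FILE_MB * REPL))                   # raw HDFS space consumed
REPLICAS_PER_NODE=$((BLOCKS * REPL / NODES))   # block replicas per machine

echo "blocks=$BLOCKS total_mb=$TOTAL_MB replicas_per_node=$REPLICAS_PER_NODE"
```

That prints blocks=16 total_mb=6144 replicas_per_node=6, which is why a 2 GB file ends up as about 6 GB of HDFS usage with replication 3.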
