Maybe copy your HDFS config (hdfs-site.xml) here and we can see why it took
up 16 gigs of space.
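The two settings that matter most here are dfs.replication and
dfs.block.size. As a rough sanity check (just arithmetic, not a diagnosis):
if dfs.replication were set to 8, a 2 gig file would occupy about 16 gigs of
raw HDFS space. In conf/hdfs-site.xml you'd normally expect something like:

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>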
Cheers
Sent from my mobile. Please excuse the typos.
On 2010-04-10, at 3:22 PM, "Michael Segel" <[email protected]>
wrote:
Mike,
First, you need to see what you set your block size to in Hadoop. By
default it's 64 MB. With large files, you may want to bump that up to
128 MB per block.
A 2 GB file will then give you roughly 16 map tasks at 128 MB blocks (about
32 at the default 64 MB).
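If you do bump the block size, a minimal sketch of the hdfs-site.xml entry
would look like the following (assuming the 0.20-era property name
dfs.block.size; newer releases call it dfs.blocksize). Note it only affects
files written after the change:

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>   <!-- 128 MB, in bytes -->
  </property>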
I'd use hadoop fs -copyFromLocal <local file name> <hdfs file name>.
(Ok, I'm going from memory on the hadoop command, but you can always
run hadoop fs -help to see the exact usage.)
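For example (the paths here are just placeholders, substitute your own):

  hadoop fs -copyFromLocal /home/mike/a.dat /data/a.dat
  hadoop fs -help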
Also, you need to see what you set for your replication factor.
Usually it's 3.
Then your 2 GB file will be roughly 6 GB in size on disk and should be
balanced across all of the nodes, with 2 or 3 of its blocks per machine
(plus replicas).
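Once the copy finishes, you can check how the blocks and replicas actually
landed with fsck (the path below is just an example):

  hadoop fsck /data/a.dat -files -blocks -locations

and if the replication factor came out higher than you wanted, something
like

  hadoop fs -setrep -w 3 /data/a.dat

should bring it back down to 3 for that file.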
HTH
-Mike
Date: Sat, 10 Apr 2010 14:03:02 -0400
Subject: copying file into hdfs
From: [email protected]
To: [email protected]
Hi,
I'm Mike.
I am a new user of Hadoop. Currently, I have a cluster of 8 machines and a
file of size 2 gigs.
When I load it into HDFS using the command
hadoop dfs -put /a.dat /data
it actually loads it onto all data nodes. dfsadmin -report shows HDFS
usage of 16 gigs, and it takes 2 hours to load that data file.
With 1 node, my MapReduce operation on this data took 150 seconds.
When I run the same operation on this cluster, it takes 220 seconds for
the same file.
Can someone please tell me how to distribute this file over the 8 nodes,
so that each of them holds roughly 300 MB of the file and the MapReduce
operation that I have written runs in parallel? Isn't a Hadoop cluster
supposed to work in parallel?
Best.