Maybe copy your HDFS config here and we can see why it took up 16 gigs of space.
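One hypothesis the config would confirm or rule out, as a quick sketch (the replication value of 8 is an assumption to be checked against hdfs-site.xml, not something stated in this thread):

```shell
# If dfs.replication were set to 8 (one copy per node), a 2 GB file would
# cost 2 x 8 = 16 GB of raw HDFS storage -- matching the dfsadmin -report
# number from the question below.
FILE_GB=2
REPLICATION=8   # hypothetical value, to be verified in the posted config
echo "$((FILE_GB * REPLICATION)) GB"   # -> 16 GB
```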

Cheers

Sent from my mobile. Please excuse the typos.

On 2010-04-10, at 3:22 PM, "Michael Segel" <[email protected]> wrote:



Mike,

First, you need to see what you set your block size to in Hadoop. By default it's 64 MB. With large files, you may want to bump that up to 128 MB per block.
A 2 GB file will then split into roughly 16 to 32 map tasks, one per block, depending on the block size.
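The block arithmetic behind that estimate, as a quick sketch (each HDFS block becomes roughly one map task):

```shell
# Number of blocks for a 2 GB file at the two block sizes mentioned above
FILE_MB=2048
echo $((FILE_MB / 64))    # 64 MB blocks  -> 32
echo $((FILE_MB / 128))   # 128 MB blocks -> 16
```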

I'd use hadoop fs -copyFromLocal <local file name> <hdfs file name>.

(OK, I'm going from memory on the Hadoop command, but you can always run hadoop fs -help to see the commands.)
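If the block size is bumped to 128 MB, it can also be set for a single copy via a generic -D option. A sketch, assuming a Hadoop 0.20-era property name and reusing the /a.dat and /data paths from the question below (needs a live cluster, so it is a CLI fragment rather than a runnable script):

```shell
# 128 MB in bytes: 128 * 1024 * 1024 = 134217728
# Copy the local file into HDFS with a 128 MB block size for this file only
hadoop fs -D dfs.block.size=134217728 -copyFromLocal /a.dat /data/a.dat
```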

Also, you need to see what you set for your replication factor. Usually it's 3.

Then your 2 GB file will take roughly 6 GB of raw storage and should be balanced across all of the nodes, with 2 or 3 blocks per machine.
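The storage and placement arithmetic, as a quick sketch (the 128 MB block size is an assumption from the suggestion above):

```shell
# Raw disk usage for a 2 GB file with the usual replication factor of 3
FILE_GB=2
REPLICATION=3
echo "$((FILE_GB * REPLICATION)) GB on disk"         # -> 6 GB

# Placement over the 8-node cluster, assuming 128 MB blocks:
BLOCKS=$((2048 / 128))                               # 16 blocks
echo "$((BLOCKS * REPLICATION / 8)) replicas/node"   # -> 6
```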

HTH

-Mike

Date: Sat, 10 Apr 2010 14:03:02 -0400
Subject: copying file into hdfs
From: [email protected]
To: [email protected]

Hi,

I'm Mike, a new user of Hadoop. Currently I have a cluster of 8 machines and a
file of size 2 gigs.
When I load it into HDFS using the command
hadoop dfs -put /a.dat /data
it actually loads it onto all the data nodes; dfsadmin -report shows HDFS usage
of 16 gigs, and it is taking 2 hours to load that data file.

With 1 node, my MapReduce operation on this data took 150 seconds.

When I run the same MapReduce operation on this cluster, it takes 220 seconds
for the same file.

Can someone please tell me how to distribute this file over 8 nodes, so that
each of them holds roughly 300 MB of the file and the MapReduce operation that
I have written runs in parallel? Isn't a Hadoop cluster supposed to work in
parallel?

best.

