Reduce task copy: very low speed

2008-12-27 Thread Genady
Hi,

 

I've built a Hadoop cluster from two computers (master and slave), using
Hadoop 0.18.2/HBase 0.18.1.

While running Map-Reduce jobs on 5-10 GB files I've noticed that the reduce-copy
tasks from master to slave are taking too much time (~30 minutes each), at a
speed of about 0.10 MB/s, despite the fact that the master is connected to the
slave via a 1 Gb switch, and I mapped the hosts in /etc/hosts using LAN
addresses (10.x.x.x).

 

My questions:

-  Is there a way to force Hadoop to use, for example, FTP to copy the
files?

-  Is there some hadoop-site.xml configuration to improve file-copy
performance?

I've tried copying files between the master and slave with FTP, and it
works at an average speed of 50 Mb/s.

 

From the reduce task list web page (only slave tasks):

reduce  copy (67 of 69 at 0.89 MB/s): task on master
reduce  copy (29 of 69 at 0.10 MB/s): task on slave

 

Thanks in advance for any help or direction to search,

 

Genady 

 

 



Re: Reduce task copy: very low speed

2008-12-27 Thread d0ng

I'm hitting the same problem, the copy is too slow :(





Threads per mapreduce job

2008-12-27 Thread Michael
Hi everyone:
How do I control the number of threads per MapReduce job? I am using
bin/hadoop jar wordcount to run jobs, and even though I have found these
settings in hadoop-default.xml and changed their values to 1:
<name>mapred.tasktracker.map.tasks.maximum</name>
<name>mapred.tasktracker.reduce.tasks.maximum</name>

The output of the job seems to indicate otherwise.
08/12/26 18:21:12 INFO mapred.JobClient:   Job Counters
08/12/26 18:21:12 INFO mapred.JobClient: Launched reduce tasks=1
08/12/26 18:21:12 INFO mapred.JobClient: Rack-local map tasks=12
08/12/26 18:21:12 INFO mapred.JobClient: Launched map tasks=17
08/12/26 18:21:12 INFO mapred.JobClient: Data-local map tasks=4

I have 2 servers running the mapreduce process and the datanode process.
Thanks,
Michael
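The two settings above would normally go into hadoop-site.xml as full property
blocks; a minimal sketch (the value of 1 is illustrative). Note that they cap
concurrent task *slots* per tasktracker node; they do not reduce the total
number of map tasks a job launches, which is driven by the input splits:

```xml
<!-- hadoop-site.xml: caps how many map/reduce tasks run concurrently
     on each tasktracker. A job with 17 map tasks still launches all 17;
     with these settings each node simply runs them one at a time. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```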



Re: Reduce task copy: very low speed

2008-12-27 Thread Brian Bockelman

Hey,

Are all the maps done?  If it's waiting for the last map to finish, it
can't copy the output of that last map, meaning the average copy speed
goes way down.

In other words, you are comparing the theoretical instantaneous copy
speed (1 Gbps) against the printed average speed, which includes the time
spent waiting for all the maps to finish.

Transferring via FTP would be pointless (after all, FTP is a single TCP
data stream and HTTP is ... a single TCP data stream), and there is
nothing in hadoop-site.xml to tweak, because the copy processes are
waiting for the source data to be created.  The best solution would be to
add many more nodes to the cluster :)


Brian
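For what it's worth, the shuffle copy stage does expose one tunable, the number
of parallel fetches per reduce task, though as explained above it cannot help
when the copiers are simply waiting for maps to finish. A hedged sketch for
hadoop-site.xml (the default in Hadoop 0.18 was 5):

```xml
<!-- Number of parallel transfers each reduce task runs during the
     copy (shuffle) phase. Raising it only helps when many finished
     map outputs are already available to fetch, not when the reduce
     is blocked waiting on the last maps to complete. -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>
</property>
```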

On Dec 27, 2008, at 8:10 AM, d0ng wrote:


I meet the same problem,the copy is too slow :(





Re: Threads per mapreduce job

2008-12-27 Thread Sagar Naik

mapred.map.multithreadedrunner.threads
is the property you are looking for
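A caveat worth verifying against the 0.18 source: this property is only read by
org.apache.hadoop.mapred.lib.MultithreadedMapRunner, which a job must opt into
via JobConf.setMapRunnerClass; the default MapRunner calls map() from a single
thread regardless. As a config-fragment sketch:

```xml
<!-- Threads per map task, read only by MultithreadedMapRunner
     (enabled with JobConf.setMapRunnerClass); ignored by the
     default single-threaded MapRunner. -->
<property>
  <name>mapred.map.multithreadedrunner.threads</name>
  <value>1</value>
</property>
```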


Michael wrote:








Re: Threads per mapreduce job

2008-12-27 Thread Michael
Thanks Sagar,
However, when I add this to my hadoop-site.xml it doesn't take effect:
<property>
  <name>mapred.map.multithreadedrunner.threads</name>
  <value>1</value>
</property>
I added it to both servers and here is the output of a test mapreduce run:
08/12/27 16:09:05 INFO mapred.JobClient:   Job Counters
08/12/27 16:09:05 INFO mapred.JobClient: Launched reduce tasks=1
08/12/27 16:09:05 INFO mapred.JobClient: Rack-local map tasks=16
08/12/27 16:09:05 INFO mapred.JobClient: Launched map tasks=16

Thanks,
Michael

Sagar Naik wrote:
 mapred.map.multithreadedrunner.threads
 is the property u r looking for
 
 


Re: Threads per mapreduce job

2008-12-27 Thread Michael Miceli
Oh ok. So the file had 17 chunks. Thank you very much.
--Michael

On Dec 27, 2008 10:28 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

Hey Michael,

I think you're misreading things.  There are indeed 17 launched map tasks,
which are run one at a time, not in parallel.

Brian

On Dec 27, 2008, at 9:24 AM, Michael wrote:  Hi everyone:  How do I
control the number of threa...


Re: issues with hadoop in AIX

2008-12-27 Thread Allen Wittenauer
On 12/27/08 12:18 AM, Arun Venugopal arunvenugopa...@gmail.com wrote:
 Yes, I was able to run this on AIX as well with a minor change to the
 DF.java code. But this was more of a proof of concept than on a
 production system.

There are lots of places where Hadoop (esp. in contrib) interprets the
output of Unix command line utilities. Changes like this are likely going to
be required for AIX and other Unix systems that aren't being used by a
committer. :(