Re: Reduce Performance

2007-08-23 Thread Thorsten Schuett
I added multi-threading to the map phase of the LocalRunner; the code is in the attached patch. What I also noticed during my experiments is that I have enough load to easily fill 8 cores, even though my code should be I/O-bound. I have the feeling that the SequenceFile or the framework wastes CPU
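The patch itself is not reproduced in this digest. As a rough, hypothetical sketch of the approach (class and method names are illustrative, not taken from the patch), the map tasks of a local job can be fanned out over a fixed-size thread pool:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the attached patch: run a local job's map
// tasks on a fixed-size thread pool instead of one after another.
public class ParallelLocalMapRunner {

    public void runMaps(List<Runnable> mapTasks, int numThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (Runnable task : mapTasks) {
            pool.execute(task);   // each map task runs on a pool thread
        }
        pool.shutdown();          // no more tasks will be submitted
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}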

“Moving Computation is Cheaper than Moving Data”

2007-08-23 Thread Samuel LEMOINE
When I read the Hadoop documentation, The Hadoop Distributed File System: Architecture and Design (http://lucene.apache.org/hadoop/hdfs_design.html), a paragraph held my attention: “Moving Computation is Cheaper than Moving Data”. A computation requested by an application is much more

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Samuel LEMOINE
Well, I don't get it... when you pass arguments to a map job, you just give a key and a value, so how can Hadoop make the link between those arguments and the data concerned? Really, your answer doesn't help me at all, sorry ^^ Devaraj Das wrote: That's the paradigm of Hadoop's Map-Reduce.

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Arun C Murthy
Samuel, Samuel LEMOINE wrote: Well, I don't get it... when you pass arguments to a map job, you just give a key and a value, so how can Hadoop make the link between those arguments and the data concerned? Really, your answer doesn't help me at all, sorry ^^ The input of a map-reduce job is a

Re: Reduce Performance

2007-08-23 Thread Doug Cutting
Thorsten Schuett wrote: During the copy phase of reduce, the CPU load was very low and vmstat showed constant reads from the disk at ~15 MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15 MB/s. I don't see what else could limit the performance here. The disk

Re: Reduce Performance

2007-08-23 Thread Thorsten Schuett
On Thursday 23 August 2007, Doug Cutting wrote: Thorsten Schuett wrote: During the copy phase of reduce, the CPU load was very low and vmstat showed constant reads from the disk at ~15 MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15 MB/s. I don't see

Problem submitting a job with hadoop 0.14.0

2007-08-23 Thread Thomas Friol
Hi all, We just moved to the 0.14.0 distribution of Hadoop. Until now, we were running the 0.10.1 one. Important point: the client submitting jobs is on a totally different machine from the master and the slaves, and it also runs as a totally different user. The main problem is the parameter

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Samuel LEMOINE
Thanks so much, it helps me a lot. I'm actually quite lost in Hadoop's mechanisms. The point of my study is to distribute the Lucene searching phase with Hadoop... From what I've understood, a way to distribute the search over a big Lucene index would be to put this index on HDFS,

Re: Reduce Performance

2007-08-23 Thread Raghu Angadi
Thorsten Schuett wrote: On Wednesday 22 August 2007, Doug Cutting wrote: Thorsten Schuett wrote: In my case, it looks as if the loopback device is the bottleneck. So increasing the number of tasks won't help. Hmm. I have trouble believing that the loopback device is actually the bottleneck.

RE: Poly-reduce?

2007-08-23 Thread Joydeep Sen Sarma
Completely agree. We are seeing the same pattern: we need a series of map-reduce jobs for most tasks. There are a few different alternatives that may help: 1. The output of the intermediate reduce phases can be written to files that are not replicated. Not sure whether we can do this through
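A hedged sketch of alternative 1, using the 0.14-era JobConf API as best as it can be reconstructed: two chained jobs, with the intermediate output requested at replication 1. Whether setting dfs.replication per job actually takes effect this way is precisely the open question above; paths, job names, and the identity mapper/reducer defaults are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassJob {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Pass 1: ask for replication 1 on the intermediate files, since
        // they only need to survive until pass 2 has read them.
        JobConf first = new JobConf(TwoPassJob.class);
        first.setJobName("pass-1");
        first.setInputPath(input);
        first.setOutputPath(intermediate);
        first.setInt("dfs.replication", 1);
        JobClient.runJob(first);

        // Pass 2: read the unreplicated intermediate data and write the
        // final output at the cluster's normal replication.
        JobConf second = new JobConf(TwoPassJob.class);
        second.setJobName("pass-2");
        second.setInputPath(intermediate);
        second.setOutputPath(output);
        JobClient.runJob(second);
    }
}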

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Ted Dunning
Actually, the inputs to the map function are a key and a value pair. The input to the map JOB, however, is a slice of a file from which inputs for the map function will be taken. Since the manager that creates the map job knows where the file slice is stored, it can start the job there. On
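Ted's distinction answers Samuel's question: the user never supplies the key and value; an InputFormat turns each file slice into a stream of (key, value) records and the framework feeds them to the map function one by one. A minimal sketch modeled on the classic WordCount mapper, written against the later generic form of the old mapred API (exact signatures shifted between early releases):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For text input the framework hands each map() call the byte offset of
// a line (key) and the line itself (value); the mapper never sees files.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);  // emit (word, 1) for the reducers
        }
    }
}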

Re: Problem submitting a job with hadoop 0.14.0

2007-08-23 Thread Owen O'Malley
On Aug 23, 2007, at 7:58 AM, Thomas Friol wrote: Important point: the client submitting jobs is on a totally different machine from the master and the slaves, and it also runs as a totally different user. The main problem is the parameter 'hadoop.tmp.dir', whose default value is
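One hypothetical client-side workaround for this kind of mismatch is to override hadoop.tmp.dir in the submitting client's configuration so that it points at a directory the submitting user actually owns; the path below is purely illustrative:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithOwnTmpDir {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Point the client-side scratch space at a directory this user
        // owns (illustrative path; pick one on the client machine).
        conf.set("hadoop.tmp.dir", "/home/submitter/hadoop-tmp");
        // ... set input/output paths, mapper, reducer, etc., then submit:
        JobClient.runJob(conf);
    }
}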

Issues with 0.14.0...

2007-08-23 Thread C G
Hi All: I tried 0.14.0 today with limited success. 0.13.0 was doing pretty well, but I'm not able to get as far with 0.14.0. My environment is a single-node, 4-way box with 8 GB of memory and 500 GB of disk space. First up is an out-of-memory error. The dataset is 1,000,000 rows (but only 60M in

Re: Issues with 0.14.0...

2007-08-23 Thread Christophe Taton
Can you try increasing the Java heap for the task JVMs? The mapred.child.java.opts property in conf/hadoop-site.xml defaults to -Xmx200m.
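The same suggestion in per-job Java form (the property can equally be set cluster-wide in conf/hadoop-site.xml); the 512m value matches what C G later reports success with:

import org.apache.hadoop.mapred.JobConf;

public class RaiseTaskHeap {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Raise the child (task) JVM heap from the -Xmx200m default.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}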

Re: Issues with 0.14.0...

2007-08-23 Thread Raghu Angadi
Regarding the second problem: it is surprising that this fails repeatedly around the same place. 0.14 does check the checksum at the datanode (0.13 did not do this check). I will try to reproduce this. Raghu. C G wrote: Hi All: The second issue is a failure on copyFromLocal with lost

Re: secondary namenode errors

2007-08-23 Thread Raghu Angadi
On a related note, please don't use 0.13.0; use the latest released version for 0.13 (I think it is 0.13.1). If the secondary namenode actually works, it will result in all the replications being set to 1. Raghu. Joydeep Sen Sarma wrote: Hi folks, I would be grateful if someone can help

Re: Issues with 0.14.0...

2007-08-23 Thread C G
Further experimentation, again on a single-node configuration on a 4-way, 8 GB machine with 0.14.0: trying to copyFromLocal 669 MB of data in 5,000,000 rows, I see this in the namenode log: 2007-08-24 00:50:45,902 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.completeFile: failed to complete

Re: Issues with 0.14.0...

2007-08-23 Thread C G
Thanks Christophe, I kicked these values up to 512m and the case that previously failed now runs to completion with verifiable results. Good stuff... Christophe Taton wrote: Can you try increasing the Java heap for the task JVMs? The mapred.child.java.opts property in