Re: Reduce Performance

2007-08-23 Thread Thorsten Schuett
I added multi-threading to the map phase of the LocalRunner; the code is in the attached patch. What I also noticed during my experiments is that I have enough load to easily fill 8 cores, even though my code should be I/O-bound. I have the feeling that the SequenceFile or the framework wastes CPU
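The patch itself is not reproduced in this digest. As a rough, hypothetical sketch of the approach (class and method names are illustrative, not taken from the patch), the map tasks of a local job can be fanned out over a fixed-size thread pool:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not the attached patch: run a local job's map
// tasks on a fixed-size thread pool instead of one after another.
public class ParallelLocalMapRunner {

    public void runMaps(List<Runnable> mapTasks, int numThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (Runnable task : mapTasks) {
            pool.execute(task);   // each map task runs on a pool thread
        }
        pool.shutdown();          // no more tasks will be submitted
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    }
}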

“Moving Computation is Cheaper than Moving Data”

2007-08-23 Thread Samuel LEMOINE
When I read the Hadoop documentation, The Hadoop Distributed File System: Architecture and Design (http://lucene.apache.org/hadoop/hdfs_design.html), a paragraph held my attention: “Moving Computation is Cheaper than Moving Data”. A computation requested by an application is much more

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Samuel LEMOINE
Well, I don't get it... when you pass arguments to a map job, you just give a key and a value, so how can Hadoop make the link between those arguments and the data concerned? Really, your answer doesn't help me at all, sorry ^^ Devaraj Das wrote: That's the paradigm of Hadoop's Map-Reduce.

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Arun C Murthy
Samuel, Samuel LEMOINE wrote: Well, I don't get it... when you pass arguments to a map job, you just give a key and a value, so how can Hadoop make the link between those arguments and the data concerned? Really, your answer doesn't help me at all, sorry ^^ The input of a map-reduce job is a

Re: Reduce Performance

2007-08-23 Thread Doug Cutting
Thorsten Schuett wrote: During the copy phase of reduce, the CPU load was very low and vmstat showed constant reads from the disk at ~15 MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15 MB/s. I don't see what else could limit the performance here. The disk

Re: Reduce Performance

2007-08-23 Thread Thorsten Schuett
On Thursday 23 August 2007, Doug Cutting wrote: Thorsten Schuett wrote: During the copy phase of reduce, the CPU load was very low and vmstat showed constant reads from the disk at ~15 MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15 MB/s. I don't see

Problem submitting a job with hadoop 0.14.0

2007-08-23 Thread Thomas Friol
Hi all, We just moved to the 0.14.0 distribution of Hadoop. Until now, we were running the 0.10.1 one. Important point: the client submitting jobs is on a totally different machine from the master and the slaves, and it also runs as a totally different user. The main problem is the parameter

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Samuel LEMOINE
Thanks so much, it helps me a lot. I'm actually quite lost in Hadoop's mechanisms. The point of my study is to distribute the Lucene searching phase with Hadoop... From what I've understood, a way to distribute the search over a big Lucene index would be to put this index on HDFS,

Re: Reduce Performance

2007-08-23 Thread Raghu Angadi
Thorsten Schuett wrote: On Wednesday 22 August 2007, Doug Cutting wrote: Thorsten Schuett wrote: In my case, it looks as if the loopback device is the bottleneck. So increasing the number of tasks won't help. Hmm. I have trouble believing that the loopback device is actually the bottleneck.

RE: Poly-reduce?

2007-08-23 Thread Joydeep Sen Sarma
Completely agree. We are seeing the same pattern: we need a series of map-reduce jobs for most tasks. There are a few different alternatives that may help: 1. The output of the intermediate reduce phases can be written to files that are not replicated. Not sure whether we can do this through
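A hedged sketch of alternative 1, using the 0.14-era JobConf API as best as it can be reconstructed: two chained jobs, with the intermediate output requested at replication 1. Whether setting dfs.replication per job actually takes effect this way is precisely the open question above; paths, job names, and the identity mapper/reducer defaults are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassJob {
    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Pass 1: ask for replication 1 on the intermediate files, since
        // they only need to survive until pass 2 has read them.
        JobConf first = new JobConf(TwoPassJob.class);
        first.setJobName("pass-1");
        first.setInputPath(input);
        first.setOutputPath(intermediate);
        first.setInt("dfs.replication", 1);
        JobClient.runJob(first);

        // Pass 2: read the unreplicated intermediate data and write the
        // final output at the cluster's normal replication.
        JobConf second = new JobConf(TwoPassJob.class);
        second.setJobName("pass-2");
        second.setInputPath(intermediate);
        second.setOutputPath(output);
        JobClient.runJob(second);
    }
}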

Re: Moving Computation is Cheaper than Moving Data

2007-08-23 Thread Ted Dunning
Actually, the inputs to the map function are a key and a value pair. The input to the map JOB, however, is a slice of a file from which inputs for the map function will be taken. Since the manager that creates the map job knows where the file slice is stored, it can start the job there. On
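Ted's distinction answers Samuel's question: the user never supplies the key and value; an InputFormat turns each file slice into a stream of (key, value) records and the framework feeds them to the map function one by one. A minimal sketch modeled on the classic WordCount mapper, written against the later generic form of the old mapred API (exact signatures shifted between early releases):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For text input the framework hands each map() call the byte offset of
// a line (key) and the line itself (value); the mapper never sees files.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);  // emit (word, 1) for the reducers
        }
    }
}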

Re: Problem submitting a job with hadoop 0.14.0

2007-08-23 Thread Owen O'Malley
On Aug 23, 2007, at 7:58 AM, Thomas Friol wrote: Important point: the client submitting jobs is on a totally different machine from the master and the slaves, and it also runs as a totally different user. The main problem is the parameter 'hadoop.tmp.dir', whose default value is
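One hypothetical client-side workaround for this kind of mismatch is to override hadoop.tmp.dir in the submitting client's configuration so that it points at a directory the submitting user actually owns; the path below is purely illustrative:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithOwnTmpDir {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Point the client-side scratch space at a directory this user
        // owns (illustrative path; pick one on the client machine).
        conf.set("hadoop.tmp.dir", "/home/submitter/hadoop-tmp");
        // ... set input/output paths, mapper, reducer, etc., then submit:
        JobClient.runJob(conf);
    }
}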

Issues with 0.14.0...

2007-08-23 Thread C G
Hi All: I tried 0.14.0 today with limited success. 0.13.0 was doing pretty well, but I'm not able to get as far with 0.14.0. My environment is a single-node, 4-way box with 8 GB of memory and 500 GB of disk space. First up is an out-of-memory error. The dataset is 1,000,000 rows (but only 60M in

Re: Issues with 0.14.0...

2007-08-23 Thread Christophe Taton
Can you try increasing the Java heap for the task JVMs? The mapred.child.java.opts property in conf/hadoop-site.xml defaults to -Xmx200m.
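The same suggestion in per-job Java form (the property can equally be set cluster-wide in conf/hadoop-site.xml); the 512m value matches what C G later reports success with:

import org.apache.hadoop.mapred.JobConf;

public class RaiseTaskHeap {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Raise the child (task) JVM heap from the -Xmx200m default.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}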

Re: Issues with 0.14.0...

2007-08-23 Thread Raghu Angadi
Regarding the second problem: it is surprising that this fails repeatedly around the same place. 0.14 does check the checksum at the datanode (0.13 did not do this check). I will try to reproduce this. Raghu. C G wrote: Hi All: The second issue is a failure on copyFromLocal with lost

Re: secondary namenode errors

2007-08-23 Thread Raghu Angadi
On a related note, please don't use 0.13.0; use the latest released version for 0.13 (I think it is 0.13.1). If the secondary namenode actually works, it will result in all the replications being set to 1. Raghu. Joydeep Sen Sarma wrote: Hi folks, I would be grateful if someone can help

Re: Issues with 0.14.0...

2007-08-23 Thread C G
Further experimentation, again on a single-node configuration on a 4-way, 8 GB machine with 0.14.0: trying to copyFromLocal 669 MB of data in 5,000,000 rows, I see this in the namenode log: 2007-08-24 00:50:45,902 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.completeFile: failed to complete

Re: Issues with 0.14.0...

2007-08-23 Thread C G
Thanks Christophe, I kicked these values up to 512m and the case that previously failed now runs to completion with verifiable results. Good stuff... Christophe Taton wrote: Can you try increasing the Java heap for the task JVMs? The mapred.child.java.opts property in