I added multi-threading to the map phase of the LocalRunner. The code is in
the attached patch.
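As a rough illustration of the idea only (plain Python with a thread pool; the names here are made up and this is not the attached patch itself), a multi-threaded map phase hands each (key, value) record to a pool of worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def run_map_phase(records, map_fn, num_threads=8):
    """Run map_fn over (key, value) records with a pool of worker threads.

    Illustrative sketch only; the real LocalRunner change is in the
    attached patch.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(lambda kv: map_fn(kv[0], kv[1]), records))

def word_count_map(key, line):
    # Toy map function: emit a (word, 1) pair per word.
    return [(word, 1) for word in line.split()]
```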
What I also noticed during my experiments is that there is enough load to easily
fill 8 cores, even though my code should be IO-bound. I have the feeling that
the SequenceFile code or the framework wastes CPU.
When I read the Hadoop documentation:
The Hadoop Distributed File System: Architecture and Design
(http://lucene.apache.org/hadoop/hdfs_design.html)
a paragraph caught my attention:
“Moving Computation is Cheaper than Moving Data”
A computation requested by an application is much more
Well, I don't get it... when you pass arguments to a map job, you just
give a key and a value, so how can Hadoop make the link between those
arguments and the data concerned? Really, your answer doesn't help me at
all, sorry ^^
Devaraj Das wrote:
That's the paradigm of Hadoop's Map-Reduce.
Samuel,
Samuel LEMOINE wrote:
Well, I don't get it... when you pass arguments to a map job, you just
give a key and a value, so how can Hadoop make the link between those
arguments and the data concerned? Really, your answer doesn't help me at
all, sorry ^^
The input of a map-reduce job is a
Thorsten Schuett wrote:
During the copy phase of reduce, the CPU load was very low and vmstat showed
constant reads from the disk at ~15MB/s and bursty writes. At the same time,
data was sent over the loopback device at ~15MB/s. I don't see what else
could limit the performance here. The disk
On Thursday 23 August 2007, Doug Cutting wrote:
Thorsten Schuett wrote:
During the copy phase of reduce, the CPU load was very low and vmstat
showed constant reads from the disk at ~15MB/s and bursty writes. At the
same time, data was sent over the loopback device at ~15MB/s. I don't see
Hi all,
We just moved to the 0.14.0 distribution of hadoop. Until now, we were
running the 0.10.1 one.
Important point: the client submitting jobs is on a totally different
machine from the master and the slaves, and it also runs as a totally
different user.
The main problem is the parameter
Thanks so much, it helps me a lot. I'm actually quite lost with Hadoop's
mechanisms.
The point of my study is to distribute the Lucene searching phase with
Hadoop...
From what I've understood, a way to distribute the search over a
big Lucene index would be to put the index on HDFS,
Thorsten Schuett wrote:
On Wednesday 22 August 2007, Doug Cutting wrote:
Thorsten Schuett wrote:
In my case, it looks as if the loopback device is the bottleneck. So
increasing the number of tasks won't help.
Hmm. I have trouble believing that the loopback device is actually the
bottleneck.
Completely agree. We are seeing the same pattern - we need a series of
map-reduce jobs for most tasks. There are a few different alternatives
that may help:
1. The output of the intermediate reduce phases can be written to files
that are not replicated. Not sure whether we can do this through
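As a sketch of the idea only (assuming the standard dfs.replication property applies to these writes, which I have not verified against the 0.14 API), a per-job override might look like:

```xml
<!-- hypothetical override: keep intermediate output at a single replica -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```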
Actually, the inputs to the map function are a key and value pair.
The input to the map JOB, however, is a slice of a file from which inputs
for the map function will be taken. Since the manager that creates the map
job knows where the file slice is stored, it can start the job there.
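A toy sketch of that distinction (plain Python with made-up names, nothing from the Hadoop API): the job is handed whole slices of the input file, and the framework, not the user, turns each slice into the (key, value) pairs the map function sees.

```python
def make_splits(lines, lines_per_split):
    """Cut a file's lines into fixed-size slices, like Hadoop's input splits."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]

def run_map_task(split, first_line_no, map_fn):
    """One map task: walk the split record by record and feed each record
    to the user's map function as a (key, value) pair -- here the key is
    the line number and the value is the line text."""
    output = []
    for offset, line in enumerate(split):
        output.extend(map_fn(first_line_no + offset, line))
    return output
```

So the user's map function never chooses which data it gets; the framework schedules each task on (or near) the node holding its slice and supplies the pairs.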
On
On Aug 23, 2007, at 7:58 AM, Thomas Friol wrote:
Important point: the client submitting jobs is on a totally different
machine from the master and the slaves, and it also runs as a totally
different user.
The main problem is the parameter 'hadoop.tmp.dir', whose default value
is
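For illustration, an override in conf/hadoop-site.xml might look like this (the path below is a made-up example; pick a directory the submitting user can write on every node):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path only; must be writable by the submitting user -->
  <value>/scratch/hadoop-${user.name}</value>
</property>
```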
Hi All:
I tried 0.14.0 today with limited success. 0.13.0 was doing pretty well, but
I'm not able to get as far with 0.14.0.
My environment is a single-node, 4-way box with 8G memory and 500G disk space.
First up is an out-of-memory error. The dataset is 1,000,000 rows (but only
60M in
Can you try to increase the Java heap for the task JVMs? The
mapred.child.java.opts property in conf/hadoop-site.xml defaults to
-Xmx200m.
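For example, in conf/hadoop-site.xml (512m is the value reported to work later in this thread):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```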
Regarding the second problem:
It is surprising that this fails repeatedly around the same place. 0.14
does check the checksum at the datanode (0.13 did not do this check). I
will try to reproduce this.
Raghu.
C G wrote:
Hi All:
Second issue is a failure on copyFromLocal with lost
On a related note, please don't use 0.13.0; use the latest released
version for 0.13 (I think it is 0.13.1). If the secondary namenode
actually works, then it will result in all the replications being set to 1.
Raghu.
Joydeep Sen Sarma wrote:
Hi folks,
Would be grateful if someone can help
Further experimentation, again on a single-node configuration on a 4-way 8G
machine w/0.14.0: trying to copyFromLocal 669M of data in 5,000,000 rows, I see
this in the namenode log:
the namenode log:
2007-08-24 00:50:45,902 WARN org.apache.hadoop.dfs.StateChange: DIR*
NameSystem.completeFile: failed to complete
Thanks Christophe, I kicked these values up to 512m and the case which
previously failed runs to completion with verifiable results. Good stuff...
Christophe Taton [EMAIL PROTECTED] wrote:
Can you try to increase the Java heap for the task JVMs? The
mapred.child.java.opts property in