Re: Hadoop streaming cacheArchive
Norbert Burger wrote:
> I'm trying to use the cacheArchive command-line option with
> hadoop-0.15.3-streaming.jar. I'm using the option as follows:
>
>     -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>
> Unfortunately, my Perl scripts fail with an error consistent with not
> being able to find the 'lib' directory (which, as I understand it, should
> point back to an extracted version of lib.jar).

Here, lib is created as a symlink in the task's working directory. It will contain the jar file and the extracted contents of the jar. Where are your Perl scripts searching for lib? Is '.' included in your classpath? Otherwise you can use the mapred.job.classpath.archives config item, which adds the files to the classpath as well as to the distributed cache:

    -jobconf mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib

> I know that the original JAR exists in HDFS, but I don't see any evidence
> of lib.jar or a link called 'lib' inside my job.jar.

The link 'lib' will not be part of job.jar. The archive is distributed to all the nodes at task launch, and the task's current working directory will have the link 'lib' pointing to the jar in the cache.

> How can I troubleshoot cacheArchive further? Should the files/dirs
> specified via cacheArchive be contained inside the job.jar? If not, where
> should they be in HDFS?

They can be anywhere on HDFS. You need to give the complete path to add them to the cache.

> Thanks for any help. Norbert
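For anyone driving jobs from Java rather than the streaming jar, this is a rough equivalent of -cacheArchive using the DistributedCache API of that era (a hedged sketch; check the method names against your release):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheArchiveExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            // Materialize the "#lib" fragment as a symlink in each task's
            // working directory.
            DistributedCache.createSymlink(conf);
            DistributedCache.addCacheArchive(
                new URI("hdfs://host:50001/user/root/lib.jar#lib"), conf);
            // Each task then sees ./lib, holding the extracted contents
            // of lib.jar.
        }
    }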
Re: Trash option in hadoop-site.xml configuration.
Thank you for the clarification. Here is another question: if two different clients move files to trash with different intervals (e.g. client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval = 120), what happens? Does the namenode keep track of all this information?

/Taeho

On 3/20/08, dhruba Borthakur [EMAIL PROTECTED] wrote:
> The trash feature is a client-side option and depends on the client
> configuration file. If the client's configuration specifies that Trash is
> enabled, then the HDFS client invokes a rename to Trash instead of a
> delete. Now, if Trash is enabled on the Namenode, then the Namenode
> periodically removes contents from the Trash directory.
>
> This design might be confusing to some users, but it provides the
> flexibility that different clients in the cluster can have Trash either
> enabled or disabled.
>
> Thanks,
> dhruba
>
> -----Original Message-----
> From: Taeho Kang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 19, 2008 3:13 AM
> To: [EMAIL PROTECTED]; core-user@hadoop.apache.org; [EMAIL PROTECTED]
> Subject: Trash option in hadoop-site.xml configuration.
>
> Hello,
>
> I have two machines that act as clients to HDFS. Node #1 has the Trash
> option enabled (fs.trash.interval set to 60) and Node #2 has it off
> (fs.trash.interval set to 0). When I order a file deletion from Node #2,
> the file gets deleted right away, while the file gets moved to trash when
> I do the same from Node #1.
>
> This is a bit of a surprise to me, because I thought the Trash option set
> in the master node's config file applied to everyone who connects to and
> uses the HDFS. Was there any reason why the Trash option was implemented
> this way?
>
> Thank you in advance,
> /Taeho
MapFile and MapFileOutputFormat
Hi,

I have two questions regarding MapFile in hadoop/hdfs.

First, when using MapFileOutputFormat as a reducer's output, is there any way to change the index interval (i.e., the equivalent of calling setIndexInterval() on the output MapFile)?

Second, is it possible to tell what the position in the data file is for a given key, assuming the index interval is 1 and the number of keys is small?

Thanks,
Rong-En Fan
RE: Trash option in hadoop-site.xml configuration.
Actually, the fs.trash.interval value has no significance on the client beyond being zero or non-zero: if it is non-zero, the client does a rename instead of a delete. The interval itself is used only by the namenode to periodically remove files from Trash; the periodicity is the value specified by fs.trash.interval on the namenode.

Hope this helps,
dhruba

-----Original Message-----
From: Taeho Kang [mailto:[EMAIL PROTECTED]]
Sent: Thu 3/20/2008 1:53 AM
To: core-user@hadoop.apache.org
Subject: Re: Trash option in hadoop-site.xml configuration.

> Thank you for the clarification. Here is another question: if two
> different clients move files to trash with different intervals (e.g.
> client #1 with fs.trash.interval = 60; client #2 with fs.trash.interval
> = 120), what happens? Does the namenode keep track of all this
> information?
>
> /Taeho
>
> On 3/20/08, dhruba Borthakur [EMAIL PROTECTED] wrote:
>> The trash feature is a client-side option and depends on the client
>> configuration file. If the client's configuration specifies that Trash
>> is enabled, then the HDFS client invokes a rename to Trash instead of a
>> delete. Now, if Trash is enabled on the Namenode, then the Namenode
>> periodically removes contents from the Trash directory.
>>
>> This design might be confusing to some users, but it provides the
>> flexibility that different clients in the cluster can have Trash either
>> enabled or disabled.
>>
>> Thanks,
>> dhruba
>>
>> -----Original Message-----
>> From: Taeho Kang [mailto:[EMAIL PROTECTED]]
>> Sent: Wednesday, March 19, 2008 3:13 AM
>> To: [EMAIL PROTECTED]; core-user@hadoop.apache.org; [EMAIL PROTECTED]
>> Subject: Trash option in hadoop-site.xml configuration.
>>
>> Hello,
>>
>> I have two machines that act as clients to HDFS. Node #1 has the Trash
>> option enabled (fs.trash.interval set to 60) and Node #2 has it off
>> (fs.trash.interval set to 0). When I order a file deletion from Node
>> #2, the file gets deleted right away, while the file gets moved to
>> trash when I do the same from Node #1.
>>
>> This is a bit of a surprise to me, because I thought the Trash option
>> set in the master node's config file applied to everyone who connects
>> to and uses the HDFS. Was there any reason why the Trash option was
>> implemented this way?
>>
>> Thank you in advance,
>> /Taeho
Re: MapFile and MapFileOutputFormat
Rong-en Fan wrote:
> I have two questions regarding MapFile in hadoop/hdfs. First, when using
> MapFileOutputFormat as a reducer's output, is there any way to change the
> index interval (i.e., the equivalent of calling setIndexInterval() on the
> output MapFile)?

Not at present. It would probably be good to change MapFile to get this value from the Configuration. A static method could be added, MapFile#setIndexInterval(Configuration conf, int interval), that sets io.mapfile.index.interval, and the MapFile constructor could read this property from the Configuration. One could then use the static method to set this on jobs. If you need this, please file an issue in Jira. If possible, include a patch too. http://wiki.apache.org/hadoop/HowToContribute

> Second, is it possible to tell what the position in the data file is for
> a given key, assuming the index interval is 1 and the number of keys is
> small?

One could read the index file explicitly. It's just a SequenceFile, listing keys and positions in the data file. But why would you set the index interval to 1? And why do you need to know the position?

Doug
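To make the proposal concrete, here is a minimal sketch of the pair of methods Doug describes (this is the suggested change, not the shipped 0.16 API; 128 is the long-standing MapFile default):

    import org.apache.hadoop.conf.Configuration;

    public class MapFileIndexIntervalSketch {
        // The proposed static setter: record the interval in the job's
        // Configuration under io.mapfile.index.interval.
        public static void setIndexInterval(Configuration conf, int interval) {
            conf.setInt("io.mapfile.index.interval", interval);
        }

        // ...and what MapFile.Writer's constructor would read back,
        // falling back to the current default.
        static int getIndexInterval(Configuration conf) {
            return conf.getInt("io.mapfile.index.interval", 128);
        }
    }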
Re: Hadoop on EC2 for large cluster
Hi,

Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It already contains this part:

    echo "Copying private key to slaves"
    for slave in `cat slaves`; do
      scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
      ssh $SSH_OPTS [EMAIL PROTECTED] "chmod 600 /root/.ssh/id_rsa"
      sleep 1
    done

Anyway, did you try the hadoop-ec2 script? It works well for the task you described.

Prasan Ary wrote:
> Hi All,
> I have been trying to configure Hadoop on EC2 for a large cluster (100
> plus nodes). It seems that I have to copy the EC2 private key to all the
> machines in the cluster so that they can have SSH connections. For now it
> seems I have to run a script to copy the key file to each of the EC2
> instances. I wanted to know if there is a better way to accomplish this.
> Thanks, PA

---
Andrey Pankov
Re: Hadoop on EC2 for large cluster
Yes, this isn't ideal for larger clusters. There's a Jira to address this: https://issues.apache.org/jira/browse/HADOOP-2410

Tom

On 20/03/2008, Prasan Ary [EMAIL PROTECTED] wrote:
> Hi All,
> I have been trying to configure Hadoop on EC2 for a large cluster (100
> plus nodes). It seems that I have to copy the EC2 private key to all the
> machines in the cluster so that they can have SSH connections. For now it
> seems I have to run a script to copy the key file to each of the EC2
> instances. I wanted to know if there is a better way to accomplish this.
> Thanks, PA

--
Blog: http://www.lexemetech.com/
Re: Hadoop on EC2 for large cluster
Actually, I personally use the following two-part copy technique to copy files to a cluster of boxes:

    tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp -

The first tar packages myfile into a tar stream. dsh then runs a tar on each host that unpacks the stream (in the above case, all boxes listed in host-list-file end up with /tmp/myfile after the command).

Relevant tar options: C (chdir) and v (verbose, can be given twice), so you see what got copied.

Relevant dsh options:
  -i  copy stdin to all ssh processes; requires -c
  -c  do the ssh calls concurrently
  -M  prefix the output from each ssh with the hostname

While this is not rsync, it has the benefit of running concurrently, and it is quite flexible.

Andreas

On Thursday, 20.03.2008 at 19:57 +0200, Andrey Pankov wrote:
> Hi,
>
> Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It
> already contains this part:
>
>     echo "Copying private key to slaves"
>     for slave in `cat slaves`; do
>       scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
>       ssh $SSH_OPTS [EMAIL PROTECTED] "chmod 600 /root/.ssh/id_rsa"
>       sleep 1
>     done
>
> Anyway, did you try the hadoop-ec2 script? It works well for the task you
> described.
>
> Prasan Ary wrote:
>> Hi All,
>> I have been trying to configure Hadoop on EC2 for a large cluster (100
>> plus nodes). It seems that I have to copy the EC2 private key to all the
>> machines in the cluster so that they can have SSH connections. For now
>> it seems I have to run a script to copy the key file to each of the EC2
>> instances. I wanted to know if there is a better way to accomplish this.
>> Thanks, PA
>
> ---
> Andrey Pankov
Re: Hadoop on EC2 for large cluster
You can't do this with the contrib/ec2 scripts/AMI, but passing the master's private DNS name to the slaves on boot as 'user-data' works fine. When a slave starts, it contacts the master and joins the cluster. There isn't any need for a slave to rsync from the master, which removes the dependency on the slaves having the private key. And by not using the start|stop-all scripts, you don't need to maintain the slaves file, and can thus lazily boot your cluster.

To do this, you will need to create your own AMI that works this way. Not hard, just time consuming.

On Mar 20, 2008, at 11:56 AM, Prasan Ary wrote:
> Chris,
> What do you mean when you say "boot the slaves with the master private
> name"?
>
> Chris K Wensel [EMAIL PROTECTED] wrote:
>> I found it much better to start the master first, then boot the slaves
>> with the master's private name. I do not use the start|stop-all scripts,
>> so I do not need to maintain the slaves file, and thus don't need to
>> push private keys around to support those scripts. This lets me start 20
>> nodes, then add 20 more later. Or kill some.
>>
>> By the way, get Ganglia installed; life will be better knowing what's
>> going on. Also, setting up FoxyProxy on Firefox lets you browse your
>> whole cluster if you set up an ssh tunnel (SOCKS).
>>
>> On Mar 20, 2008, at 10:15 AM, Prasan Ary wrote:
>>> Hi All,
>>> I have been trying to configure Hadoop on EC2 for a large cluster (100
>>> plus nodes). It seems that I have to copy the EC2 private key to all
>>> the machines in the cluster so that they can have SSH connections. For
>>> now it seems I have to run a script to copy the key file to each of
>>> the EC2 instances. I wanted to know if there is a better way to
>>> accomplish this.
>>> Thanks, PA

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
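On the slave side, Chris's scheme might look roughly like this: a hedged sketch, assuming EC2's standard instance-metadata endpoint and that the user-data is simply the master's private DNS name:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class MasterFromUserData {
        public static void main(String[] args) throws Exception {
            // EC2 serves the user-data supplied at launch from a
            // well-known link-local address.
            URL url = new URL("http://169.254.169.254/latest/user-data");
            BufferedReader in =
                new BufferedReader(new InputStreamReader(url.openStream()));
            String master = in.readLine();  // the master's private DNS name
            in.close();
            System.out.println("joining master: " + master);
            // A boot script would write this value into hadoop-site.xml
            // (fs.default.name and mapred.job.tracker) before starting
            // the datanode and tasktracker daemons.
        }
    }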
using a set of MapFiles - getting the right partition
Hi all--

I would like to have a reducer generate a MapFile so that in later processes I can look up the values associated with a few keys without processing an entire sequence file. However, if I have N reducers, I will generate N different map files, so to pick the right map file I will need to use the same partitioner as was used when partitioning the keys to reducers (the reducer I have running emits one value for each key it receives and no others). Should this be done manually, i.e. something like readers[partitioner.getPartition(...)], or is there another recommended method?

Eventually I'm going to migrate to HBase to store the key/value pairs (since I'd like to take advantage of HBase's ability to cache common pairs in memory for faster retrieval), but I'm interested in seeing what the performance is like just using MapFiles.

Thanks, Chris
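The manual route sketched in the question could look like the following (hedged: written against the 0.16-era classes, whose exact signatures vary by release; the "output" path and Text types are placeholders). If your release also includes MapFileOutputFormat.getEntry(readers, partitioner, key, value), it wraps this same lookup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapFileOutputFormat;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One MapFile.Reader per reduce partition (part-00000, ...).
            MapFile.Reader[] readers =
                MapFileOutputFormat.getReaders(fs, new Path("output"), conf);

            // Route the key through the same partitioner the job used,
            // then consult only the matching map file.
            HashPartitioner partitioner = new HashPartitioner();
            Text key = new Text("some-key");
            Text value = new Text();
            int part = partitioner.getPartition(key, null, readers.length);
            readers[part].get(key, value);
            System.out.println(value);
        }
    }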
Re: Hadoop streaming cacheArchive
Amareshwari, thanks for your help. This turned out to be user error (when packaging my JAR, I inadvertently included a lib directory, so the libraries actually existed in HDFS as ./lib/lib/perl..., when I was only expecting ./lib/perl...).

Thanks again,
Norbert

On Thu, Mar 20, 2008 at 3:03 AM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote:
> Norbert Burger wrote:
>> I'm trying to use the cacheArchive command-line option with
>> hadoop-0.15.3-streaming.jar. I'm using the option as follows:
>>
>>     -cacheArchive hdfs://host:50001/user/root/lib.jar#lib
>>
>> Unfortunately, my Perl scripts fail with an error consistent with not
>> being able to find the 'lib' directory (which, as I understand it,
>> should point back to an extracted version of lib.jar).
>
> Here, lib is created as a symlink in the task's working directory. It
> will contain the jar file and the extracted contents of the jar. Where
> are your Perl scripts searching for lib? Is '.' included in your
> classpath? Otherwise you can use the mapred.job.classpath.archives config
> item, which adds the files to the classpath as well as to the distributed
> cache:
>
>     -jobconf mapred.job.classpath.archives=hdfs://host:50001/user/root/lib.jar#lib
>
>> I know that the original JAR exists in HDFS, but I don't see any
>> evidence of lib.jar or a link called 'lib' inside my job.jar.
>
> The link 'lib' will not be part of job.jar. The archive is distributed to
> all the nodes at task launch, and the task's current working directory
> will have the link 'lib' pointing to the jar in the cache.
>
>> How can I troubleshoot cacheArchive further? Should the files/dirs
>> specified via cacheArchive be contained inside the job.jar? If not,
>> where should they be in HDFS?
>
> They can be anywhere on HDFS. You need to give the complete path to add
> them to the cache.
Re: Limiting Total # of TaskTracker threads
On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning [EMAIL PROTECTED] wrote:
> I think the original request was to limit the sum of maps and reduces
> rather than limiting the two parameters independently.

Ted, yes, this is exactly what I'm looking for. I just found an issue which seems to state that the old deprecated property is still there, but not documented: https://issues.apache.org/jira/browse/HADOOP-2300

I tried using the max-tasks property in combination with setting the new values, but that didn't seem to work. =( My machine labelled LIMITED MACHINE below had 2 maps and 1 reduce running at the same time. The scenario I have is that I want to run multiple concurrent jobs through my cluster and have the CPU usage for that node be bounded. Should I file a new issue? This was all with Hadoop 0.16.0.

LIMITED MACHINE:

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>2</value>
      <description>The maximum number of total tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
      <description>The maximum number of map tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
      <description>The maximum number of reduce tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

OTHER CLUSTER MACHINES:

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>8</value>
      <description>The maximum number of total tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
      <description>The maximum number of map tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
      <description>The maximum number of reduce tasks that will be run
      simultaneously by a task tracker.</description>
    </property>

On 3/18/08 5:26 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
> The map/reduce tasks are not threads, they are run in separate JVMs which
> are forked by the tasktracker.

Arun, yes, I did mean tasks, not threads.

--
Jimmy
Default Combiner or default combining behaviour?
Hi,

The MapReduce tutorial mentions Combiners only in passing. Is there a default Combiner or default combining behaviour? Concretely, I want to make sure that records are not getting combined behind the scenes in some way without me seeing it, causing me to lose data. For instance, if there were a default Combiner or default combining behaviour that collapsed multiple records with identical keys and values into a single record, I'd like to avoid that. Instead of blindly collapsing identical records, I'd want to aggregate their values and emit the aggregate.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
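For reference, combining only happens when a combiner class is set explicitly on the job; nothing runs by default. A minimal sketch of the aggregate-rather-than-collapse setup described above, using the stock LongSumReducer (this assumes LongWritable values; the rest of the job wiring is omitted):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;

    public class CombinerSetup {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);
            // Without this line, no combining happens at all.
            // LongSumReducer sums the values for each key instead of
            // collapsing duplicate records.
            conf.setCombinerClass(LongSumReducer.class);
        }
    }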
Re: Partitioning reduce output by date
Thank you, Doug and Ted. This pointed me in the right direction, which led to a custom OutputFormat and a RecordWriter that opens and closes the DataOutputStream based on the current key (if the current key differs from the previous key, close the previous output and open a new one, then write).

As for partitioning, that worked too. My getPartition method now has:

    int dateHash = startDate.hashCode();
    if (dateHash < 0)          // note: -Integer.MIN_VALUE is still negative;
        dateHash = -dateHash;  // (dateHash & Integer.MAX_VALUE) avoids that edge case
    int partitionID = dateHash % numPartitions;
    return partitionID;

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doug Cutting [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Wednesday, March 19, 2008 4:39:04 PM
Subject: Re: Partitioning reduce output by date

> Otis Gospodnetic wrote:
>> That numPartitions corresponds to the number of reduce tasks. What I
>> need is partitioning that corresponds to the number of unique dates
>> (yyyy-mm-dd) processed by the Mapper and not the number of reduce tasks.
>> I don't know the number of distinct dates in the input ahead of time,
>> though, so I cannot just specify the same number of reduces. I *can* get
>> the number of unique dates by keeping track of dates in map(). I was
>> going to take this approach and use this number in the getPartition()
>> method, but apparently getPartition(...) is called as each input row is
>> processed by the map() call. This causes a problem for me, as I know the
>> total number of unique dates only after *all* of the input is processed
>> by map().
>
> The number of partitions is indeed the number of reduces. If you were to
> compute it during map, then each map might generate a different number.
> Each map must partition into the same space, so that all partition 0 data
> can go to one reduce, partition 1 to another, and so on.
>
> I think Ted pointed you in the right direction: your Partitioner should
> partition by the hash of the date, then your OutputFormat should start
> writing a new file each time the date changes. That will give you a
> unique file per date.
>
> Doug
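The roll-on-key-change writer Otis describes could have a core like this: a hedged sketch, with the class name, base directory, and Text key/value types all hypothetical, and the surrounding OutputFormat plumbing omitted:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;

    public class DateRollingWriter {
        private final FileSystem fs;
        private final Path baseDir;     // directory for the per-date files
        private String currentDate;     // date of the stream currently open
        private FSDataOutputStream out;

        public DateRollingWriter(FileSystem fs, Path baseDir) {
            this.fs = fs;
            this.baseDir = baseDir;
        }

        // Called once per record. Reduce input arrives sorted by key, so
        // each change of date means the previous date's file is complete.
        public void write(Text dateKey, Text value) throws IOException {
            String date = dateKey.toString();
            if (!date.equals(currentDate)) {
                if (out != null) {
                    out.close();
                }
                out = fs.create(new Path(baseDir, date));
                currentDate = date;
            }
            out.writeBytes(value.toString() + "\n");
        }

        public void close() throws IOException {
            if (out != null) {
                out.close();
            }
        }
    }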
RE: Partitioning reduce output by date
If you want to output data to different files based on date or any other parts of the value, you may want to check https://issues.apache.org/jira/browse/HADOOP-2906

Runping

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 20, 2008 4:00 PM
To: core-user@hadoop.apache.org
Subject: Re: Partitioning reduce output by date

> Thank you, Doug and Ted. This pointed me in the right direction, which
> led to a custom OutputFormat and a RecordWriter that opens and closes the
> DataOutputStream based on the current key (if the current key differs
> from the previous key, close the previous output and open a new one, then
> write).
>
> As for partitioning, that worked too. My getPartition method now has:
>
>     int dateHash = startDate.hashCode();
>     if (dateHash < 0)
>         dateHash = -dateHash;
>     int partitionID = dateHash % numPartitions;
>     return partitionID;
>
> Thanks,
> Otis
>
> ----- Original Message -----
> From: Doug Cutting [EMAIL PROTECTED]
> To: core-user@hadoop.apache.org
> Sent: Wednesday, March 19, 2008 4:39:04 PM
> Subject: Re: Partitioning reduce output by date
>
>> Otis Gospodnetic wrote:
>>> That numPartitions corresponds to the number of reduce tasks. What I
>>> need is partitioning that corresponds to the number of unique dates
>>> (yyyy-mm-dd) processed by the Mapper and not the number of reduce
>>> tasks. I don't know the number of distinct dates in the input ahead of
>>> time, though, so I cannot just specify the same number of reduces. I
>>> *can* get the number of unique dates by keeping track of dates in
>>> map(). I was going to take this approach and use this number in the
>>> getPartition() method, but apparently getPartition(...) is called as
>>> each input row is processed by the map() call. This causes a problem
>>> for me, as I know the total number of unique dates only after *all* of
>>> the input is processed by map().
>>
>> The number of partitions is indeed the number of reduces. If you were
>> to compute it during map, then each map might generate a different
>> number. Each map must partition into the same space, so that all
>> partition 0 data can go to one reduce, partition 1 to another, and so
>> on.
>>
>> I think Ted pointed you in the right direction: your Partitioner should
>> partition by the hash of the date, then your OutputFormat should start
>> writing a new file each time the date changes. That will give you a
>> unique file per date.
>>
>> Doug
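HADOOP-2906 became the MultipleOutputFormat family in a later release; assuming that API, routing records by date key reduces to overriding a single method (a hedged sketch; the class name here is hypothetical, and the method name should be checked against the version you run):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class DateOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        // Route each record to a file named after its date key,
        // e.g. key "2008-03-20" -> an output file named 2008-03-20.
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value,
                                                     String name) {
            return key.toString();
        }
    }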
Re: Limiting Total # of TaskTracker threads
Hi,

> The map/reduce tasks are not threads, they are run in separate JVMs which
> are forked by the tasktracker.

I don't understand why. Is this a design choice to support task failures? I would think that, on the other hand, running a thread queue (of tasks) per job per JVM would greatly improve performance, since there would be fewer JVM init times.

K. Honsali

On 21/03/2008, Jimmy Wan [EMAIL PROTECTED] wrote:
> On Tue, 18 Mar 2008 19:53:04 -0500, Ted Dunning [EMAIL PROTECTED] wrote:
>> I think the original request was to limit the sum of maps and reduces
>> rather than limiting the two parameters independently.
>
> Ted, yes, this is exactly what I'm looking for. I just found an issue
> which seems to state that the old deprecated property is still there, but
> not documented: https://issues.apache.org/jira/browse/HADOOP-2300
>
> I tried using the max-tasks property in combination with setting the new
> values, but that didn't seem to work. =( My machine labelled LIMITED
> MACHINE below had 2 maps and 1 reduce running at the same time. The
> scenario I have is that I want to run multiple concurrent jobs through my
> cluster and have the CPU usage for that node be bounded. Should I file a
> new issue? This was all with Hadoop 0.16.0.
>
> LIMITED MACHINE:
>
>     <property>
>       <name>mapred.tasktracker.tasks.maximum</name>
>       <value>2</value>
>       <description>The maximum number of total tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.map.tasks.maximum</name>
>       <value>1</value>
>       <description>The maximum number of map tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>       <value>1</value>
>       <description>The maximum number of reduce tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
> OTHER CLUSTER MACHINES:
>
>     <property>
>       <name>mapred.tasktracker.tasks.maximum</name>
>       <value>8</value>
>       <description>The maximum number of total tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.map.tasks.maximum</name>
>       <value>4</value>
>       <description>The maximum number of map tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
>     <property>
>       <name>mapred.tasktracker.reduce.tasks.maximum</name>
>       <value>4</value>
>       <description>The maximum number of reduce tasks that will be run
>       simultaneously by a task tracker.</description>
>     </property>
>
> On 3/18/08 5:26 PM, Arun C Murthy [EMAIL PROTECTED] wrote:
>> The map/reduce tasks are not threads, they are run in separate JVMs
>> which are forked by the tasktracker.
>
> Arun, yes, I did mean tasks, not threads.
>
> --
> Jimmy