RE: How can I control Number of Mappers of a job?

2008-08-04 Thread Goel, Ankur
This can be done very easily by setting the number of mappers you want via
jobConf.setNumMapTasks() and using the input format
MultiFileWordCount.MyInputFormat.class, which is a concrete
implementation of MultiFileInputFormat.
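
A minimal sketch of that combination (the org.apache.hadoop.examples package
path for MultiFileWordCount is assumed from the 0.1x releases, MyDriver is a
placeholder job class, the rest of the job setup (mapper, reducer, output
types) is omitted, and setNumMapTasks() is only a hint to the framework):

import org.apache.hadoop.examples.MultiFileWordCount;
import org.apache.hadoop.mapred.JobConf;

JobConf jobConf = new JobConf(MyDriver.class);  // MyDriver: placeholder driver class
jobConf.setNumMapTasks(10);                     // desired number of map tasks (a hint)
// Concrete MultiFileInputFormat: packs many small files into fewer splits.
jobConf.setInputFormat(MultiFileWordCount.MyInputFormat.class);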

-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED] 
Sent: Saturday, August 02, 2008 5:41 AM
To: core-user@hadoop.apache.org
Subject: Re: How can I control Number of Mappers of a job?

We control the number of map tasks by carefully managing the input split
size when we need to. This may require using the MultiFileInput classes or
aggregating your input files beforehand. You need to have some aggregation,
either by concatenation or via MultiFileInput, if you have more input files
than you want map tasks.

The case of 1 mapper per input file requires setting the input split size
to Long.MAX_VALUE (see the datajoin classes for examples).
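
A hedged fragment of that last point, using the mapred.min.split.size property
from the 0.1x configuration (MyDriver is a placeholder job class; whether you
also need a non-splitting input format depends on your files):

import org.apache.hadoop.mapred.JobConf;

JobConf jobConf = new JobConf(MyDriver.class);
// Make the minimum split size larger than any input file, so each file
// becomes exactly one split and therefore one map task.
jobConf.setLong("mapred.min.split.size", Long.MAX_VALUE);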



paul wrote:
 I've talked to a few people that claim to have done this as a way to limit
 resources for different groups, like developers versus production jobs.
 Haven't tried it myself yet, but it's getting close to the top of my to-do
 list.


 -paul


 On Fri, Aug 1, 2008 at 1:36 PM, James Moore [EMAIL PROTECTED] wrote:

   
 On Thu, Jul 31, 2008 at 12:30 PM, Gopal Gandhi [EMAIL PROTECTED] wrote:

 Thank you, finally someone has interest in my questions =)
 My cluster contains more than one machine. Please don't get me wrong :-).
 I don't want to limit the total mappers on one node (by mapred.map.tasks).
 What I want is to limit the total mappers for one job. The motivation is
 that I have 2 jobs to run at the same time; they have the same input data
 in Hadoop. I found that one job has to wait until the other finishes its
 mapping. Because the 2 jobs are submitted by 2 different people, I don't
 want one job to be starving. So I want to limit the first job's total
 mappers so that the 2 jobs will be launched simultaneously.

 What about running two different jobtrackers on the same machines,
 looking at the same DFS files?  Never tried it myself, but it might be
 an approach.

 --
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com

 

   

-- 
Jason Venner
Attributor - Program the Web http://www.attributor.com/
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


EOFException while starting name node

2008-08-04 Thread Wanjari, Amol
I'm getting the following exceptions while starting the name node -

ERROR dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data.

Thanks,
Amol 


Re: EOFException while starting name node

2008-08-04 Thread steph



I have the same thing:
ERROR dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)



I would appreciate any advice. I tried to move the 'edits' file and
recreate a new one, but that did not work.

Thanks,

S.

On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:


I'm getting the following exceptions while starting the name node -

ERROR dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data.

Thanks,
Amol




Re: EOFException while starting name node

2008-08-04 Thread steph


2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:178)
   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
   at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
   at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)


Actually my exception is slightly different than yours. Maybe moving the
edits file and recreating a new one will work for you.


On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:


I'm getting the following exceptions while starting the name node -

ERROR dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data.

Thanks,
Amol




Re: EOFException while starting name node

2008-08-04 Thread lohit
We have seen a similar exception reported earlier by others on the list. What you
might want to try is to use a hex editor or equivalent to open up 'edits' and
get rid of the last record. In most such cases, the last record is not complete,
which is why your namenode is not starting. Once you update your edits, start the
namenode and run 'hadoop fsck /' to see if you have any corrupt files and
fix/get rid of them.
PS: Take a backup of dfs.name.dir before updating and playing around with it.

Thanks,
Lohit



- Original Message 
From: steph [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Monday, August 4, 2008 8:31:07 AM
Subject: Re: EOFException while starting name node


2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:178)
   at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
   at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
   at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
   at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
   at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
   at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
   at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
   at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
   at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
   at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
   at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
   at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
   at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)


Actually my exception is slightly different than yours. Maybe moving the
edits file and recreating a new one will work for you.


On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

 I'm getting the following exceptions while starting the name node -

 ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

 Is there a way to recover the name node without losing any data.

 Thanks,
 Amol


having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Meng Mao
I'm trying to set up 2 Hadoop installations on my master node, one of which
will have permissions that allow more users to run Hadoop.
But I don't really need anything different on the datanodes, so I'd like to
keep those as-is. With that switch, the HADOOP_HOME on the master will be
different from that on the datanodes.

After shutting down the old hadoop, I tried to start-all the new one, and
encountered this:
$ bin/stop-all.sh
no jobtracker to stop
node2: bash: line 0: cd: /new/dir/hadoop/bin/..: No such file or directory
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

I consulted the documentation at:
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Installation
which only has 2 bits of info on this --
1) "The root of the distribution is referred to as HADOOP_HOME. All machines
in the cluster usually have the same HADOOP_HOME path."
and
2) "Once all the necessary configuration is complete, distribute the files
to the HADOOP_CONF_DIR directory on all the machines, typically
${HADOOP_HOME}/conf."

So I forgot to do anything about the second instruction. After doing so, I
got:
$ bin/stop-all.sh
no jobtracker to stop
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

Ok, it found the config dir, but now it expects the binary to be located at
the same HADOOP_HOME that the master uses?

I suppose I could, for each datanode, symlink things to point to the actual
Hadoop installation. But really, I would like the setup that is hinted as
possible by statement 1). Is there a way I could do it, or should that bit
of documentation read, "All machines in the cluster _must_ have the same
HADOOP_HOME"?

Thanks!


Re: having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Allen Wittenauer



On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote:
 I suppose I could, for each datanode, symlink things to point to the actual
 Hadoop installation. But really, I would like the setup that is hinted as
 possible by statement 1). Is there a way I could do it, or should that bit
 of documentation read, All machines in the cluster _must_ have the same
 HADOOP_HOME?

If you run the -all scripts, they assume the location is the same.
AFAIK, there is nothing preventing you from building your own -all scripts
that point to the different location to start/stop the data nodes.




data partitioning question

2008-08-04 Thread Shirley Cohen

Hi,

I want to implement some data partitioning logic where a mapper is  
assigned a specific range of values. Here is a concrete example of  
what I have in mind:


Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12,  2, 7)

What I want to do is assign mapper x all the tuples where the C  
attribute = 1, 3, 5, and 7.


1-Is it possible to write a smart InputFormat class that can assign a  
set of records to a specific mapper? If so, how?
2-How will this type of partitioning logic interact with HDFS data  
locality?



Thanks,

Shirley



Re: data partitioning question

2008-08-04 Thread Qin Gao
For the first question, I think it is better to do it at the reduce stage,
because the input splits handed to the mappers are based only on size in
bytes, not on content. Instead, you can output the intermediate key/value
pairs like this:

key: 1 if C = 1, 3, 5, or 7; 0 otherwise
value: the tuple.

In the reduce stage you can then have one reducer deal with all the tuples
where C = 1, 3, 5, or 7.
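
A rough sketch of that mapper with the old mapred API (the class name is
hypothetical, and it assumes each tuple arrives as one comma-separated text
line with C as the third field):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RangeTagMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private static final IntWritable ONE = new IntWritable(1);
  private static final IntWritable ZERO = new IntWritable(0);

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    // Third field is the C attribute; strip any stray punctuation like ')'.
    String[] fields = line.toString().split(",");
    int c = Integer.parseInt(fields[2].replaceAll("[^0-9]", ""));
    // Tag the tuple: key 1 for C in {1,3,5,7}, key 0 for everything else.
    out.collect((c == 1 || c == 3 || c == 5 || c == 7) ? ONE : ZERO, line);
  }
}

With jobConf.setNumReduceTasks(2) and the default hash partitioner, the two
tags land on separate reducers, so one reducer sees exactly the C = 1, 3, 5, 7
tuples.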

On Mon, Aug 4, 2008 at 3:29 PM, Shirley Cohen [EMAIL PROTECTED] wrote:

 Hi,

 I want to implement some data partitioning logic where a mapper is assigned
 a specific range of values. Here is a concrete example of what I have in
 mind:

 Suppose I have attributes A, B, C and the following tuples:

 (A, B, C)
 (1, 3, 1)
 (1, 2, 2)
 (1, 2, 3)
 (12, 3, 4)
 (12, 2, 5)
 (12, 8, 6)
 (12,  2, 7)

 What I want to do is assign mapper x all the tuples where the C attribute =
 1, 3, 5, and 7.

 1-Is it possible to write a smart InputFormat class that can assign a set
 of records to a specific mapper? If so, how?
 2-How will this type of partitioning logic interact with HDFS data
 locality?


 Thanks,

 Shirley




Re: having different HADOOP_HOME for master and slaves?

2008-08-04 Thread Meng Mao
I see. I think I could also modify the hadoop-env.sh in the new conf/
folders per datanode to point to the right place for HADOOP_HOME.

On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer [EMAIL PROTECTED] wrote:




 On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote:
  I suppose I could, for each datanode, symlink things to point to the
 actual
  Hadoop installation. But really, I would like the setup that is hinted as
  possible by statement 1). Is there a way I could do it, or should that
 bit
  of documentation read, All machines in the cluster _must_ have the same
  HADOOP_HOME?

 If you run the -all scripts, they assume the location is the same.
 AFAIK, there is nothing preventing you from building your own -all scripts
 that point to the different location to start/stop the data nodes.





-- 
hustlin, hustlin, everyday I'm hustlin


Examples of using DFS without MapReduce

2008-08-04 Thread Kevin
Hi there,

I am trying to use the DFS of Hadoop in other applications. It is not
clear to me how that could be carried out easily. Could anyone give a
direction to go in, or some examples? Thank you.

-Kevin


Re: Examples of using DFS without MapReduce

2008-08-04 Thread Kevin
Thank you! The Java code is exactly what I want.

Following your code, I encountered a user permission issue when trying
to write to a file. I wonder if the user id could be manipulated in
the protocol.
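
One thing that might be worth a try (hedged sketch: hadoop.job.ugi is the
property the pre-security 0.1x releases read for the client's identity, in
"user,group1,group2" form; the host and names below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration config = new Configuration();
config.set("fs.default.name", "namenode.example.com:1");
// Claim a different user/group on the client side; these releases trust
// whatever identity the client sends.
config.set("hadoop.job.ugi", "hadoopuser,supergroup");
FileSystem fs = FileSystem.get(config);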

-Kevin



On Mon, Aug 4, 2008 at 2:27 PM, Michael Bieniosek [EMAIL PROTECTED] wrote:
 You can make shell calls:

 hadoop/bin/hadoop fs -fs namenode.example.com:1 -ls /

 If you're in java, you can use the org.apache.hadoop.fs.FileSystem class:

 Configuration config = new Configuration();
 config.set("fs.default.name", "namenode.example.com:1");
 FileSystem fs = FileSystem.get(config);
 fs.listStatus(new Path("/"));

 -Michael

 On 8/4/08 1:53 PM, Kevin [EMAIL PROTECTED] wrote:

 Hi there,

 I am trying to use the DFS of hadoop in other applications. It is not
 clear to me how that could be carried out easily. Could any one give a
 direction to go or examples? Thank you.

 -Kevin




Re: mapper input file name

2008-08-04 Thread Kevin
OK, I think I figured out how: override the configure() method of the
user-defined Map class so that you can take note of the filename.
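
A minimal sketch of that approach with the old mapred API (the class name is
made up; map.input.file is the property Amareshwari mentions below):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile;

  // configure() runs once per map task with that task's JobConf, so the
  // current split's file name can be remembered here.
  public void configure(JobConf job) {
    inputFile = job.get("map.input.file");
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Emit the file name with each record, just to show it is available.
    out.collect(new Text(inputFile), value);
  }
}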

-Kevin



On Mon, Aug 4, 2008 at 3:53 PM, Kevin [EMAIL PROTECTED] wrote:
 Is it possible to get this information in user defined map function?
 i.e., how do we get the JobConf object in map() function?

 Another way is to subclass RecordReader to embed file-name in the
 data, which does not look simple.

 -Kevin



 On Sun, Aug 3, 2008 at 10:17 PM, Amareshwari Sriramadasu
 [EMAIL PROTECTED] wrote:
 You can get the file name accessed by the mapper using the config property
 map.input.file

 Thanks
 Amareshwari
 Deyaa Adranale wrote:

 Hi,

 I need to know inside my mapper, the name of the file that contains the
 current record.
 I saw that I can access the name of the input directories inside
 mapper.config(), but my input contains different files and I need to know
 the name of the current one.

 any hints?

 thanks in advance,

 Deyaa





Re: data partitioning question

2008-08-04 Thread Shirley Cohen
Thanks, Qin. It sounds like you're saying that this type of  
partitioning needs its own map-reduce set.


I was hoping it could be done in the InputFormat class :))

Shirley

On Aug 4, 2008, at 2:49 PM, Qin Gao wrote:


For the first question, I think it is better to do it at the reduce stage,
because the input splits handed to the mappers are based only on size in
bytes, not on content. Instead, you can output the intermediate key/value
pairs like this:

key: 1 if C = 1, 3, 5, or 7; 0 otherwise
value: the tuple.

In the reduce stage you can then have one reducer deal with all the tuples
where C = 1, 3, 5, or 7.


On Mon, Aug 4, 2008 at 3:29 PM, Shirley Cohen [EMAIL PROTECTED] wrote:

Hi,

I want to implement some data partitioning logic where a mapper is assigned
a specific range of values. Here is a concrete example of what I have in
mind:

Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12,  2, 7)

What I want to do is assign mapper x all the tuples where the C attribute =
1, 3, 5, and 7.

1-Is it possible to write a smart InputFormat class that can assign a set
of records to a specific mapper? If so, how?
2-How will this type of partitioning logic interact with HDFS data
locality?


Thanks,

Shirley