RE: How can I control Number of Mappers of a job?
This can be done very easily: set the number of mappers you want with jobConf.setNumMapTasks() and use the input format MultiFileWordCount.MyInputFormat.class, which is a concrete implementation of MultiFileInputFormat.

-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 02, 2008 5:41 AM
To: core-user@hadoop.apache.org
Subject: Re: How can I control Number of Mappers of a job?

We control the number of map tasks by carefully managing the input split size when we need to. This may require using the MultiFileInputFormat classes or aggregating your input files beforehand. You need some aggregation, either by concatenation or via MultiFileInputFormat, if you have more input files than you want map tasks. The case of one mapper per input file requires setting the input split size to Long.MAX_VALUE (see the datajoin classes for examples).

paul wrote:

I've talked to a few people who claim to have done this as a way to limit resources for different groups, like developers versus production jobs. Haven't tried it myself yet, but it's getting close to the top of my to-do list.

-paul

On Fri, Aug 1, 2008 at 1:36 PM, James Moore [EMAIL PROTECTED] wrote:

On Thu, Jul 31, 2008 at 12:30 PM, Gopal Gandhi [EMAIL PROTECTED] wrote:

Thank you, finally someone has taken an interest in my questions =) My cluster contains more than one machine. Please don't get me wrong :-). I don't want to limit the total mappers on one node (by mapred.map.tasks). What I want is to limit the total mappers for one job. The motivation is that I have 2 jobs to run at the same time; they have the same input data in Hadoop. I found that one job has to wait until the other finishes its mapping. Because the 2 jobs are submitted by 2 different people, I don't want one job to starve, so I want to limit the first job's total mappers so that the 2 jobs will be launched simultaneously.

What about running two different jobtrackers on the same machines, looking at the same DFS files? Never tried it myself, but it might be an approach.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com

--
Jason Venner
Attributor - Program the Web
http://www.attributor.com/
Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
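[Editor's note] For concreteness, a minimal sketch of the suggestion above using the 0.17/0.18-era org.apache.hadoop.mapred API. The paths, job name, and the count of 4 are placeholders, and note that setNumMapTasks() is only a hint: the InputFormat's splits decide how many map tasks actually run, which is why a MultiFileInputFormat (packing many files into few splits) is needed to make the hint stick.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapperCountSketch {
  public static void main(String[] args) throws Exception {
    JobConf jobConf = new JobConf(MapperCountSketch.class);
    jobConf.setJobName("mapper-count-sketch");

    // A hint, not a guarantee: the InputFormat decides the real split
    // count. A MultiFileInputFormat subclass honors the hint by packing
    // many small files into roughly this many splits.
    jobConf.setNumMapTasks(4);

    // e.g. the concrete subclass from the Hadoop examples jar:
    // jobConf.setInputFormat(MultiFileWordCount.MyInputFormat.class);

    FileInputFormat.setInputPaths(jobConf, new Path("/user/demo/in"));
    FileOutputFormat.setOutputPath(jobConf, new Path("/user/demo/out"));

    // Mapper, reducer, and output key/value classes omitted; set them
    // as usual before submitting.
    JobClient.runJob(jobConf);
  }
}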
EOFException while starting name node
I'm getting the following exception while starting the name node:

ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data?

Thanks,
Amol
Re: EOFException while starting name node
I have the same thing:

ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

I would appreciate any advice. I tried to move the 'edits' file and recreate a new one, but that did not work.

Thanks,
S.

On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

I'm getting the following exception while starting the name node:

ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data?

Thanks,
Amol
Re: EOFException while starting name node
2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

Actually, my exception is slightly different from yours. Maybe moving the edits file and recreating a new one will work for you.

On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

I'm getting the following exception while starting the name node:

ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data?

Thanks,
Amol
Re: EOFException while starting name node
We have seen a similar exception reported earlier by others on the list. What you might want to try is to use a hex editor or equivalent to open up 'edits' and get rid of the last record. In such cases the last record may be incomplete, which is why your namenode is not starting. Once you have updated your edits file, start the namenode and run 'hadoop fsck /' to see if you have any corrupt files, and fix or get rid of them.

PS: Take a backup of dfs.name.dir before updating it and playing around with it.

Thanks,
Lohit

- Original Message -
From: steph [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Monday, August 4, 2008 8:31:07 AM
Subject: Re: EOFException while starting name node

2008-08-03 21:58:33,108 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2008-08-03 21:58:33,109 ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

Actually, my exception is slightly different from yours. Maybe moving the edits file and recreating a new one will work for you.

On Aug 4, 2008, at 2:53 AM, Wanjari, Amol wrote:

I'm getting the following exception while starting the name node:

ERROR dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:87)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:455)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:733)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:620)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:76)
    at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:221)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:130)
    at org.apache.hadoop.dfs.NameNode.init(NameNode.java:168)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:795)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:804)

Is there a way to recover the name node without losing any data?

Thanks,
Amol
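[Editor's note] To make the truncation step concrete, a minimal, hypothetical Java sketch. It assumes you have already used the hex editor to find the byte offset at which the last complete record ends, and that you are working on a copy of dfs.name.dir; the path is a placeholder.

import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateEdits {
    public static void main(String[] args) throws IOException {
        // Placeholder path: the edits log inside your copied dfs.name.dir.
        String editsPath = "/backup/dfs/name/current/edits";
        // Byte offset of the end of the last COMPLETE record, found by
        // inspecting the file in a hex editor (hypothetical input).
        long lastGoodOffset = Long.parseLong(args[0]);

        RandomAccessFile edits = new RandomAccessFile(editsPath, "rw");
        try {
            // Drop the trailing partial record that loadFSEdits() chokes on.
            edits.setLength(lastGoodOffset);
        } finally {
            edits.close();
        }
    }
}

After copying the truncated file back and restarting the namenode, 'hadoop fsck /' will report any files the truncation orphaned.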
having different HADOOP_HOME for master and slaves?
I'm trying to set up 2 Hadoop installations on my master node, one of which will have permissions that allow more users to run Hadoop. But I don't really need anything different on the datanodes, so I'd like to keep those as-is. With that switch, the HADOOP_HOME on the master will be different from that on the datanodes.

After shutting down the old Hadoop, I tried to start-all the new one, and encountered this:

$ bin/stop-all.sh
no jobtracker to stop
node2: bash: line 0: cd: /new/dir/hadoop/bin/..: No such file or directory
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

I consulted the documentation at:
http://hadoop.apache.org/core/docs/current/cluster_setup.html#Installation
which only has 2 bits of info on this --
1) "The root of the distribution is referred to as HADOOP_HOME. All machines in the cluster usually have the same HADOOP_HOME path."
2) "Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines, typically ${HADOOP_HOME}/conf."

I had forgotten to do anything about the second instruction. After fixing that, I got:

$ bin/stop-all.sh
no jobtracker to stop
node2: bash: /new/dir/hadoop/bin/hadoop-daemon.sh: No such file or directory

OK, it found the config dir, but now it expects the binary to be located at the same HADOOP_HOME that the master uses?

I suppose I could, for each datanode, symlink things to point to the actual Hadoop installation. But really, I would like the setup that is hinted at as possible by statement 1). Is there a way I could do it, or should that bit of documentation read, "All machines in the cluster _must_ have the same HADOOP_HOME"?

Thanks!
Re: having different HADOOP_HOME for master and slaves?
On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote:

I suppose I could, for each datanode, symlink things to point to the actual Hadoop installation. But really, I would like the setup that is hinted at as possible by statement 1). Is there a way I could do it, or should that bit of documentation read, "All machines in the cluster _must_ have the same HADOOP_HOME"?

If you run the -all scripts, they assume the location is the same. AFAIK, there is nothing preventing you from building your own -all scripts that point to the different location to start/stop the data nodes.
data partitioning question
Hi,

I want to implement some data partitioning logic where a mapper is assigned a specific range of values. Here is a concrete example of what I have in mind. Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12, 2, 7)

What I want to do is assign mapper x all the tuples where the C attribute is 1, 3, 5, or 7.

1) Is it possible to write a smart InputFormat class that can assign a set of records to a specific mapper? If so, how?
2) How will this type of partitioning logic interact with HDFS data locality?

Thanks,
Shirley
Re: data partitioning question
For the first question, I think it is better to do this at the reduce stage, because the partitioning of the input only considers the size of blocks in bytes, not the values of the records. Instead, you can output the intermediate key/value pairs like this:

key: 1 if C = 1, 3, 5, or 7; 0 otherwise
value: the tuple

In the reducer, you can then have one reducer deal with all the keys where C = 1, 3, 5, or 7.

On Mon, Aug 4, 2008 at 3:29 PM, Shirley Cohen [EMAIL PROTECTED] wrote:

Hi,

I want to implement some data partitioning logic where a mapper is assigned a specific range of values. Here is a concrete example of what I have in mind. Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12, 2, 7)

What I want to do is assign mapper x all the tuples where the C attribute is 1, 3, 5, or 7.

1) Is it possible to write a smart InputFormat class that can assign a set of records to a specific mapper? If so, how?
2) How will this type of partitioning logic interact with HDFS data locality?

Thanks,
Shirley
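[Editor's note] A minimal sketch of this approach using the 0.17/0.18-era org.apache.hadoop.mapred API, under the assumption that the tuples arrive as comma-separated (A, B, C) text lines; the field position and target set are illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PartitionByCMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  // The C values that should all land in the same reduce group.
  private static final Set<String> TARGET_C =
      new HashSet<String>(Arrays.asList("1", "3", "5", "7"));

  private final IntWritable outKey = new IntWritable();

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    // Assumes lines like "1,3,1" with C in the third column.
    String[] fields = line.toString().split(",");
    String c = fields[2].trim();

    // Key 1 routes C in {1,3,5,7} to one reduce group, key 0 the rest.
    outKey.set(TARGET_C.contains(c) ? 1 : 0);
    out.collect(outKey, line);
  }
}

With jobConf.setNumReduceTasks(2) and the default hash partitioner, the matching tuples all arrive at one reducer and everything else at the other.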
Re: having different HADOOP_HOME for master and slaves?
I see. I think I could also modify the hadoop-env.sh in the new conf/ folders on each datanode to point to the right place for HADOOP_HOME.

On Mon, Aug 4, 2008 at 3:21 PM, Allen Wittenauer [EMAIL PROTECTED] wrote:

On 8/4/08 11:10 AM, Meng Mao [EMAIL PROTECTED] wrote:

I suppose I could, for each datanode, symlink things to point to the actual Hadoop installation. But really, I would like the setup that is hinted at as possible by statement 1). Is there a way I could do it, or should that bit of documentation read, "All machines in the cluster _must_ have the same HADOOP_HOME"?

If you run the -all scripts, they assume the location is the same. AFAIK, there is nothing preventing you from building your own -all scripts that point to the different location to start/stop the data nodes.

--
hustlin, hustlin, everyday I'm hustlin
Examples of using DFS without MapReduce
Hi there,

I am trying to use the DFS of Hadoop in other applications. It is not clear to me how that could be carried out easily. Could anyone give me a direction to go in, or some examples? Thank you.

-Kevin
Re: Examples of using DFS without MapReduce
Thank you! The Java code is exactly what I want. Following your code, I encounter a user permission issue when trying to write to a file. I wonder if the user id could be manipulated in the protocol.

-Kevin

On Mon, Aug 4, 2008 at 2:27 PM, Michael Bieniosek [EMAIL PROTECTED] wrote:

You can make shell calls:

hadoop/bin/hadoop fs -fs namenode.example.com:1 -ls /

If you're in Java, you can use the org.apache.hadoop.fs.FileSystem class:

Configuration config = new Configuration();
config.set("fs.default.name", "namenode.example.com:1");
FileSystem fs = FileSystem.get(config);
fs.listStatus(new Path("/"));

-Michael

On 8/4/08 1:53 PM, Kevin [EMAIL PROTECTED] wrote:

Hi there,

I am trying to use the DFS of Hadoop in other applications. It is not clear to me how that could be carried out easily. Could anyone give me a direction to go in, or some examples? Thank you.

-Kevin
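[Editor's note] Expanding Michael's snippet into a self-contained sketch that writes and reads a file over HDFS with no MapReduce involved — only the namenode and datanodes need to be up. The namenode URI and path are placeholders, and the hadoop.job.ugi line is an assumption about 0.17/0.18-era releases, where that comma-separated "user,group" property was the usual way to pick the identity the client presents (there was no real authentication at the time).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsWithoutMapReduce {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Placeholder namenode host:port.
        config.set("fs.default.name", "hdfs://namenode.example.com:9000");
        // Assumption for 0.17/0.18-era clusters: client identity comes
        // from this property, "user,group1,group2".
        config.set("hadoop.job.ugi", "hadoop,supergroup");

        FileSystem fs = FileSystem.get(config);

        Path p = new Path("/user/demo/hello.txt");
        FSDataOutputStream out = fs.create(p);  // overwrites by default
        out.writeUTF("hello from a plain Java client");
        out.close();

        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}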
Re: mapper input file name
OK, I guess I found out how: override the configure() method of the user-defined Map class so that you can take note of the filename.

-Kevin

On Mon, Aug 4, 2008 at 3:53 PM, Kevin [EMAIL PROTECTED] wrote:

Is it possible to get this information in a user-defined map function? I.e., how do we get the JobConf object in the map() function? Another way is to subclass RecordReader to embed the file name in the data, which does not look simple.

-Kevin

On Sun, Aug 3, 2008 at 10:17 PM, Amareshwari Sriramadasu [EMAIL PROTECTED] wrote:

You can get the file name accessed by the mapper using the config property map.input.file.

Thanks,
Amareshwari

Deyaa Adranale wrote:

Hi,

I need to know, inside my mapper, the name of the file that contains the current record. I saw that I can access the names of the input directories inside the mapper's configuration, but my input contains different files and I need to know the name of the current one. Any hints?

thanks in advance,
Deyaa
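[Editor's note] A minimal sketch of that approach with the old org.apache.hadoop.mapred API: configure() receives the JobConf, where the framework exposes map.input.file for file-based splits (as Amareshwari noted below). The key/value types are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String inputFile = "unknown";

  @Override
  public void configure(JobConf job) {
    // Set by the framework for file-based input formats.
    inputFile = job.get("map.input.file", "unknown");
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Tag every record with the file it came from.
    out.collect(new Text(inputFile), line);
  }
}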
Re: data partitioning question
Thanks, Qin. It sounds like you're saying that this type of partitioning needs its own map-reduce step. I was hoping it could be done in the InputFormat class :))

Shirley

On Aug 4, 2008, at 2:49 PM, Qin Gao wrote:

For the first question, I think it is better to do this at the reduce stage, because the partitioning of the input only considers the size of blocks in bytes, not the values of the records. Instead, you can output the intermediate key/value pairs like this:

key: 1 if C = 1, 3, 5, or 7; 0 otherwise
value: the tuple

In the reducer, you can then have one reducer deal with all the keys where C = 1, 3, 5, or 7.

On Mon, Aug 4, 2008 at 3:29 PM, Shirley Cohen [EMAIL PROTECTED] wrote:

Hi,

I want to implement some data partitioning logic where a mapper is assigned a specific range of values. Here is a concrete example of what I have in mind. Suppose I have attributes A, B, C and the following tuples:

(A, B, C)
(1, 3, 1)
(1, 2, 2)
(1, 2, 3)
(12, 3, 4)
(12, 2, 5)
(12, 8, 6)
(12, 2, 7)

What I want to do is assign mapper x all the tuples where the C attribute is 1, 3, 5, or 7.

1) Is it possible to write a smart InputFormat class that can assign a set of records to a specific mapper? If so, how?
2) How will this type of partitioning logic interact with HDFS data locality?

Thanks,
Shirley
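[Editor's note] For what it's worth, a filtering variant can be done in an InputFormat. The sketch below — an assumption-laden illustration, not a standard recipe — subclasses TextInputFormat so its RecordReader silently skips tuples whose C attribute (third comma-separated field, assumed) is not in the target set. Splits are unchanged, so HDFS data locality behaves exactly as with plain TextInputFormat.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class FilteringInputFormat extends TextInputFormat {

  private static final Set<String> TARGET_C =
      new HashSet<String>(Arrays.asList("1", "3", "5", "7"));

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<LongWritable, Text> inner =
        super.getRecordReader(split, job, reporter);

    // Wrap the plain line reader so next() skips non-matching tuples.
    return new RecordReader<LongWritable, Text>() {
      public boolean next(LongWritable key, Text value) throws IOException {
        while (inner.next(key, value)) {
          String[] fields = value.toString().split(",");
          if (fields.length > 2 && TARGET_C.contains(fields[2].trim())) {
            return true;
          }
        }
        return false;
      }
      public LongWritable createKey() { return inner.createKey(); }
      public Text createValue() { return inner.createValue(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}

Note this filters rather than routes: every mapper still reads its whole split and discards non-matching tuples, so for truly sending disjoint record sets to different consumers, Qin's map-then-partition approach remains the simpler fit.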