Re: next gen map reduce
On Thu, 28 Jul 2011 06:13:01 -0700 Thomas Graves tgra...@yahoo-inc.com wrote: It's currently still on the MR-279 branch - http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/. It is planned to be merged to trunk soon. Tom On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com wrote: In which Hadoop version is next gen introduced? Hi, what exactly is contained within this mysterious-sounding next generation MRv2? What's it about? Dieter
RE: Moving Files to Distributed Cache in MapReduce
Yeah, I'll write something up and post it on my web site. Definitely not InfoQ stuff, but simple tips-and-tricks stuff. -Mike Subject: Re: Moving Files to Distributed Cache in MapReduce From: a...@apache.org Date: Sun, 31 Jul 2011 19:21:14 -0700 To: common-user@hadoop.apache.org We really need to add a working example to the wiki and add a link from the FAQ page. Any volunteers? On Jul 29, 2011, at 7:49 PM, Michael Segel wrote: Here's the meat of my post earlier... Sample code for putting a file on the cache:

DistributedCache.addCacheFile(new URI(path + myFileName), conf);

Sample code for pulling data off the cache:

private Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
boolean exitProcess = false;
int i = 0;
while (!exitProcess) {
    fileName = localFiles[i].getName();
    if (fileName.equalsIgnoreCase("model.txt")) {
        // Build your input file reader on localFiles[i].toString()
        exitProcess = true;
    }
    i++;
}

Note that this is SAMPLE code. I didn't trap the exit condition if the file isn't there and you go beyond the size of the array localFiles[]. Also I initialized exitProcess to false because it's easier to read the loop as "do this until the condition exitProcess is true." When you build your file reader you need the full path, not just the file name. The path will vary when the job runs. HTH -Mike From: michael_se...@hotmail.com To: common-user@hadoop.apache.org Subject: RE: Moving Files to Distributed Cache in MapReduce Date: Fri, 29 Jul 2011 21:43:37 -0500 I could have sworn that I gave an example earlier this week on how to push and pull stuff from the distributed cache. Date: Fri, 29 Jul 2011 14:51:26 -0700 Subject: Re: Moving Files to Distributed Cache in MapReduce From: rogc...@ucdavis.edu To: common-user@hadoop.apache.org JobConf is deprecated in 0.20.2, I believe; you're supposed to be using Configuration for that. On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Is this what you are looking for?
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html search for JobConf On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote: Thanks for the response! However, I'm having an issue with this line

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);

because conf has private access in org.apache.hadoop.conf.Configured. On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com wrote: I hope my previous reply helps... On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu wrote: After moving it to the distributed cache, how would I call it within my MapReduce program? On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn mapred.le...@gmail.com wrote: Did you try using the -files option in your hadoop jar command, as in:

/usr/bin/hadoop jar [jar name] [main class name] -files [absolute path of file to be added to distributed cache] [input dir] [output dir]

On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu wrote: Slight modification: I now know how to add files to the distributed cache, which can be done via this command placed in the main or run class:

DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), conf);

However I am still having trouble locating the file in the distributed cache. How do I get the file path of thefile.dat in the distributed cache as a string? I am using Hadoop 0.20.2. On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu wrote: Hi all, Does anybody have examples of how one moves files from the local file structure/HDFS to the distributed cache in MapReduce? A Google search turned up examples in Pig but not MR. -- Roger Chen UC Davis Genome Center
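Pulling the two halves of this thread together, here is a self-contained sketch of the lookup Mike's loop performs: scan the localized cache paths for a file by name and return its full path as a String. It uses plain java.nio.file.Path with hypothetical paths so the logic can run outside a cluster; in a real job the array would come from DistributedCache.getLocalCacheFiles(context.getConfiguration()).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class CacheLookup {
    // Scan the localized cache paths for a file by name and return its full
    // local path as a String, or null if it is not present. Unlike the sample
    // in the thread, running off the end of the array is handled.
    static String findCacheFile(Path[] localFiles, String wanted) {
        for (Path p : localFiles) {
            if (p.getFileName().toString().equalsIgnoreCase(wanted)) {
                return p.toString(); // full path, not just the file name
            }
        }
        return null; // not found: caller must handle this case
    }

    public static void main(String[] args) {
        // Hypothetical localized paths standing in for what the framework
        // would hand back on a task node.
        Path[] cache = {
            Paths.get("/tmp/mapred/local/taskTracker/archive/other.dat"),
            Paths.get("/tmp/mapred/local/taskTracker/archive/model.txt"),
        };
        System.out.println(findCacheFile(cache, "model.txt"));
    }
}
```

The returned string is what you would feed to your file reader, which answers Roger's question about getting the path "as a string."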
Re: next gen map reduce
The jira has more details and an architecture doc attached. https://issues.apache.org/jira/browse/MAPREDUCE-279 Tom On 8/1/11 2:12 AM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: [snip]
Using -libjar option
Hello All, I am new to Hadoop, and I am trying to use the GenericOptionsParser class. In particular, I would like to use the -libjar option to specify additional jar files to include in the classpath. I've created a class that extends Configured and implements Tool:

public class OptionDemo extends Configured implements Tool {
    ...
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        GenericOptionsParser opts = new GenericOptionsParser(conf, args);
        ...
    }
}

However, when I run my code the jar files that I include after -libjar aren't being added to the classpath and I receive an error that certain classes can't be found during the execution of my job. The book Hadoop: The Definitive Guide states: You don't usually use GenericOptionsParser directly, as it's more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally:

public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}

but it still isn't clear to me how the -libjars option is parsed, whether or not I need to explicitly add it to the classpath inside my run method, or where it needs to be placed on the command line. Any advice or sample code on using -libjar would be greatly appreciated. -- Aquil H. Abdullah aquil.abdul...@gmail.com
Re: Using -libjar option
On Mon, 1 Aug 2011 12:11:27 -0400, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: but it still isn't clear to me how the -libjars option is parsed, whether or not I need to explicitly add it to the classpath inside my run method, or where it needs to be placed in the command-line? IIRC it's parsed as a comma-separated list of file paths relative to your current working directory, and the local copies that it makes on each cluster node are automatically added to the tasks' classpaths. Can you give an example of how you're trying to use it?
Re: Using -libjar option
Aquil, On a side note, if you use Tool, GenericOptionsParser is automatically used internally (by ToolRunner), so you don't have to re-parse your args in your run(…) method. What you get as run(args) are the remnant args alone, if your application handles any. It would help, as John pointed out, if you could give your exact invoking CLI command. On Mon, Aug 1, 2011 at 9:41 PM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: [snip] -- Harsh J
Re: Using -libjar option
[See Response Inline] On Mon, Aug 1, 2011 at 12:56 PM, Harsh J ha...@cloudera.com wrote: Aquil, On a side note, if you use Tool, GenericOptionsParser is automatically used internally (by ToolRunner), so you don't have to re-parse your args in your run(…) method. What you get as run(args) are the remnant args alone, if your application handles any. [AA] Thanks for clearing that up! It would help, as John pointed out, if you could give your exact invoking CLI command. [AA] I am currently invoking my application as follows:

hadoop jar /home/test/hadoop/test.option.demo.jar test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar

On Mon, Aug 1, 2011 at 9:41 PM, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: [snip] -- Aquil H. Abdullah aquil.abdul...@gmail.com
Re: Using -libjar option
On Mon, 1 Aug 2011 13:21:27 -0400, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: [AA] I am currently invoking my application as follows: hadoop jar /home/test/hadoop/test.option.demo.jar test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar I believe the problem might be that it's looking for -libjars, not -libjar.
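To illustrate John's point, here is a rough, simplified sketch — not GenericOptionsParser's actual implementation — of how a generic option like -libjars is consumed before the leftover arguments ever reach Tool.run(): the flag's comma-separated value is split into jar paths, while an unrecognized -libjar simply falls through as a remaining argument, which is exactly the symptom in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenericOptsSketch {
    // Jars collected from a recognized "-libjars" flag.
    static List<String> libjars = new ArrayList<>();

    // Consume "-libjars <a.jar,b.jar,...>" and return only the leftover args,
    // roughly what ToolRunner hands to Tool.run(). A lone "-libjar" does NOT
    // match and falls through untouched.
    static String[] parse(String[] args) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-libjars") && i + 1 < args.length) {
                libjars.addAll(Arrays.asList(args[++i].split(",")));
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The typo from the thread: "-libjar" is not recognized and survives.
        System.out.println(Arrays.toString(parse(new String[]{"-libjar", "mytestlib.jar"})));
        // The correct spelling: the flag is consumed, only app args remain.
        System.out.println(Arrays.toString(parse(new String[]{"-libjars", "a.jar,b.jar", "in", "out"})));
        System.out.println(libjars);
    }
}
```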
Mappers fail to initialize and are killed after 600 seconds
Hi all, I'm running a simple MapReduce job that connects to an HBase table, reads each row, counts some co-occurrence frequencies, and writes everything out to HDFS at the end. Everything seems to be going smoothly until the last 5, out of 108, tasks run. The last 5 tasks seem to be stuck initializing. As far as I can tell, setup is never called, and eventually, after 600 seconds, the task is killed. The task jumps around different nodes to try and run, but regardless of the node, it fails to initialize and is killed. My first guess is that it's trying to connect to an HBase region server and failing, but I don't see anything like this on the task tracker nodes. Here are the log lines related to one of the failed tasks from the task tracker's logs:

2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's state:UNASSIGNED
2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:01:12,243 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201107281508_0028_m_-1189914759 given task: attempt_201107281508_0028_m_27_0
2011-08-01 12:11:09,462 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201107281508_0028_m_27_0: Task attempt_201107281508_0028_m_27_0 failed to report status for 600 seconds. Killing!
2011-08-01 12:11:09,467 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_201107281508_0028_m_27_0
2011-08-01 12:11:14,488 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201107281508_0028_m_27_0 done; removing files.
2011-08-01 12:11:14,489 INFO org.apache.hadoop.mapred.IndexCache: Map ID attempt_201107281508_0028_m_27_0 not found in cache
2011-08-01 12:11:14,495 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's state:FAILED_UNCLEAN
2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:11:15,045 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201107281508_0028_m_-1869983962 given task: attempt_201107281508_0028_m_27_0
2011-08-01 12:11:15,346 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201107281508_0028_m_27_0 0.0%
2011-08-01 12:11:15,348 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201107281508_0028_m_27_0 0.0% cleanup
2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201107281508_0028_m_27_0 is done.
2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201107281508_0028_m_27_0 was -1
2011-08-01 12:11:15,354 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201107281508_0028_m_27_0 done; removing files.
2011-08-01 12:11:17,495 INFO org.apache.hadoop.mapred.TaskRunner: attempt_201107281508_0028_m_27_0 done; removing files.

And here are the syslog lines: In my job, I set the stats when I enter and exit setup, and I set counters in map. None of these are triggered for this task. Nothing is written to stderr or stdout, and the syslogs for the task have nothing beyond the ZooKeeper client connection lines. Any thoughts as to what might be causing this issue? Is there another log that indicates which region server this task is trying to connect to? Thanks! --Keith Stevens
Re: Using -libjar option
Don't I feel sheepish... OK, so I've hacked the sample code below from the ConfigurationPrinter example in Hadoop: The Definitive Guide. If -libjars had been added to the configuration I would expect to see it when I iterate over the URLs; however, I see it as one of the remaining options:

***OUTPUT***
remaining args -libjars
remaining args C:\Apps\mahout-distribution-0.5\mahout-core-0.5.jar

*** [Source Code]

package test.option.demo;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.util.*;
import java.net.URL;
// import java.util.*;
// import java.util.Map.Entry;

public class OptionDemo extends Configured implements Tool {
    static {
        Configuration.addDefaultResource("hdfs-default.xml");
        Configuration.addDefaultResource("hdfs-site.xml");
        Configuration.addDefaultResource("mapred-default.xml");
        Configuration.addDefaultResource("mapred-site.xml");
    }

    @Override
    public int run(String[] args) throws Exception {
        GenericOptionsParser opt = new GenericOptionsParser(args);
        Configuration conf = opt.getConfiguration();
        // for (Entry<String, String> entry : conf) {
        //     System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
        // }
        for (int i = 0; i < args.length; i++) {
            System.out.printf("remaining args %s\n", args[i]);
        }
        URL[] urls = GenericOptionsParser.getLibJars(conf);
        if (urls != null) {
            for (int j = 0; j < urls.length; j++) {
                System.out.printf("url[%d] %s\n", j, urls[j].toString());
            }
        } else {
            System.out.println("No libraries added to configuration");
        }
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new OptionDemo(), args);
        System.exit(exitCode);
    }
}

On Mon, Aug 1, 2011 at 2:17 PM, John Armstrong john.armstr...@ccri.com wrote: On Mon, 1 Aug 2011 13:21:27 -0400, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: [AA] I am currently invoking my application as follows: hadoop jar /home/test/hadoop/test.option.demo.jar test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar I believe the problem might be that it's looking for -libjars, not -libjar. -- Aquil H. Abdullah aquil.abdul...@gmail.com
Re: Using -libjar option
On Mon, 1 Aug 2011 15:30:49 -0400, Aquil H. Abdullah aquil.abdul...@gmail.com wrote: Don't I feel sheepish... Happens to the best, or so they tell me. OK, so I've hacked this sample code below, from the ConfigurationPrinter example in Hadoop: The Definitive Guide. If -libjars had been added to the configuration I would expect to see it when I iterate over the urls, however I see it as one of the remaining options: It might help you to read over the source code of the ToolRunner class. I know it did for me.
Re: Mappers fail to initialize and are killed after 600 seconds
Are there no userlogs from the failed tasks? TaskTracker logs won't carry user-code (task) logs. Could you paste those syslog lines (from the task) to pastebin/etc., since the lists may not be accepting attachments? On Tue, Aug 2, 2011 at 12:51 AM, Stevens, Keith D. steven...@llnl.gov wrote: [snip] -- Harsh J
Re: Mappers fail to initialize and are killed after 600 seconds
In short, there are no userlogs. stderr and stdout are both empty. I copied the output from syslog to the following pastebin: http://pastebin.com/0XXE9Jze. The first 22 lines look to be exactly the same as the syslogs for other, non-dying tasks. The main departure is on line 23, where the loader can't seem to load the native-hadoop libraries, and this happens about 10 minutes after starting up. --Keith On Aug 1, 2011, at 1:00 PM, Harsh J wrote: [snip]
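For what it's worth, the "failed to report status for 600 seconds" kill in these logs corresponds to the mapred.task.timeout property (in milliseconds, default 600000). If a long initialization turns out to be legitimate rather than a hang, one stopgap is to raise it in mapred-site.xml; the value below is just an example, and a task that is genuinely making progress should instead report status (e.g. via context.progress()) so the default timeout never fires.

```xml
<!-- mapred-site.xml: governs how long a task may go without reporting
     status before the TaskTracker kills it. Raising it only buys time;
     it does not fix a task that is actually stuck. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes instead of the default 10 -->
</property>
```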
RE: Hadoop-streaming using binary executable c program
Hi Bobby, I have written a small Perl script which does the following job. Assume we have an output from the mapper:

MAP1 RNA-1 STRUCTURE-1
MAP2 RNA-2 STRUCTURE-2
MAP3 RNA-3 STRUCTURE-3

and what the script does is reduce it in the following manner:

RNA-1RNA-2RNA-3\tSTRUCTURE-1STRUCTURE-2STRUCTURE-3\n

and the script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;
# Open one read handle per input file named on the command line.
my @handles = map { open my $h, '<', $_; $h } @ARGV;
while (@handles) {
    @handles = grep { ! eof $_ } @handles;
    my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
    print join(' ', @lines), "\n";
}
close $_ for @handles;

This should work for any inputs from the mapper. But after I used Hadoop Streaming and put the above code in as my reducer, the job was successful but the output files were empty, and I couldn't find out why.

bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -reducer ./reducer.pl -file /data/yehdego/hadoop-0.20.2/reducer.pl -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output /user/yehdego/RFR2-out -verbose

Any help or suggestion is really appreciated. I am just stuck here for the weekend. Regards, Daniel T. Yehdego Computational Science Program University of Texas at El Paso, UTEP dtyehd...@miners.utep.edu From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Thu, 28 Jul 2011 07:12:11 -0700 Subject: Re: Hadoop-streaming using binary executable c program I am not completely sure what you are getting at. It looks like the output of your C program is (and this is just a guess) as follows. NOTE: \t stands for the tab character, and in streaming it is used to separate the key from the value; \n stands for carriage return and is used to separate individual records.

RNA-1\tSTRUCTURE-1\n
RNA-2\tSTRUCTURE-2\n
RNA-3\tSTRUCTURE-3\n
...

And you want the output to look like

RNA-1RNA-2RNA-3\tSTRUCTURE-1STRUCTURE-2STRUCTURE-3\n

You could use a reduce to do this, but the issue here is with the shuffle in between the maps and the reduces. The shuffle will group by the key to send to the reducers and then sort by the key. So in reality your map output looks something like

FROM MAP 1: RNA-1\tSTRUCTURE-1\n RNA-2\tSTRUCTURE-2\n
FROM MAP 2: RNA-3\tSTRUCTURE-3\n RNA-4\tSTRUCTURE-4\n
FROM MAP 3: RNA-5\tSTRUCTURE-5\n RNA-6\tSTRUCTURE-6\n

If you send it to a single reducer (the only way to get a single file), then the input to the reducer will be sorted alphabetically by the RNA, and the order of the input will be lost. You can work around this by giving each line a unique number that is in the order you want it to be output, but doing this would require you to write some code. I would suggest that you do it with a small shell script after all the maps have completed, to splice them together. -- Bobby On 7/27/11 2:55 PM, Daniel Yehdego dtyehd...@miners.utep.edu wrote: Hi Bobby, I just want to ask you if there is a way of using a reducer or something like concatenation to glue my outputs from the mapper and output them as a single file and segment of the predicted RNA 2D structure? FYI: I have used -reducer NONE before:

HADOOP_HOME$ bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output /user/yehdego/RF-out -reducer NONE -verbose

and a sample of my output using the mapper on two different slave nodes looks like this:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
[...(((...))).]. (-13.46)
GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
.(((.((......).. (-11.00)

and I want to concatenate and output them as a single predicted RNA sequence structure:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
[...(((...))).]..(((.((......)..

Regards, Daniel T. Yehdego Computational Science Program University of Texas at El Paso, UTEP dtyehd...@miners.utep.edu From: dtyehd...@miners.utep.edu To: common-user@hadoop.apache.org Subject: RE: Hadoop-streaming using binary executable c program Date: Tue, 26 Jul 2011 16:23:10 + Good afternoon Bobby, Thanks so much, now it's working excellently. And the speed is also reasonable. Once again, thank you. Regards, Daniel T. Yehdego Computational
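A minimal standalone sketch of Bobby's numbering suggestion, assuming a hypothetical record format of "seq#\trna\tstructure" tagged by the mapper (this format is my assumption, not part of the thread): the reduce side can sort by the tag and splice the pieces back into one sequence/structure pair regardless of the shuffle's arrival order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class SpliceSegments {
    // Splice tagged segments back into one "RNA\tSTRUCTURE" record.
    // Each input line is assumed to be "<seq#>\t<rna>\t<structure>";
    // the TreeMap restores the original segment order by sequence number.
    static String splice(List<String> taggedLines) {
        TreeMap<Integer, String[]> ordered = new TreeMap<>();
        for (String line : taggedLines) {
            String[] f = line.split("\t");
            ordered.put(Integer.parseInt(f[0]), new String[]{f[1], f[2]});
        }
        StringBuilder rna = new StringBuilder();
        StringBuilder struct = new StringBuilder();
        for (String[] v : ordered.values()) {
            rna.append(v[0]);
            struct.append(v[1]);
        }
        return rna + "\t" + struct;
    }

    public static void main(String[] args) {
        // Segments arrive in shuffle order 2, 1 but are spliced back as 1, 2.
        List<String> in = Arrays.asList("2\tGGGA\t.(((", "1\tAUAC\t[...");
        System.out.println(splice(in));
    }
}
```

In a real streaming job the same logic would live in the reducer script; this version just demonstrates that the numbering makes the final order independent of how the shuffle delivers the segments.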
RE: Hadoop cluster network requirement
Yeah, what he said. It's never a good idea. Forget about losing a NN or a rack; just losing connectivity between data centers (it happens more than you think) takes your entire cluster in both data centers down. Boom! It's a bad design. You're better off doing two different clusters. Is anyone really trying to sell this as a design? That's even more scary. Subject: Re: Hadoop cluster network requirement From: a...@apache.org Date: Sun, 31 Jul 2011 20:28:53 -0700 To: common-user@hadoop.apache.org; saq...@margallacomm.com On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote: Thanks, I'm independently doing some digging into Hadoop networking requirements and had a couple of quick follow-ups. Could I have some specific info on why different data centers cannot be supported for master node and data node comms? Also, what may be the benefits/use cases for such a scenario? Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems:

1) no guarantee that one block replica will be in each data center (thereby defeating the whole purpose!)
2) assuming one can work out problem 1, during a network break the NN will lose contact with one half of the DNs, causing a massive network replication storm
3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and is going to cause delays for the DN heartbeats
4) I don't even want to think about rebalancing.

... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it. If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
How to access contents of a Map Reduce job's working directory
I have just started to explore Hadoop, but I am stuck in a situation now. I want to run a MapReduce job in Hadoop which needs to create a setup folder in its working directory. During execution the job will generate some additional text files within this setup folder. The problem is that I don't know how to access or move this setup folder's contents to my local file system, since at the end of the job the job directory will be cleaned up. It would be great if you could help. Regards Shrish
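One way to approach this (a sketch, not from the thread): files written under the task's work output path survive the cleanup, because the output committer promotes them into the job's output directory on commit; anything written elsewhere in the task's scratch space is deleted. The directory and file names below ("setup", "extra.txt") are illustrative, not taken from the original question.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFiles {

    // Inside a Mapper or Reducer: create the extra file under the task's
    // work output path so it is promoted to the job output directory.
    public static void writeSideFile(TaskInputOutputContext<?, ?, ?, ?> context)
            throws IOException {
        Path workDir = FileOutputFormat.getWorkOutputPath(context);
        Path sideFile = new Path(new Path(workDir, "setup"), "extra.txt");
        FileSystem fs = sideFile.getFileSystem(context.getConfiguration());
        FSDataOutputStream out = fs.create(sideFile);
        out.writeBytes("generated during the job\n");
        out.close();
    }

    // Driver side, after job.waitForCompletion(true) returns: copy the
    // promoted files from the job output directory to the local FS.
    public static void fetchResults(Configuration conf, Path jobOutput, Path localDir)
            throws IOException {
        FileSystem fs = jobOutput.getFileSystem(conf);
        fs.copyToLocalFile(new Path(jobOutput, "setup"), localDir);
    }
}
```

This needs a running job to exercise, so treat it as a pattern rather than drop-in code; the key point is writing under `FileOutputFormat.getWorkOutputPath` rather than a bare local path.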
Re: Hadoop cluster network requirement
Assuming everything is up, this solution still will not scale given the latency, TCP/IP buffers, sliding windows, etc. See BDP (bandwidth-delay product). Sent from my iPad On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote: [full quote of the previous message in this thread elided]
Hive-HBase Integration Jar Question
Hi, I am using hive-hbase-handler-0.7.0-cdh3u0.jar (under hive-0.7.0-cdh3u0/lib) thrift-fb303-0.5.0.jar (under hive-0.7.0-cdh3u0/lib) thrift-0.2.0.jar (under hbase-0.90.1-cdh3u0/lib) in my project. We use Maven; could anyone please tell me where I can get the pom information for these jars? -- Thank you! Neerja
Re: Hive-HBase Integration Jar Question
In our case we have our own maven repo where we uploaded these jars. You can also install it in your local repo from the command line if you don't have your own maven repo. On Aug 2, 2011 7:00 AM, Neerja Bhatnagar bnee...@gmail.com wrote: Hi, I am using hive-hbase-handler-0.7.0-cdh3u0.jar (under hive-0.7.0-cdh3u0/lib) thrift-fb303-0.5.0.jar (under hive-0.7.0-cdh3u0/lib) thrift-0.2.0.jar (under hbase-0.90.1-cdh3u0/lib) in my project. We use Maven; could anyone please tell me where I can get the pom information for these jars? -- Thank you! Neerja
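For reference, a hedged sketch of the command-line approach mentioned above (the jar name and version come from the question; the groupId/artifactId values are guesses for illustration, since these CDH jars may not have published POMs):

```shell
# Install a vendor jar into the local ~/.m2 repository so it can be
# declared like any other Maven dependency. Adjust groupId/artifactId
# to your own convention; repeat for each of the three jars.
mvn install:install-file \
  -Dfile=hive-hbase-handler-0.7.0-cdh3u0.jar \
  -DgroupId=org.apache.hive \
  -DartifactId=hive-hbase-handler \
  -Dversion=0.7.0-cdh3u0 \
  -Dpackaging=jar
```

After installing, reference the artifact in your POM with the same groupId, artifactId, and version you chose here.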
Max Number of Open Connections
Hi, What is the max number of open connections to a namenode? I am using FSDataOutputStream out = dfs.create(src); Cheers, JD
Re: maprd vs mapreduce api
Your reducer is writing IntWritable but your output format class is still Text. Change one of those so they match the other. On Mon, Aug 1, 2011 at 8:40 PM, garpinc garp...@hotmail.com wrote: I was following this tutorial on version 0.19.1 http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html I however wanted to use the latest version of the API, 0.20.2. The original code in the tutorial had the following lines: conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class); conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class); Both Identity classes are deprecated, so the solution seemed to be to create the mapper and reducer as follows: public static class NOOPMapper extends Mapper<Text, IntWritable, Text, IntWritable> { public void map(Text key, IntWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class NOOPReducer extends Reducer<Text, IntWritable, Text, IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { context.write(key, result); } } And then with this code: Configuration conf = new Configuration(); Job job = new Job(conf, "testdriver"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path("In")); FileOutputFormat.setOutputPath(job, new Path("Out")); job.setMapperClass(NOOPMapper.class); job.setReducerClass(NOOPReducer.class); job.waitForCompletion(true); However, I get this message: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text at TestDriver$NOOPMapper.map(TestDriver.java:1) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) 11/08/01 16:41:01 INFO mapred.JobClient: map 0% reduce 0% 11/08/01 16:41:01 INFO mapred.JobClient: Job complete: job_local_0001 11/08/01 16:41:01 INFO mapred.JobClient: Counters: 0 Can anyone tell me what I need for this to work? Attached is the full code: http://old.nabble.com/file/p32174859/TestDriver.java TestDriver.java -- View this message in context: http://old.nabble.com/maprd-vs-mapreduce-api-tp32174859p32174859.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Roger Chen UC Davis Genome Center
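For what it's worth, the stack trace points at the map input side rather than the output side: TextInputFormat hands the mapper LongWritable byte offsets as keys and Text lines as values, so a mapper declared `Mapper<Text, IntWritable, ...>` fails a cast on the first record. A sketch of a pass-through mapper whose input types match TextInputFormat (the class name follows the original post; emitting the line with a count of 1 is an illustrative choice that keeps the job's Text/IntWritable output declarations as posted):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TestDriverFix {
    // TextInputFormat supplies (LongWritable byteOffset, Text line) pairs,
    // so the mapper's first two type parameters must be LongWritable, Text.
    public static class NOOPMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the input line as the key, matching the reducer's
            // declared (Text, IntWritable) input types.
            context.write(value, ONE);
        }
    }
}
```

With this mapper, the reducer and driver from the post can stay as written, since the map output types already equal the job's final output types (Text, IntWritable).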
The best architecture for EC2/Hadoop interface?
Hi, I want to give my users a GUI that would allow them to start Hadoop clusters and run applications that I will provide on the AMIs. What would be a good approach to make it simple for the user? Should I write a Java Swing app that wraps around the EC2 commands? Should I use some more direct EC2 API? Or should I use a web browser interface? My idea was to give the user a Java Swing GUI, so that the user enters his Amazon credentials into it; this would be secure because the application is not exposed to the outside. Does this approach make sense? Thank you, Mark My project for which I want to do it: https://github.com/markkerzner/FreeEed