Re: next gen map reduce

2011-08-01 Thread Dieter Plaetinck
On Thu, 28 Jul 2011 06:13:01 -0700
Thomas Graves tgra...@yahoo-inc.com wrote:

 Its currently still on the MR279 branch -
 http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/.  It is
 planned to be merged to trunk soon.
 
 Tom
 
 On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com
 wrote:
 
  In which Hadoop version is next gen introduced?
 

Hi,
what exactly is contained within this mysterious-sounding next-generation
MRv2? What's it about?

Dieter


RE: Moving Files to Distributed Cache in MapReduce

2011-08-01 Thread Michael Segel

Yeah,

I'll write something up and post it on my web site. Definitely not InfoQ material, 
just a simple tips-and-tricks piece.

-Mike


 Subject: Re: Moving Files to Distributed Cache in MapReduce
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 19:21:14 -0700
 To: common-user@hadoop.apache.org
 
 
 We really need to add a working example to the wiki and link to it from the 
 FAQ page.  Any volunteers?
 
 On Jul 29, 2011, at 7:49 PM, Michael Segel wrote:
 
  
   Here's the meat of my post earlier...
   Sample code for putting a file on the cache:
   DistributedCache.addCacheFile(new URI(path + MyFileName), conf);
   
   Sample code for pulling data off the cache:
   private Path[] localFiles = 
       DistributedCache.getLocalCacheFiles(context.getConfiguration());
   boolean exitProcess = false;
   int i = 0;
   while (!exitProcess) { 
       fileName = localFiles[i].getName();
       if (fileName.equalsIgnoreCase("model.txt")) {
           // Build your input file reader on localFiles[i].toString() 
           exitProcess = true;
       }
       i++;
   } 
  
  
   Note that this is SAMPLE code. I didn't trap the exit condition for the case 
   where the file isn't there and you run past the end of the localFiles[] array.
   Also, I initialize exitProcess to false because it's easier to read the loop as 
   "do this until the condition exitProcess is true."
  
  When you build your file reader you need the full path, not just the file 
  name. The path will vary when the job runs.
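   If it helps, here is a fuller, self-contained sketch of the same pattern. It is 
   only an illustration: the class, job, and path names (CacheExample, 
   /user/hadoop/model.txt) are made up, and it assumes the 0.20 "mapreduce" API 
   where setup() receives a Context.
   
   import java.io.BufferedReader;
   import java.io.FileReader;
   import java.io.IOException;
   import java.net.URI;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.filecache.DistributedCache;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
   
   public class CacheExample {
   
       public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
           private String sideData = "";
   
           @Override
           protected void setup(Context context) throws IOException, InterruptedException {
               // Locate the local copy of model.txt that the framework shipped to this node.
               Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
               if (localFiles == null) return;
               for (Path p : localFiles) {
                   if (p.getName().equalsIgnoreCase("model.txt")) {
                       BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                       String line = reader.readLine();   // read whatever side data you need
                       if (line != null) sideData = line;
                       reader.close();
                   }
               }
           }
   
           @Override
           protected void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
               // Tag every input line with something taken from the cached file.
               context.write(value, new Text(sideData));
           }
       }
   
       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // Add the HDFS file to the cache *before* the Job copies the configuration.
           DistributedCache.addCacheFile(new URI("/user/hadoop/model.txt"), conf);
           Job job = new Job(conf, "cache-example");
           job.setJarByClass(CacheExample.class);
           job.setMapperClass(CacheMapper.class);
           job.setNumReduceTasks(0);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(Text.class);
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }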
  
  HTH
  
  -Mike
  
  
  From: michael_se...@hotmail.com
  To: common-user@hadoop.apache.org
  Subject: RE: Moving Files to Distributed Cache in MapReduce
  Date: Fri, 29 Jul 2011 21:43:37 -0500
  
  
  I could have sworn that I gave an example earlier this week on how to push 
  and pull stuff from distributed cache.
  
  
  Date: Fri, 29 Jul 2011 14:51:26 -0700
  Subject: Re: Moving Files to Distributed Cache in MapReduce
  From: rogc...@ucdavis.edu
  To: common-user@hadoop.apache.org
  
  jobConf is deprecated in 0.20.2 I believe; you're supposed to be using
  Configuration for that
  
  On Fri, Jul 29, 2011 at 1:59 PM, Mohit Anchlia 
  mohitanch...@gmail.comwrote:
  
  Is this what you are looking for?
  
  http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
  
  search for jobConf
  
  On Fri, Jul 29, 2011 at 1:51 PM, Roger Chen rogc...@ucdavis.edu wrote:
  Thanks for the response! However, I'm having an issue with this line
  
  Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
  
   because conf has private access in org.apache.hadoop.conf.Configured
  
  On Fri, Jul 29, 2011 at 11:18 AM, Mapred Learn mapred.le...@gmail.com
  wrote:
  
  I hope my previous reply helps...
  
  On Fri, Jul 29, 2011 at 11:11 AM, Roger Chen rogc...@ucdavis.edu
  wrote:
  
  After moving it to the distributed cache, how would I call it within
  my
  MapReduce program?
  
  On Fri, Jul 29, 2011 at 11:09 AM, Mapred Learn 
  mapred.le...@gmail.com
  wrote:
  
   Did you try using the -files option in your hadoop jar command, as in:
   
   /usr/bin/hadoop jar <jar name> <main class name> -files <absolute path of 
   file to be added to distributed cache> <input dir> <output dir>
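   For example (the jar, class, and paths here are only illustrative):
   
   /usr/bin/hadoop jar myjob.jar com.example.MyJob -files /home/hadoop/thefile.dat /user/hadoop/input /user/hadoop/output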
  
  
  On Fri, Jul 29, 2011 at 11:05 AM, Roger Chen rogc...@ucdavis.edu
  wrote:
  
  Slight modification: I now know how to add files to the
  distributed
  file
  cache, which can be done via this command placed in the main or
  run
  class:
  
   DistributedCache.addCacheFile(new URI("/user/hadoop/thefile.dat"), conf);
  
  However I am still having trouble locating the file in the
  distributed
  cache. *How do I call the file path of thefile.dat in the
  distributed
  cache
  as a string?* I am using Hadoop 0.20.2
  
  
  On Fri, Jul 29, 2011 at 10:26 AM, Roger Chen rogc...@ucdavis.edu
  
  wrote:
  
  Hi all,
  
  Does anybody have examples of how one moves files from the local
  filestructure/HDFS to the distributed cache in MapReduce? A
  Google
  search
  turned up examples in Pig but not MR.
  
  --
  Roger Chen
  UC Davis Genome Center
  
  
  
  
  --
  Roger Chen
  UC Davis Genome Center
  
  
  
  
  
  --
  Roger Chen
  UC Davis Genome Center
  
  
  
  
  
  --
  Roger Chen
  UC Davis Genome Center
  
  
  
  
  
  -- 
  Roger Chen
  UC Davis Genome Center
   

 
  

Re: next gen map reduce

2011-08-01 Thread Thomas Graves
The jira has more details and an architecture doc attached.

https://issues.apache.org/jira/browse/MAPREDUCE-279

Tom


On 8/1/11 2:12 AM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be
wrote:

 On Thu, 28 Jul 2011 06:13:01 -0700
 Thomas Graves tgra...@yahoo-inc.com wrote:
 
 Its currently still on the MR279 branch -
 http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/.  It is
 planned to be merged to trunk soon.
 
 Tom
 
 On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com
 wrote:
 
 In which Hadoop version is next gen introduced?
 
 
 Hi,
 what exactly is contained within this next generation mysterious
 sounding MRV2? What's it about?
 
 Dieter



Using -libjar option

2011-08-01 Thread Aquil H. Abdullah
Hello All,

I am new to Hadoop, and I am trying to use the GenericOptionsParser class.
In particular, I would like to use the -libjar option to specify additional
jar files to include in the classpath. I've created a class that extends
Configured and implements Tool:

public class OptionDemo extends Configured implements Tool
{
    ...

    public int run(String[] args) throws Exception
    {
        Configuration conf = getConf();

        GenericOptionsParser opts = new GenericOptionsParser(conf, args);

        ...
    }
}


However, when I run my code the jar files that I include after -libjar
aren't being added to the classpath and I receive an error that certain
classes can't be found during the execution of my job.

The book Hadoop: The Definitive Guide states:

You don’t usually use GenericOptionsParser directly, as it’s more convenient
to implement the Tool interface and run your application with the
ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
int run(String [] args) throws Exception;
}

but it still isn't clear to me how the -libjars option is parsed, whether or
not I need to explicitly add it to the classpath inside my run method, or
where it needs to be placed on the command line. Any advice or sample code
on using -libjar would be greatly appreciated.

-- 
Aquil H. Abdullah
aquil.abdul...@gmail.com


Re: Using -libjar option

2011-08-01 Thread John Armstrong
On Mon, 1 Aug 2011 12:11:27 -0400, Aquil H. Abdullah
aquil.abdul...@gmail.com wrote:
 but it still isn't clear to me how the -libjars option is parsed,
whether
 or
 not I need to explicitly add it to the classpath inside my run method,
or
 where it needs to be placed in the command-line?

IIRC it's parsed as a comma-separated list of file paths relative to your
current working directory, and the local copies that it makes on each
cluster node are automatically added to the tasks' classpaths.

Can you give an example of how you're trying to use it?


Re: Using -libjar option

2011-08-01 Thread Harsh J
Aquil,

On a side-note, if you use Tool, GenericOptsParser is automatically
used internally (by ToolRunner), so you don't have to re-parse your
 args in your run(…) method. What you get in run(args) are just the remaining
 args, if your application handles any.
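For example, a bare-bones skeleton along these lines (the driver and job names
are only illustrative) lets ToolRunner strip -libjars/-D/-files before run() is
entered, so the Configuration it hands you already carries them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects any -libjars/-D/-files options that
        // ToolRunner's GenericOptionsParser consumed; args holds only the leftovers.
        Configuration conf = getConf();
        Job job = new Job(conf, "option-demo");
        job.setJarByClass(MyDriver.class);
        // ... mapper/reducer, input/output paths taken from args ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}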

It would help, as John pointed out, if you could give your exact
invoking CLI command.

On Mon, Aug 1, 2011 at 9:41 PM, Aquil H. Abdullah
aquil.abdul...@gmail.com wrote:
 Hello All,

 I am new to Hadoop, and I am trying to use the GenericOptionsParser Class.
 In particular, I would like to use the -libjar option to specify additional
 jar files to include in the classpath. I've created a class that extends
 Configured and Implements Tool:

 *public class* OptionDemo *extends* Configured *implements* Tool

 {

    ...

 *    public int* run(String[] args) *throws* Exception

    {

        Configuration conf = getConf();

        GenericOptionsParser opts = *new* GenericOptionsParser(conf, args);

        ...

    }

 }


 However, when I run my code the jar files that I include after -libjar
 aren't being added to the classpath and I receive an error that certain
 classes can't be found during the execution of my job.

 The book Hadoop: The Definitive Guide states:

 You don’t usually use GenericOptionsParser directly, as it’s more convenient
 to implement the Tool interface and run your application with the
 ToolRunner, which uses GenericOptionsParser internally:
 public interface Tool extends Configurable {
    int run(String [] args) throws Exception;
 }

 but it still isn't clear to me how the -libjars option is parsed, whether or
 not I need to explicitly add it to the classpath inside my run method, or
 where it needs to be placed in the command-line? Any advice or sample code
 on using -libjar would greatly be appreciated.

 --
 Aquil H. Abdullah
 aquil.abdul...@gmail.com




-- 
Harsh J


Re: Using -libjar option

2011-08-01 Thread Aquil H. Abdullah
[See Response Inline]

I've tried invoking getLib
On Mon, Aug 1, 2011 at 12:56 PM, Harsh J ha...@cloudera.com wrote:

 Aquil,

 On a side-note, if you use Tool, GenericOptsParser is automatically
 used internally (by ToolRunner), so you don't have to re-parse your
 args in your run(…) method. What you get as run(args) are the remnant
 args alone, if your application handles any.

[AA] Thanks for clearing that up!


 Would help, as John pointed out, if you could give your exact,
 invoking CLI command.


[AA] I am currently invoking my application as follows:

hadoop jar /home/test/hadoop/test.option.demo.jar
test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar





 On Mon, Aug 1, 2011 at 9:41 PM, Aquil H. Abdullah
 aquil.abdul...@gmail.com wrote:
  Hello All,
 
  I am new to Hadoop, and I am trying to use the GenericOptionsParser
 Class.
  In particular, I would like to use the -libjar option to specify
 additional
  jar files to include in the classpath. I've created a class that extends
  Configured and Implements Tool:
 
  *public class* OptionDemo *extends* Configured *implements* Tool
 
  {
 
 ...
 
  *public int* run(String[] args) *throws* Exception
 
 {
 
 Configuration conf = getConf();
 
 GenericOptionsParser opts = *new* GenericOptionsParser(conf,
 args);
 
 ...
 
 }
 
  }
 
 
  However, when I run my code the jar files that I include after -libjar
  aren't being added to the classpath and I receive an error that certain
  classes can't be found during the execution of my job.
 
  The book Hadoop: The Definitive Guide states:
 
  You don’t usually use GenericOptionsParser directly, as it’s more
 convenient
  to implement the Tool interface and run your application with the
  ToolRunner, which uses GenericOptionsParser internally:
  public interface Tool extends Configurable {
 int run(String [] args) throws Exception;
  }
 
  but it still isn't clear to me how the -libjars option is parsed, whether
 or
  not I need to explicitly add it to the classpath inside my run method, or
  where it needs to be placed in the command-line? Any advice or sample
 code
  on using -libjar would greatly be appreciated.
 
  --
  Aquil H. Abdullah
  aquil.abdul...@gmail.com
 



 --
 Harsh J




-- 
Aquil H. Abdullah
aquil.abdul...@gmail.com


Re: Using -libjar option

2011-08-01 Thread John Armstrong
On Mon, 1 Aug 2011 13:21:27 -0400, Aquil H. Abdullah
aquil.abdul...@gmail.com wrote:
 [AA] I am currently invoking my application as follows:
 
 hadoop jar /home/test/hadoop/test.option.demo.jar
 test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar

I believe the problem might be that it's looking for -libjars, not
-libjar.
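i.e., something like this (same paths as before; use a comma-separated list if
there is more than one jar):

hadoop jar /home/test/hadoop/test.option.demo.jar test.option.demo.OptionDemo -libjars /home/test/hadoop/lib/mytestlib.jar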


Mappers fail to initialize and are killed after 600 seconds

2011-08-01 Thread Stevens, Keith D.
Hi all,

I'm running a simple mapreduce job that connects to an hbase table, reads each 
row, counts some co-occurrence frequencies, and writes everything out to hdfs 
at the end.  Everything seems to be going smoothly until the last 5, out of 
108, tasks run.  The last 5 tasks seem to be stuck initializing.  As far as I 
can tell, setup is never called, and eventually, after 600 seconds, the task is 
killed.  The task jumps around different nodes to try and run but regardless of 
the node, it fails to initialize and is killed.

My first guess is that it's trying to connect to an HBase region server and 
failing, but I don't see anything like this on the TaskTracker nodes.  Here 
are the log lines related to one of the failed tasks from the TaskTracker's 
logs:

2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
state:UNASSIGNED
2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: In 
TaskLauncher, current free slots : 1 and trying to launch 
attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:01:12,243 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: 
jvm_201107281508_0028_m_-1189914759 given task: 
attempt_201107281508_0028_m_27_0
2011-08-01 12:11:09,462 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201107281508_0028_m_27_0: Task attempt_201107281508_0028_m_27_0 
failed to report status for 600 seconds. Killing!
2011-08-01 12:11:09,467 INFO org.apache.hadoop.mapred.TaskTracker: About to 
purge task: attempt_201107281508_0028_m_27_0
2011-08-01 12:11:14,488 INFO org.apache.hadoop.mapred.TaskRunner: 
attempt_201107281508_0028_m_27_0 done; removing files.
2011-08-01 12:11:14,489 INFO org.apache.hadoop.mapred.IndexCache: Map ID 
attempt_201107281508_0028_m_27_0 not found in cache
2011-08-01 12:11:14,495 INFO org.apache.hadoop.mapred.TaskTracker: 
LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
state:FAILED_UNCLEAN
2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: In 
TaskLauncher, current free slots : 1 and trying to launch 
attempt_201107281508_0028_m_27_0 which needs 1 slots
2011-08-01 12:11:15,045 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: 
jvm_201107281508_0028_m_-1869983962 given task: 
attempt_201107281508_0028_m_27_0
2011-08-01 12:11:15,346 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201107281508_0028_m_27_0 0.0% 
2011-08-01 12:11:15,348 INFO org.apache.hadoop.mapred.TaskTracker: 
attempt_201107281508_0028_m_27_0 0.0% cleanup
2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: Task 
attempt_201107281508_0028_m_27_0 is done.
2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: reported 
output size for attempt_201107281508_0028_m_27_0  was -1
2011-08-01 12:11:15,354 INFO org.apache.hadoop.mapred.TaskRunner: 
attempt_201107281508_0028_m_27_0 done; removing files.
2011-08-01 12:11:17,495 INFO org.apache.hadoop.mapred.TaskRunner: 
attempt_201107281508_0028_m_27_0 done; removing files.

And here are the syslog lines:
In my job, I set the stats when I enter and exit setup, and I set counters in 
map.  None of these are triggered for this task.  Nothing is written to stderr 
or stdout, and the syslogs for the task have nothing beyond the zookeeper 
client connection lines.

Any thoughts as to what might be causing this issue?  Is there another log that 
indicates which region server this task is trying to connect to?

Thanks!
--Keith Stevens

Re: Using -libjar option

2011-08-01 Thread Aquil H. Abdullah
Don't I feel sheepish...

OK, so I've hacked together the sample code below from the ConfigurationPrinter
example in Hadoop: The Definitive Guide. If -libjars had been added to the
configuration I would expect to see it when I iterate over the URLs; however,
I see it as one of the remaining options:

***OUTPUT***
remaining args -libjars
remaining args C:\Apps\mahout-distribution-0.5\mahout-core-0.5.jar
***
[Source Code]
package test.option.demo;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.util.*;
// import java.util.*;
import java.net.URL;
// import java.util.Map.Entry;
public class OptionDemo extends Configured implements Tool {
 static
 {
  Configuration.addDefaultResource("hdfs-default.xml");
  Configuration.addDefaultResource("hdfs-site.xml");
  Configuration.addDefaultResource("mapred-default.xml");
  Configuration.addDefaultResource("mapred-site.xml");
 }

 @Override
 public int run(String[] args) throws Exception
 {
  GenericOptionsParser opt = new GenericOptionsParser(args);
  Configuration conf = opt.getConfiguration();
  // for (Entry<String, String> entry : conf)
  // {
  //  System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
  // }

  for (int i = 0; i < args.length; i++)
  {
   System.out.printf("remaining args %s\n", args[i]);
  }

  URL[] urls = GenericOptionsParser.getLibJars(conf);

  if (urls != null)
  {
   for (int j = 0; j < urls.length; j++)
   {
    System.out.printf("url[%d] %s", j, urls[j].toString());
   }
  }
  else
  {
   System.out.println("No libraries added to configuration");
  }

  return 0;
 }

 public static void main(String[] args) throws Exception
 {
  int exitCode = ToolRunner.run(new OptionDemo(), args);
  System.exit(exitCode);
 }
}



On Mon, Aug 1, 2011 at 2:17 PM, John Armstrong john.armstr...@ccri.comwrote:

 On Mon, 1 Aug 2011 13:21:27 -0400, Aquil H. Abdullah
 aquil.abdul...@gmail.com wrote:
  [AA] I am currently invoking my application as follows:
 
  hadoop jar /home/test/hadoop/test.option.demo.jar
  test.option.demo.OptionDemo -libjar /home/test/hadoop/lib/mytestlib.jar

 I believe the problem might be that it's looking for -libjars, not
 -libjar.




-- 
Aquil H. Abdullah
aquil.abdul...@gmail.com


Re: Using -libjar option

2011-08-01 Thread John Armstrong
On Mon, 1 Aug 2011 15:30:49 -0400, Aquil H. Abdullah
aquil.abdul...@gmail.com wrote:
 Don't I feel sheepish...

Happens to the best, or so they tell me.

 OK, so I've hacked this sample code below, from the ConfigurationPrinter
 example in Hadoop: The Definitive Guide. If -libjars had been added to
the
 configuration I would expect to see it when I iterate over the urls,
 however
 I see it as one of the remaining options:

It might help you to read over the source code of the ToolRunner class.  I
know it did for me.


Re: Mappers fail to initialize and are killed after 600 seconds

2011-08-01 Thread Harsh J
Are there no userlogs from the failed tasks? TaskTracker logs won't
carry user-code (task) logs. Could you paste those syslog lines (from
the task) to pastebin/etc. since the lists may not be accepting
attachments?

On Tue, Aug 2, 2011 at 12:51 AM, Stevens, Keith D. steven...@llnl.gov wrote:
 Hi all,

 I'm running a simple mapreduce job that connects to an hbase table, reads 
 each row, counts some co-occurrence frequencies, and writes everything out to 
 hdfs at the end.  Everything seems to be going smoothly until the last 5, out 
 of 108, tasks run.  The last 5 tasks seem to be stuck initializing.  As far 
 as I can tell, setup is never called, and eventually, after 600 seconds, the 
 task is killed.  The task jumps around different nodes to try and run but 
 regardless of the node, it fails to initialize and is killed.

 My first guess is that it's trying to connect to an hbase region server and 
 failing, but I don't see anything like this in the task tracker nodes.  Here 
 are the log lines related to one of the failed tasks from the task trackers 
 logs:

 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: 
 LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
 state:UNASSIGNED
 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
 launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: In 
 TaskLauncher, current free slots : 1 and trying to launch 
 attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:01:12,243 INFO org.apache.hadoop.mapred.TaskTracker: JVM with 
 ID: jvm_201107281508_0028_m_-1189914759 given task: 
 attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:09,462 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0: Task 
 attempt_201107281508_0028_m_27_0 failed to report status for 600 seconds. 
 Killing!
 2011-08-01 12:11:09,467 INFO org.apache.hadoop.mapred.TaskTracker: About to 
 purge task: attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:14,488 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.
 2011-08-01 12:11:14,489 INFO org.apache.hadoop.mapred.IndexCache: Map ID 
 attempt_201107281508_0028_m_27_0 not found in cache
 2011-08-01 12:11:14,495 INFO org.apache.hadoop.mapred.TaskTracker: 
 LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
 state:FAILED_UNCLEAN
 2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
 launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: In 
 TaskLauncher, current free slots : 1 and trying to launch 
 attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:11:15,045 INFO org.apache.hadoop.mapred.TaskTracker: JVM with 
 ID: jvm_201107281508_0028_m_-1869983962 given task: 
 attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:15,346 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0 0.0%
 2011-08-01 12:11:15,348 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0 0.0% cleanup
 2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: Task 
 attempt_201107281508_0028_m_27_0 is done.
 2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: reported 
 output size for attempt_201107281508_0028_m_27_0  was -1
 2011-08-01 12:11:15,354 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.
 2011-08-01 12:11:17,495 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.

 And here are the syslog lines:
 In my job, I set the stats when i enter and exit setup, and I set counters in 
 map.  None of these are triggered for this task.  Nothing is written to 
 stderr or stdout, and the syslogs for the task have nothing beyond the 
 zookeeper client connection lines.

 Any thoughts as to what might be causing this issue?  Is there another log 
 that indicates which region server this task is trying to connect to?

 Thanks!
 --Keith Stevens



-- 
Harsh J


Re: Mappers fail to initialize and are killed after 600 seconds

2011-08-01 Thread Stevens, Keith D.
In short, there are no userlogs.  stderr and stdout are both empty.  I copied 
the output from syslog to the following pastebin: http://pastebin.com/0XXE9Jze. 
 The first 22 lines look to be exactly the same as the syslogs for other, 
non-dying, tasks.   The main departure is on line 23 where the loader can't 
seem to load native-hadoop libraries, and this happens about 10 minutes after 
starting up.

--Keith

On Aug 1, 2011, at 1:00 PM, Harsh J wrote:

 Are there no userlogs from the failed tasks? TaskTracker logs won't
 carry user-code (task) logs. Could you paste those syslog lines (from
 the task) to pastebin/etc. since the lists may not be accepting
 attachments?
 
 On Tue, Aug 2, 2011 at 12:51 AM, Stevens, Keith D. steven...@llnl.gov wrote:
 Hi all,
 
 I'm running a simple mapreduce job that connects to an hbase table, reads 
 each row, counts some co-occurrence frequencies, and writes everything out 
 to hdfs at the end.  Everything seems to be going smoothly until the last 5, 
 out of 108, tasks run.  The last 5 tasks seem to be stuck initializing.  As 
 far as I can tell, setup is never called, and eventually, after 600 seconds, 
 the task is killed.  The task jumps around different nodes to try and run 
 but regardless of the node, it fails to initialize and is killed.
 
 My first guess is that it's trying to connect to an hbase region server and 
 failing, but I don't see anything like this in the task tracker nodes.  Here 
 are the log lines related to one of the failed tasks from the task trackers 
 logs:
 
 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: 
 LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
 state:UNASSIGNED
 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
 launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:01:08,889 INFO org.apache.hadoop.mapred.TaskTracker: In 
 TaskLauncher, current free slots : 1 and trying to launch 
 attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:01:12,243 INFO org.apache.hadoop.mapred.TaskTracker: JVM with 
 ID: jvm_201107281508_0028_m_-1189914759 given task: 
 attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:09,462 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0: Task 
 attempt_201107281508_0028_m_27_0 failed to report status for 600 
 seconds. Killing!
 2011-08-01 12:11:09,467 INFO org.apache.hadoop.mapred.TaskTracker: About to 
 purge task: attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:14,488 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.
 2011-08-01 12:11:14,489 INFO org.apache.hadoop.mapred.IndexCache: Map ID 
 attempt_201107281508_0028_m_27_0 not found in cache
 2011-08-01 12:11:14,495 INFO org.apache.hadoop.mapred.TaskTracker: 
 LaunchTaskAction (registerTask): attempt_201107281508_0028_m_27_0 task's 
 state:FAILED_UNCLEAN
 2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: Trying to 
 launch : attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:11:14,496 INFO org.apache.hadoop.mapred.TaskTracker: In 
 TaskLauncher, current free slots : 1 and trying to launch 
 attempt_201107281508_0028_m_27_0 which needs 1 slots
 2011-08-01 12:11:15,045 INFO org.apache.hadoop.mapred.TaskTracker: JVM with 
 ID: jvm_201107281508_0028_m_-1869983962 given task: 
 attempt_201107281508_0028_m_27_0
 2011-08-01 12:11:15,346 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0 0.0%
 2011-08-01 12:11:15,348 INFO org.apache.hadoop.mapred.TaskTracker: 
 attempt_201107281508_0028_m_27_0 0.0% cleanup
 2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: Task 
 attempt_201107281508_0028_m_27_0 is done.
 2011-08-01 12:11:15,349 INFO org.apache.hadoop.mapred.TaskTracker: reported 
 output size for attempt_201107281508_0028_m_27_0  was -1
 2011-08-01 12:11:15,354 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.
 2011-08-01 12:11:17,495 INFO org.apache.hadoop.mapred.TaskRunner: 
 attempt_201107281508_0028_m_27_0 done; removing files.
 
 And here are the syslog lines:
 In my job, I set the stats when i enter and exit setup, and I set counters 
 in map.  None of these are triggered for this task.  Nothing is written to 
 stderr or stdout, and the syslogs for the task have nothing beyond the 
 zookeeper client connection lines.
 
 Any thoughts as to what might be causing this issue?  Is there another log 
 that indicates which region server this task is trying to connect to?
 
 Thanks!
 --Keith Stevens
 
 
 
 -- 
 Harsh J



RE: Hadoop-streaming using binary executable c program

2011-08-01 Thread Daniel Yehdego

Hi Bobby, 

I have written a small Perl script which does the following job:

Assume we have an output from the mapper

MAP1
RNA-1
STRUCTURE-1

MAP2
RNA-2
STRUCTURE-2

MAP3
RNA-3
STRUCTURE-3

and what the script does is reduce it in the following manner: 
RNA-1RNA-2RNA-3\tSTRUCTURE-1STRUCTURE-2STRUCTURE-3\n
 and the script looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my @handles = map { open my $h, '<', $_; $h } @ARGV;

while (@handles) {
    @handles = grep { ! eof $_ } @handles;
    my @lines = map { my $v = <$_>; chomp $v; $v } @handles;
    print join(' ', @lines), "\n";
}

close $_ for @handles;

This should work for any input from the mapper. But after I ran Hadoop 
streaming with the above code as my reducer, the job was successful
but the output files were empty, and I couldn't figure out why.

 bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar 
-mapper ./hadoopPknotsRG 
-file /data/yehdego/hadoop-0.20.2/pknotsRG 
-file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG 
-reducer ./reducer.pl 
-file /data/yehdego/hadoop-0.20.2/reducer.pl  
-input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt 
-output /user/yehdego/RFR2-out -verbose

Any help or suggestion is really appreciated. I am just stuck here for the 
weekend.
 
Regards, 

Daniel T. Yehdego
Computational Science Program 
University of Texas at El Paso, UTEP 
dtyehd...@miners.utep.edu

 From: ev...@yahoo-inc.com
 To: common-user@hadoop.apache.org
 Date: Thu, 28 Jul 2011 07:12:11 -0700
 Subject: Re: Hadoop-streaming using binary executable c program
 
 I am not completely sure what you are getting at.  It looks like the output 
 of your C program is the following (and this is just a guess).  NOTE: \t stands for the tab 
 character, which in streaming is used to separate the key from the value, and \n 
 stands for the newline that is used to separate individual records.
 RNA-1\tSTRUCTURE-1\n
 RNA-2\tSTRUCTURE-2\n
 RNA-3\tSTRUCTURE-3\n
 ...
 
 
 And you want the output to look like
 RNA-1RNA-2RNA-3\tSTRUCTURE-1STRUCTURE-2STRUCTURE-3\n
 
 You could use a reduce to do this, but the issue here is with the shuffle in 
 between the maps and the reduces.  The Shuffle will group by the key to send 
 to the reducers and then sort by the key.  So in reality your map output 
 looks something like
 
 FROM MAP 1:
 RNA-1\tSTRUCTURE-1\n
 RNA-2\tSTRUCTURE-2\n
 
 FROM MAP 2:
 RNA-3\tSTRUCTURE-3\n
 RNA-4\tSTRUCTURE-4\n
 
 FROM MAP 3:
 RNA-5\tSTRUCTURE-5\n
 RNA-6\tSTRUCTURE-6\n
 
 If you send it to a single reducer (the only way to get a single file), then 
 the input to the reducer will be sorted alphabetically by the RNA, and the 
 order of the input will be lost.  You can work around this by giving each 
 line a unique number that is in the order you want it to be output.  But 
 doing this would require you to write some code.  I would suggest that you do 
 it with a small shell script after all the maps have completed to splice them 
 together.
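 Something along these lines might be enough (this assumes the maps were run 
 with -reducer NONE and wrote to the RF-out directory from your earlier 
 command; the awk line just concatenates the first and second tab-separated 
 fields across all the records, relying on the part files sorting in map order):
 
 # glue all map outputs into one sequence line and one structure line
 hadoop fs -cat /user/yehdego/RF-out/part-* \
   | awk -F'\t' '{ seq = seq $1; str = str $2 } END { print seq "\t" str }' > spliced.txt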
 
 --
 Bobby
 
 On 7/27/11 2:55 PM, Daniel Yehdego dtyehd...@miners.utep.edu wrote:
 
 
 
 Hi Bobby,
 
 I just want to ask you if there is a way of using a reducer or something like 
 concatenation to glue my outputs from the mapper and output
 them as a single file and segment of the predicted RNA 2D structure?
 
 FYI: I have used -reducer NONE before:
 
 HADOOP_HOME$ bin/hadoop jar
 /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper
 ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file
 /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input
 /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output
 /user/yehdego/RF-out -reducer NONE -verbose
 
 and a sample of my output using the mapper of two different slave nodes looks 
 like this :
 
 AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC
 and
 [...(((...))).].
   (-13.46)
 
 GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
 .(((.((......)..  (-11.00)
 
 and I want to concatenate and output them as a single predicted RNA sequence 
 structure:
 
 AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU
 
 [...(((...))).]..(((.((......)..
 
 
 Regards,
 
 Daniel T. Yehdego
 Computational Science Program
 University of Texas at El Paso, UTEP
 dtyehd...@miners.utep.edu
 
  From: dtyehd...@miners.utep.edu
  To: common-user@hadoop.apache.org
  Subject: RE: Hadoop-streaming using binary executable c program
  Date: Tue, 26 Jul 2011 16:23:10 +
 
 
  Good afternoon Bobby,
 
  Thanks so much, now it's working excellently. And the speed is also 
  reasonable. Once again, thank you.
 
  Regards,
 
  Daniel T. Yehdego
  Computational 

RE: Hadoop cluster network requirement

2011-08-01 Thread Michael Segel

Yeah, what he said.
It's never a good idea.
Forget about losing a NN or a rack; just think about losing connectivity between the 
data centers. (It happens more than you think.)
Your entire cluster in both data centers goes down. Boom!

It's a bad design. 

You're better off doing two different clusters.

Is anyone really trying to sell this as a design? That's even more scary.


 Subject: Re: Hadoop cluster network requirement
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 20:28:53 -0700
 To: common-user@hadoop.apache.org; saq...@margallacomm.com
 
 
 On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
 
  Thanks, I'm independently doing some digging into Hadoop networking
  requirements and 
  had a couple of quick follow-ups. Could I have some specific info on why
  different data centers 
  cannot be supported for master node and data node comms?
  Also, what 
  may be the benefits/use cases for such a scenario?
 
   Most people who try to put the NN and DNs in different data centers are 
 trying to achieve disaster recovery:  one file system in multiple locations.  
 That isn't the way HDFS is designed and it will end in tears. There are 
 multiple problems:
 
 1) no guarantee that one block replica will be in each data center (thereby 
 defeating the whole purpose!)
 2) assuming one can work out problem 1, during a network break the NN will 
 lose contact with one half of the DNs, causing a massive network replication 
 storm
 3) if one is using MR on top of this HDFS, the shuffle will likely kill the 
 network in between (making MR performance pretty dreadful) and is going to cause 
 delays for the DN heartbeats
 4) I don't even want to think about rebalancing.
 
   ... and I'm sure a lot of other problems I'm forgetting at the moment.  
 So don't do it.
 
   If you want disaster recovery, set up two completely separate HDFSes 
 and run everything in parallel.
  

How to access contents of a Map Reduce job's working directory

2011-08-01 Thread Shrish Bajpai
I have just started to explore Hadoop, but I am stuck in a situation now.

I want to run a MapReduce job in Hadoop which needs to create a setup
folder in the working directory. During execution the job will generate
some additional text files within this setup folder. The problem is I
don't know how to access or move this setup folder's content to my local file
system, because at the end of the job the job directory will be cleaned up.

It would be great if you can help.

Regards

Shrish



Re: Hadoop cluster network requirement

2011-08-01 Thread Mohit Anchlia
Assuming everything is up, this solution still will not scale given the latency, 
TCP/IP buffers, sliding window, etc. See BDP (bandwidth-delay product).

Sent from my iPad

On Aug 1, 2011, at 4:57 PM, Michael Segel michael_se...@hotmail.com wrote:

 
 Yeah what he said.
 Its never a good idea.
 Forget about losing a NN or a Rack, but just losing connectivity between data 
 centers. (It happens more than you think.)
 Your entire cluster in both data centers go down. Boom!
 
 Its a bad design. 
 
 You're better off doing two different clusters.
 
 Is anyone really trying to sell this as a design? That's even more scary.
 
 
 Subject: Re: Hadoop cluster network requirement
 From: a...@apache.org
 Date: Sun, 31 Jul 2011 20:28:53 -0700
 To: common-user@hadoop.apache.org; saq...@margallacomm.com
 
 
 On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
 
 Thanks, I'm independently doing some digging into Hadoop networking
 requirements and 
 had a couple of quick follow-ups. Could I have some specific info on why
 different data centers 
 cannot be supported for master node and data node comms?
 Also, what 
 may be the benefits/use cases for such a scenario?
 
Most people who try to put the NN and DNs in different data centers are 
 trying to achieve disaster recovery:  one file system in multiple locations. 
  That isn't the way HDFS is designed and it will end in tears. There are 
 multiple problems:
 
 1) no guarantee that one block replica will be each data center (thereby 
 defeating the whole purpose!)
 2) assuming one can work out problem 1, during a network break, the NN will 
 lose contact from one half of the  DNs, causing a massive network 
 replication storm
 3) if one using MR on top of this HDFS, the shuffle will likely kill the 
 network in between (making MR performance pretty dreadful) is going to cause 
 delays for the DN heartbeats
 4) I don't even want to think about rebalancing.
 
... and I'm sure a lot of other problems I'm forgetting at the moment.  
 So don't do it.
 
If you want disaster recovery, set up two completely separate HDFSes and 
 run everything in parallel.
 


Hive-HBase Integration Jar Question

2011-08-01 Thread Neerja Bhatnagar
Hi,

I am using

hive-hbase-handler-0.7.0-cdh3u0.jar (under hive-0.7.0-cdh3u0/lib)
thrift-fb303-0.5.0.jar (under hive-0.7.0-cdh3u0/lib)
thrift-0.2.0.jar (under hbase-0.90.1-cdh3u0/lib)

in my project.

We use Maven; could anyone please tell me where I can get the pom
information for these jars?

-- 
Thank you! Neerja


Re: Hive-HBase Integration Jar Question

2011-08-01 Thread Mayuresh
In our case we have our own Maven repo where we uploaded these jars. You can
also install them in your local repo from the command line if you don't have
your own Maven repo.
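For example, something like this for each jar (the groupId/artifactId below are
just a guess; use whatever coordinates your poms expect):

mvn install:install-file -Dfile=hive-hbase-handler-0.7.0-cdh3u0.jar \
    -DgroupId=org.apache.hive -DartifactId=hive-hbase-handler \
    -Dversion=0.7.0-cdh3u0 -Dpackaging=jar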
On Aug 2, 2011 7:00 AM, Neerja Bhatnagar bnee...@gmail.com wrote:
 Hi,

 I am using

 hive-hbase-handler-0.7.0-cdh3u0.jar (under hive-0.7.0-cdh3u0/lib)
 thrift-fb303-0.5.0.jar (under hive-0.7.0-cdh3u0/lib)
 thrift-0.2.0.jar (under hbase-0.90.1-cdh3u0/lib)

 in my project.

 We use Maven; could anyone please tell me where I can get the pom
 information for these jars?

 --
 Thank you! Neerja


Max Number of Open Connections

2011-08-01 Thread jagaran das


Hi,

What is the max number of open connections to a namenode?

I am using 


FSDataOutputStream out = dfs.create(src);

Cheers,
JD 


Re: maprd vs mapreduce api

2011-08-01 Thread Roger Chen
Your reducer is writing IntWritable but your output format class is still
Text. Change one of those so they match the other.
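For what it's worth, one way to get the identity behaviour on 0.20 without
writing NOOP classes at all is to line the declared types up with what
TextInputFormat actually emits (LongWritable offsets and Text lines); a rough
sketch, not necessarily the only fix:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TestDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "testdriver");
        job.setJarByClass(TestDriver.class);

        // TextInputFormat hands the mapper <LongWritable, Text> records,
        // so the declared key/value classes have to match that.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The base Mapper and Reducer classes are already pass-throughs in
        // the new API, so they can stand in for IdentityMapper/IdentityReducer.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        FileInputFormat.addInputPath(job, new Path("In"));
        FileOutputFormat.setOutputPath(job, new Path("Out"));

        job.waitForCompletion(true);
    }
}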

On Mon, Aug 1, 2011 at 8:40 PM, garpinc garp...@hotmail.com wrote:


 I was following this tutorial on version 0.19.1

 http://v-lad.org/Tutorials/Hadoop/23%20-%20create%20the%20project.html

 However, I wanted to use the latest version of the API, 0.20.2.

 The original code in tutorial had following lines
 conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
 conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

 both Identity classes are deprecated, so it seemed the solution was to create
 the mapper and reducer as follows:
  public static class NOOPMapper
  extends Mapper<Text, IntWritable, Text, IntWritable> {

    public void map(Text key, IntWritable value, Context context
    ) throws IOException, InterruptedException {

      context.write(key, value);

    }
  }

  public static class NOOPReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      context.write(key, result);
    }
  }


 And then with code:
    Configuration conf = new Configuration();
    Job job = new Job(conf, "testdriver");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path("In"));
    FileOutputFormat.setOutputPath(job, new Path("Out"));

    job.setMapperClass(NOOPMapper.class);
    job.setReducerClass(NOOPReducer.class);

    job.waitForCompletion(true);


 However I get this message
 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
 cast to org.apache.hadoop.io.Text
at TestDriver$NOOPMapper.map(TestDriver.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 11/08/01 16:41:01 INFO mapred.JobClient:  map 0% reduce 0%
 11/08/01 16:41:01 INFO mapred.JobClient: Job complete: job_local_0001
 11/08/01 16:41:01 INFO mapred.JobClient: Counters: 0



  Can anyone tell me what I need for this to work?

 Attached is full code..
 http://old.nabble.com/file/p32174859/TestDriver.java TestDriver.java
 --
 View this message in context:
 http://old.nabble.com/maprd-vs-mapreduce-api-tp32174859p32174859.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Roger Chen
UC Davis Genome Center


The best architecture for EC2/Hadoop interface?

2011-08-01 Thread Mark Kerzner
Hi,

I want to give my users a GUI that would allow them to start Hadoop clusters
and run applications that I will provide on the AMIs. What would be a good
approach to make it simple for the user? Should I write a Java Swing app
that will wrap around the EC2 commands? Should I use some more direct EC2
API? Or should I use a web browser interface?

My idea was to give the user a Java Swing GUI, so that he gives his Amazon
credentials to it, and it would be secure because the application is not
exposed to the outside. Does this approach make sense?

Thank you,
Mark

My project for which I want to do it: https://github.com/markkerzner/FreeEed