contrib join package

2008-09-05 Thread Rahul Sood
Hi,

Is there any detailed documentation on the
org.apache.hadoop.contrib.utils.join package? I have a simple join task
consisting of two input datasets. Each contains tab-separated records.

Set1: Record format = field1\tfield2\tfield3\tfield4\tfield5

Set2: Record format = field1\tfield2\tfield3

Join criterion: Set1.field1 = Set2.field1

Output: Set2.field2\tSet1.field2\tSet1.field3\tSet1.field4

The org.apache.hadoop.contrib.utils.join package contains the
DataJoinMapperBase and DataJoinReducerBase abstract classes, and a
TaggedMapOutput class intended as the base class for mapper output
values. But there are no examples showing how these classes should be
used to implement inner or outer joins in a generic manner.
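For what it's worth, the flow those classes implement can be sketched
without Hadoop at all: tag each record with its source, group records by
the join key (field1), and combine one record from each source per key.
The class and method names below are mine, not part of the contrib API;
a minimal inner-join sketch for the record formats above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InnerJoinSketch {
    // Inner join of two sets of tab-separated records on field1,
    // emitting Set2.field2 \t Set1.field2 \t Set1.field3 \t Set1.field4.
    public static List<String> join(List<String> set1, List<String> set2) {
        // Index Set1 by its join key (field1). This plays the role of
        // the per-key grouping the DataJoin reducer performs.
        Map<String, String[]> byKey = new HashMap<>();
        for (String rec : set1) {
            String[] f = rec.split("\t");
            byKey.put(f[0], f);
        }
        List<String> out = new ArrayList<>();
        for (String rec : set2) {
            String[] f2 = rec.split("\t");
            String[] f1 = byKey.get(f2[0]);
            if (f1 == null) {
                continue; // inner join: unmatched keys are dropped
            }
            out.add(f2[1] + "\t" + f1[1] + "\t" + f1[2] + "\t" + f1[3]);
        }
        return out;
    }
}
```

An outer join would differ only in how unmatched keys are handled:
emit a record with empty fields instead of skipping it.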
If anybody has used this package and would like to share their
experience, please let me know.

Thanks,

Rahul Sood
[EMAIL PROTECTED]



Re: Not allow file split

2008-05-07 Thread Rahul Sood
You can implement a custom input format and a record reader. Assuming
your record data type is the class RecType, the input format should
subclass FileInputFormat<LongWritable, RecType> and the record reader
should implement RecordReader<LongWritable, RecType>.

In this case the key could be the offset into the file, although it is
not very useful since you treat the entire file as one record. 

The isSplitable() method in the input format should return false.
The RecordReader.next(LongWritable pos, RecType val) method should
read the entire file and set val to the file contents. This will ensure
that the entire file goes to one map task as a single record.
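As a plain-Java illustration of what that next() implementation boils
down to (the names here are illustrative, not the Hadoop API): read the
whole file on the first call, then signal end of input on every later
call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileReaderSketch {
    private final Path file;
    private boolean consumed = false;

    public WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Mirrors RecordReader.next(key, value): hands back the entire file
    // as one record on the first call, then reports no more records.
    public byte[] next() throws IOException {
        if (consumed) {
            return null; // corresponds to next() returning false
        }
        consumed = true;
        return Files.readAllBytes(file);
    }
}
```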

-Rahul Sood
[EMAIL PROTECTED]

 Hi all, I'm a newbie and I have the following problem.
 
 I need to implement an InputFormat whose isSplitable() method always
 returns false, as shown in http://wiki.apache.org/hadoop/FAQ (question
 no 10). And here is the problem.
 
 I also have to implement the RecordReader interface to return the
 whole content of the input file, but I don't know how. I have found
 only examples that use the LineRecordReader.
 
 Can someone help me?
 
 Thanks
 



Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
Hi,

I implemented a custom input format in Java for a MapReduce job.
The mapper and reducer classes are implemented in C++, using the Hadoop
Pipes API.

The package documentation for org.apache.hadoop.mapred.pipes states that
"The job may consist of any combination of Java and C++ RecordReaders,
Mappers, Partitioner, Combiner, Reducer, and RecordWriter."

I packaged the input format class in a jar file and ran the job
invocation command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

It keeps failing with a ClassNotFoundException. Although I've specified
the jar file name with the -jar parameter, the input format class still
cannot be located. Is there any other means to specify the input format
class, or the job jar file, for a Pipes job?

Stack trace:

Exception in thread "main" java.lang.ClassNotFoundException:
mytest.PriceInputFormat
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)

Thanks,

Rahul Sood
[EMAIL PROTECTED]




Re: Java inputformat for pipes job

2008-04-08 Thread Rahul Sood
I'm invoking Hadoop with the pipes command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat -conf
conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

I tried the -file and -cacheFile options but when either of these is
passed to hadoop pipes, the command just exits with a usage message.

There must be a way to specify a jar for a job implemented in C++ with
the Hadoop Pipes API. The documentation states that record readers and
writers for Pipes jobs can be implemented in Java. I looked at the
source code of org.apache.hadoop.mapred.pipes.Submitter and it's doing
the following:

/**
 * The main entry point and job submitter. It may either be used as
 * a command line-based or API-based method to launch Pipes jobs.
 */
public class Submitter {

  /**
   * Submit a pipes job based on the command line arguments.
   * @param args
   */
  public static void main(String[] args) throws Exception {
    CommandLineParser cli = new CommandLineParser();
    // ...
    if (results.hasOption("-inputformat")) {
      setIsJavaRecordReader(conf, true);
      conf.setInputFormat(getClass(results, "-inputformat", conf,
                                   InputFormat.class));
    }
  }
}

It loads the input format class based on the value of the -inputformat
command-line parameter. That means there should be some way to package
the input format class along with the program binary and other
supporting files.
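That matches the stack trace: Submitter resolves the class through
Configuration.getClassByName, i.e. an ordinary classloader lookup in the
client JVM, so the class has to be on the submitter's own classpath, not
merely named by -jar. (One thing worth trying, though I haven't verified
it for this Hadoop version, is adding the jar to HADOOP_CLASSPATH before
invoking the command.) A minimal illustration of the lookup that fails:

```java
public class ClassLookupSketch {
    // Returns true iff the class can be resolved on this JVM's
    // classpath. This is the same check that ends in
    // ClassNotFoundException inside Submitter.
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```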

-Rahul Sood
[EMAIL PROTECTED]

 You should use the -pipes option in the command.
 For the input format, you can pack it into the hadoop core class jar file,
 or put it into the cache file.
 
 2008/4/8, Rahul Sood [EMAIL PROTECTED]:
 
  [...]



Pipes task being killed

2008-03-05 Thread Rahul Sood
Hi,

We have a Pipes C++ application where the Reduce task does a lot of
computation. After some time the task gets killed by the Hadoop
framework. The job output shows the following error:

Task task_200803051654_0001_r_00_0 failed to report status for 604
seconds. Killing!

Is there any way to send a heartbeat to the TaskTracker from a Pipes
application? I believe this is possible in Java using
org.apache.hadoop.util.Progress, and we're looking for something
equivalent in the C++ Pipes API.
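Whatever the exact Pipes call turns out to be, the underlying pattern is
the same: keep invoking the framework's progress callback while the long
computation runs, so the timeout never fires. A generic Java sketch of
that pattern, where the Runnable is a hypothetical stand-in for the real
API's progress hook:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatSketch {
    // Runs a long computation while a side thread fires the progress
    // callback at a fixed interval, so the task is not declared dead.
    public static <T> T computeWithHeartbeat(Callable<T> work,
                                             Runnable progress,
                                             long intervalMs) throws Exception {
        ScheduledExecutorService ticker =
            Executors.newSingleThreadScheduledExecutor();
        ticker.scheduleAtFixedRate(progress, 0, intervalMs,
                                   TimeUnit.MILLISECONDS);
        try {
            return work.call();
        } finally {
            ticker.shutdownNow(); // stop the heartbeat once work is done
        }
    }
}
```

If the C++ API exposes a direct progress method on the task context, the
simpler option is to call it periodically from inside the reduce loop
itself, with no extra thread.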

-Rahul Sood
[EMAIL PROTECTED]