contrib join package
Hi,

Is there any detailed documentation on the org.apache.hadoop.contrib.utils.join package?

I have a simple join task consisting of 2 input datasets. Each contains tab-separated records.

Set1: record format = field1\tfield2\tfield3\tfield4\tfield5
Set2: record format = field1\tfield2\tfield3

Join criterion: Set1.field1 = Set2.field1
Output: Set2.field2\tSet1.field2\tSet1.field3\tSet1.field4

The org.apache.hadoop.contrib.utils.join package contains the DataJoinMapperBase and DataJoinReducerBase abstract classes, and a TaggedMapOutput class which should be the base class for the mapper output values. But there aren't any examples showing how these classes should be used to implement inner or outer joins in a generic manner.

If anybody has used this package and would like to share their experience, please let me know.

Thanks,
Rahul Sood
[EMAIL PROTECTED]
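For what it's worth, here is a minimal plain-Java sketch of the reduce-side inner join described above. The contrib classes themselves are not reproduced here; in the real package this logic would live in a DataJoinReducerBase subclass's combine step, operating on tagged Hadoop writables rather than plain strings. The class and method names below (JoinSketch, combine, innerJoin) are made up for illustration; only the record layouts come from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the reduce-side inner join described above.
// Hadoop types are left out so the field rearrangement itself is easy
// to see: join on field1, emit
//   Set2.field2 \t Set1.field2 \t Set1.field3 \t Set1.field4
public class JoinSketch {

    // Combine one Set1 record and one Set2 record that share field1.
    static String combine(String set1Record, String set2Record) {
        String[] f1 = set1Record.split("\t");
        String[] f2 = set2Record.split("\t");
        return f2[1] + "\t" + f1[1] + "\t" + f1[2] + "\t" + f1[3];
    }

    // Inner join: emit output only for keys present in both datasets.
    static List<String> innerJoin(List<String> set1, List<String> set2) {
        Map<String, String> byKey = new HashMap<>();
        for (String rec : set2) {
            byKey.put(rec.split("\t", 2)[0], rec);
        }
        List<String> out = new ArrayList<>();
        for (String rec : set1) {
            String match = byKey.get(rec.split("\t", 2)[0]);
            if (match != null) {
                out.add(combine(rec, match));
            }
        }
        return out;
    }
}
```

An outer join would differ only in also emitting records whose key has no match, with empty fields substituted for the missing side.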
Re: Not allow file split
You can implement a custom input format and a record reader. Assuming your record data type is a class RecType, the input format should subclass FileInputFormat<LongWritable, RecType> and the record reader should implement RecordReader<LongWritable, RecType>. In this case the key could be the offset into the file, although it is not very useful since you treat the entire file as one record.

The isSplitable() method in the input format should return false. The RecordReader.next(LongWritable pos, RecType val) method should read the entire file and set val to the file contents. This will ensure that the entire file goes to one map task as a single record.

-Rahul Sood
[EMAIL PROTECTED]

> Hi all,
> I'm a newbie and I have the following problem. I need to implement an
> InputFormat whose isSplitable() always returns false, as shown in
> http://wiki.apache.org/hadoop/FAQ (question no. 10). And here is the
> problem: I also have to implement the RecordReader interface to return
> the whole content of the input file, but I don't know how. I have only
> found examples that use the LineRecordReader.
> Can someone help me?
> Thanks
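The heart of that next() method is draining the whole stream into one value. A minimal sketch of just that step, with the Hadoop types left out (in a real RecordReader the stream would come from FileSystem.open() on the split's path, and the bytes would be copied into the value object):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the "read the whole file as one record" step that
// RecordReader.next() would perform. In a real RecordReader the stream
// would be an FSDataInputStream opened on the split's file, and the
// result would populate the value (e.g. a BytesWritable or custom type).
public class WholeFileReader {

    // Drain the entire stream into a byte array in one pass.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }
}
```

Note that next() must also return true only on its first call and false afterwards, so the framework sees exactly one record per file.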
Java inputformat for pipes job
Hi,

I implemented a customized input format in Java for a Map Reduce job. The mapper and reducer classes are implemented in C++, using the Hadoop Pipes API. The package documentation for org.apache.hadoop.mapred.pipes states that "The job may consist of any combination of Java and C++ RecordReaders, Mappers, Partitioner, Combiner, Reducer, and RecordWriter."

I packaged the input format class in a jar file and ran the job invocation command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat \
  -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

It keeps failing with a ClassNotFoundException. Although I've specified the jar file name with the -jar parameter, the input format class still cannot be located. Is there any other means to specify the input format class, or the job jar file, for a Pipes job?

Stack trace:

Exception in thread "main" java.lang.ClassNotFoundException: mytest.PriceInputFormat
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
        at org.apache.hadoop.mapred.pipes.Submitter.getClass(Submitter.java:309)
        at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:357)

Thanks,
Rahul Sood
[EMAIL PROTECTED]
Re: Java inputformat for pipes job
I'm invoking hadoop with the pipes command:

hadoop pipes -jar mytest.jar -inputformat mytest.PriceInputFormat \
  -conf conf/mytest.xml -input mgr/in -output mgr/out -program mgr/bin/TestMgr

I tried the -file and -cacheFile options, but when either of these is passed to hadoop pipes, the command just exits with a usage message.

There must be a way to specify a jar for a job implemented in C++ with the Hadoop Pipes API. The documentation states that record readers and writers for Pipes jobs can be implemented in Java. I looked at the source code of org.apache.hadoop.mapred.pipes.Submitter and it's doing the following:

/**
 * The main entry point and job submitter. It may either be used as
 * a command line-based or API-based method to launch Pipes jobs.
 */
public class Submitter {

  /**
   * Submit a pipes job based on the command line arguments.
   * @param args
   */
  public static void main(String[] args) throws Exception {
    CommandLineParser cli = new CommandLineParser();
    //...
    if (results.hasOption("-inputformat")) {
      setIsJavaRecordReader(conf, true);
      conf.setInputFormat(getClass(results, "-inputformat", conf,
                                   InputFormat.class));
    }
  }
}

It is loading the input format class based on the value of the -inputformat command-line parameter. That means there should be some way to package the input format class along with the program binary and other supporting files.

-Rahul Sood
[EMAIL PROTECTED]

> You should use the -pipes option in the command. For the input format,
> you can pack it into the hadoop core class jar file, or put it into the
> cache file.
>
> 2008/4/8, Rahul Sood [EMAIL PROTECTED]:
> > Hi,
> > I implemented a customized input format in Java for a Map Reduce job.
> > The mapper and reducer classes are implemented in C++, using the
> > Hadoop Pipes API.
Pipes task being killed
Hi,

We have a Pipes C++ application where the reduce task does a lot of computation. After some time the task gets killed by the Hadoop framework. The job output shows the following error:

Task task_200803051654_0001_r_00_0 failed to report status for 604 seconds. Killing!

Is there any way to send a heartbeat to the TaskTracker from a Pipes application? I believe this is possible in Java using org.apache.hadoop.util.Progress, and we're looking for something equivalent in the C++ Pipes API.

-Rahul Sood
[EMAIL PROTECTED]
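On the Java side, the usual workaround for a long-running reduce is a background thread that fires a progress callback at a fixed interval. Here is a self-contained sketch of that pattern; the Runnable is a stand-in for the real callback (in a Java task that would be the reporter's progress call), and the Heartbeat class name is made up for illustration. Whether the C++ Pipes context offers an equivalent hook is exactly the open question above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Heartbeat pattern: a daemon thread invokes a progress callback at a
// fixed interval while a long computation runs, so the framework's
// report-status timeout is never hit. The Runnable stands in for the
// task's real progress call.
public class Heartbeat implements AutoCloseable {

    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat");
            t.setDaemon(true); // never keeps the JVM alive on its own
            return t;
        });

    public Heartbeat(Runnable progress, long intervalMillis) {
        timer.scheduleAtFixedRate(progress, intervalMillis,
                                  intervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        timer.shutdownNow(); // stop heartbeats when the work is done
    }
}
```

Usage would wrap the expensive part of the reduce, e.g. try (Heartbeat hb = new Heartbeat(callback, 60_000)) { /* long computation */ }, so the heartbeat stops automatically when the block exits.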