I modified DataJoinJob.createDataJoinJob() slightly:

    if (args[7].compareToIgnoreCase("text") != 0) {
      SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
    job.setBoolean("mapred.output.compress", false);

But I still see non-text output:

    output/part-00000.deflate

'hadoop fs -text output/part-00000.deflate' doesn't show readable text either.

On Mon, Mar 29, 2010 at 9:26 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I can run the sample (I created the input files according to
> contrib/data_join/src/examples/org/apache/hadoop/contrib/utils/join/README.txt):
>
> [r...@tyu-linux datajoin]# pwd
> /opt/ks/hadoop-0.20.2/build/contrib/datajoin
> [r...@tyu-linux datajoin]# /opt/ks/hadoop-0.20.2/bin/hadoop jar
> hadoop-0.20.2-datajoin-examples.jar
> org.apache.hadoop.contrib.utils.join.DataJoinJob input output Text 1
> org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper
> org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer
> org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text
> Using TextInputFormat: Text
> Using TextOutputFormat: Text
> 10/03/29 09:01:30 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
> 10/03/29 09:01:30 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 10/03/29 09:01:30 INFO mapred.FileInputFormat: Total input paths to process : 2
> Job job_local_0001 is submitted
> Job job_local_0001 is still running.
> 10/03/29 09:01:30 INFO mapred.FileInputFormat: Total input paths to process : 2
> 10/03/29 09:01:31 INFO mapred.MapTask: numReduceTasks: 1
> 10/03/29 09:01:31 INFO mapred.MapTask: io.sort.mb = 100
> 10/03/29 09:01:31 INFO mapred.MapTask: data buffer = 79691776/99614720
> 10/03/29 09:01:31 INFO mapred.MapTask: record buffer = 262144/327680
> 10/03/29 09:01:31 INFO mapred.MapTask: Starting flush of map output
> 10/03/29 09:01:31 INFO mapred.MapTask: Finished spill 0
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner: collectedCount 6
> totalCount 6
>
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
> 10/03/29 09:01:32 INFO mapred.MapTask: numReduceTasks: 1
> 10/03/29 09:01:32 INFO mapred.MapTask: io.sort.mb = 100
> 10/03/29 09:01:32 INFO mapred.MapTask: data buffer = 79691776/99614720
> 10/03/29 09:01:32 INFO mapred.MapTask: record buffer = 262144/327680
> 10/03/29 09:01:32 INFO mapred.MapTask: Starting flush of map output
> 10/03/29 09:01:32 INFO mapred.MapTask: Finished spill 0
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner: collectedCount 5
> totalCount 5
>
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner:
> 10/03/29 09:01:32 INFO mapred.Merger: Merging 2 sorted segments
> 10/03/29 09:01:32 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 939 bytes
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner:
> 10/03/29 09:01:32 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 10/03/29 09:01:32 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 10/03/29 09:01:32 INFO datajoin.job: key: A.a11 this.largestNumOfValues: 3
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner:
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
> 10/03/29 09:01:32 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/opt/kindsight/hadoop-0.20.2/build/contrib/datajoin/output
> 10/03/29 09:01:32 INFO mapred.LocalJobRunner: actuallyCollectedCount 5
> collectedCount 7
> groupCount 6
> > reduce
> 10/03/29 09:01:32 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
> [r...@tyu-linux datajoin]# date
> Mon Mar 29 09:02:37 PDT 2010
>
> It took a minute between the last INFO log and exit of DataJoinJob.
>
> Cheers
>
> On Mon, Mar 29, 2010 at 8:26 AM, M B <machac...@gmail.com> wrote:
>
>> Sorry, I should have mentioned that I tried that as well and it also gives
>> an error:
>>
>> $ <p...@hadoop01:~/hadoop_tests$> hadoop jar -libjars ./samplejoin.jar
>> /opt/hadoop-0.20.2/contrib/datajoin/hadoop-0.20.2-datajoin.jar
>> org.apache.hadoop.contrib.utils.join.DataJoinJob datajoin/input
>> datajoin/output Text 1
>> org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper
>> org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer
>> org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text
>> Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
>> Caused by: java.util.zip.ZipException: error in opening zip file
>>     at java.util.zip.ZipFile.open(Native Method)
>>     at java.util.zip.ZipFile.<init>(ZipFile.java:114)
>>     at java.util.jar.JarFile.<init>(JarFile.java:133)
>>     at java.util.jar.JarFile.<init>(JarFile.java:70)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
>> Has something changed or is my environment not set up correctly?
>> Appreciate any help.
>>
>> On Fri, Mar 26, 2010 at 8:23 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > Then use the syntax given by
>> > http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html :
>> >
>> > $ bin/hadoop jar -libjars ./samplejoin.jar
>> > /opt/hadoop-0.20.2/contrib/datajoin/hadoop-0.20.2-datajoin.jar
>> > org.apache.hadoop.contrib.utils.join.DataJoinJob datajoin/input ...
>> >
>> > On Fri, Mar 26, 2010 at 5:10 PM, M B <machac...@gmail.com> wrote:
>> >
>> > > Sorry, but where exactly do I include the libjars option? I tried to put it
>> > > where you stated (after the DataJoinJob class), but it just comes back with
>> > > usage information (as if the option is not valid):
>> > > $ <p...@hadoop01:~/hadoop_tests$> hadoop jar
>> > > /opt/hadoop-0.20.2/contrib/datajoin/hadoop-0.20.2-datajoin.jar
>> > > org.apache.hadoop.contrib.utils.join.DataJoinJob -libjars ./samplejoin.jar
>> > > datajoin/input datajoin/output Text 1
>> > > org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper
>> > > org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer
>> > > org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text
>> > > *usage: DataJoinJob inputdirs outputdir map_input_file_format numofParts
>> > > mapper_class reducer_class map_output_value_class output_value_class
>> > > [maxNumOfValuesPerGroup [descriptionOfJob]]]*
>> > >
>> > > It seems like it's not taking the option for some reason, like it's failing
>> > > an argument check in DataJoinJob - does that not use the standard args or
>> > > something?
>> > >
>> > > On Fri, Mar 26, 2010 at 4:38 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > >
>> > > > DataJoinJob is contained in hadoop-0.20.2-datajoin.jar which is in your
>> > > > HADOOP_CLASSPATH
>> > > >
>> > > > I think you should specify samplejoin.jar using -libjars instead of putting
>> > > > it directly after jar command:
>> > > > hadoop jar hadoop-0.20.2-datajoin.jar
>> > > > org.apache.hadoop.contrib.utils.join.DataJoinJob -libjars ./samplejoin.jar
>> > > > ... (same as your example)
>> > > >
>> > > > Cheers
>> > > >
>> > > > On Fri, Mar 26, 2010 at 3:24 PM, M B <machac...@gmail.com> wrote:
>> > > >
>> > > > > I may be having a setup issue with classpaths, would appreciate some help.
>> > > > >
>> > > > > I created a jar with all the Sample* classes in contrib/DataJoin. Here is
>> > > > > the listing of my samplejoin.jar file:
>> > > > > " zip.vim version v22
>> > > > > " Browsing zipfile /home/hadoop/hadoop_tests/samplejoin.jar
>> > > > > " Select a file with cursor and press ENTER
>> > > > > META-INF/
>> > > > > META-INF/MANIFEST.MF
>> > > > > org/
>> > > > > org/apache/
>> > > > > org/apache/hadoop/
>> > > > > org/apache/hadoop/contrib/
>> > > > > org/apache/hadoop/contrib/utils/
>> > > > > org/apache/hadoop/contrib/utils/join/
>> > > > > org/apache/hadoop/contrib/utils/join/SampleDataJoinReducer.class
>> > > > > org/apache/hadoop/contrib/utils/join/SampleTaggedMapOutput.class
>> > > > > org/apache/hadoop/contrib/utils/join/SampleDataJoinMapper.class
>> > > > >
>> > > > > When I go to run this, things start to run, but every Map try errors out with:
>> > > > > "java.lang.RuntimeException: java.lang.ClassNotFoundException:
>> > > > > org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput"
>> > > > >
>> > > > > Here is the command:
>> > > > > hadoop jar ./samplejoin.jar
>> > > > > org.apache.hadoop.contrib.utils.join.DataJoinJob
>> > > > > datajoin/input datajoin/output Text 1
>> > > > > org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper
>> > > > > org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer
>> > > > > org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text
>> > > > >
>> > > > > This is a new install of 0.20.2.
>> > > > >
>> > > > > HADOOP_CLASSPATH is set
>> > > > > to: /opt/hadoop-0.20.2/contrib/datajoin/hadoop-0.20.2-datajoin.jar
>> > > > > Any help would be appreciated.
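
For inspecting a part file like the output/part-00000.deflate mentioned at the top of this thread, here is a minimal sketch that inflates a local file and prints it as text. It assumes the file is a plain zlib stream (which is what Hadoop's default codec normally writes for .deflate output) and that the content is UTF-8; DeflateCat and inflateToString are hypothetical names, not part of Hadoop, and this does not explain why the job still compressed its output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper (not part of Hadoop): prints a zlib-compressed file as text. */
final class DeflateCat {

    /** Inflates a zlib stream fully and decodes the bytes as UTF-8 (assumed encoding). */
    public static String inflateToString(InputStream compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // By default InflaterInputStream expects the zlib wrapper (header + checksum).
        try (InputStream in = new InflaterInputStream(compressed)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DeflateCat <local .deflate file>");
            System.exit(2);
        }
        // Reads from the local file system only; it knows nothing about HDFS paths.
        try (InputStream file = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(inflateToString(file));
        }
    }
}
```

If the output lives on HDFS rather than the local file system, the file would first have to be copied locally, for example with 'hadoop fs -get'.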