Cheers Alan, Done.
Rob.

2010/1/14 Alan Gates <[email protected]>

> Rob,
>
> Feel free to update the wiki with your findings. You don't have to be a
> committer to change the wiki.
>
> Alan.
>
> On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote:
>
>> Hello Dmitriy!
>>
>> I have it solved; it was just a bit of trial and error based on the Hive
>> bug report/fix I found.
>>
>> The report is indeed correct. The following works:
>>
>> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>
>> This puts the Pig wiki out of date for Hadoop 0.20, but it is still
>> relevant for Hadoop 0.18 and earlier.
>>
>> May I propose that you update the wiki as such:
>> ------------------------
>> DataGenerator Usage:
>> For 0.18.0
>>
>> hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
>>
>> For 0.20.0
>>
>> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file [options] colspec...
>> --------------
>>
>> Sound OK?
>>
>> Rob Stewart
>>
>> 2010/1/14 Rob Stewart <[email protected]>
>>
>>> Yeah, unfortunately your suggestion does not work, and neither does the
>>> order given on the Pig wiki.
>>> Instead, see the Hadoop wiki for -libjars usage:
>>>
>>> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output
>>>
>>> So I tried this:
>>> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat -libjars $zipfjar s:8:50:z:0
>>>
>>> However, the DataGenerator does not accept it as one of its options:
>>> ---------
>>> Couldn't parse the command line arguments, Found unknown option (-libjars) at position 5
>>> ---------
>>>
>>> I'd be happy (and surprised) to hear from anyone who can use the format
>>> given on the Pig wiki for the DataGenerator in cluster mode (using the
>>> -m parameter).
>>>
>>> Any more suggestions, Dmitriy? And thanks for your help, it's mucho
>>> appreciated!
>>>
>>> Rob
>>>
>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>
>>>> Sorry if I am not reading carefully enough, but the bug report you
>>>> cite seems to indicate you want
>>>>
>>>> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar $datagenjar -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>
>>>> (possibly separating zipfjar and datagenjar with commas if that patch
>>>> was applied to your version of 20)
>>>>
>>>> which I don't see in the list of things you tried?
>>>>
>>>> -D
>>>>
>>>> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart <[email protected]> wrote:
>>>>
>>>>> Hi Dmitriy,
>>>>>
>>>>> No, I do think that there was a change in 0.20.0.
>>>>>
>>>>> See the error I get:
>>>>> Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
>>>>>
>>>>> This is what I am trying to run:
>>>>> hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>>
>>>>> The $zipfjar variable contains only one jar file. It seems that there
>>>>> was a change in Hadoop 0.20.0 that no longer allows the -libjars
>>>>> option immediately after "hadoop jar".
>>>>>
>>>>> This is the extract from the Hive bug report I was talking about:
>>>>> -------------
>>>>>
>>>>> In hadoop-20 - the -libjars has to come after the jar file/class
>>>>>
>>>>> Please try applying this patch to bin/ext/cli.sh
>>>>>
>>>>> --- cli.sh (revision 789726)
>>>>> +++ cli.sh (working copy)
>>>>> @@ -10,7 +10,7 @@
>>>>> exit 3;
>>>>> fi
>>>>>
>>>>> - exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
>>>>> + exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>>>>> }
>>>>>
>>>>> ----------------
>>>>>
>>>>> I have also tried:
>>>>> hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>>
>>>>> This gives the same error.
>>>>>
>>>>> Rob
>>>>>
>>>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>>>
>>>>>> I think the link you sent got malformatted, but try separating the
>>>>>> jars with a comma
>>>>>> http://issues.apache.org/jira/browse/HADOOP-4864
>>>>>>
>>>>>> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitriy,
>>>>>>>
>>>>>>> OK, well it seems that since 0.20.0 the order as specified on the Pig
>>>>>>> wiki no longer works:
>>>>>>> hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
>>>>>>>
>>>>>>> See this patch over at Hive for 0.20.0:
>>>>>>> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
>>>>>>>
>>>>>>> I have tried a few combinations, but I can't seem to fit
>>>>>>> "-libjars $zipfjar" in anywhere now.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>>>>>
>>>>>>>> Rob,
>>>>>>>> You need to tell Hadoop which jars you need it to ship to the worker
>>>>>>>> nodes. You include datagen.jar, etc., on the classpath, which makes
>>>>>>>> them discoverable locally, but you aren't telling Hadoop to ship them.
>>>>>>>> You want to list them, comma-separated, in the -libjars parameter.
>>>>>>>>
>>>>>>>> -D
>>>>>>>>
>>>>>>>> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi there.
>>>>>>>>>
>>>>>>>>> I am well underway with comparing Pig, Hive, JAQL etc...
>>>>>>>>>
>>>>>>>>> The DataGenerator is proving a valuable tool for me. Thanks for that.
>>>>>>>>>
>>>>>>>>> I have one query.
>>>>>>>>> I am able to use it in local mode, no problem, and some
>>>>>>>>> experiments are complete.
>>>>>>>>>
>>>>>>>>> However, I cannot seem to use it in MapReduce mode on the cluster.
>>>>>>>>> These are the contents of my file "generateData":
>>>>>>>>> ------------------
>>>>>>>>> export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>>>>>>>>> export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>>>>>>>>> export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>>>>>>>>> export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>>>>>>>>> export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>>>>>>>>> /usr/lib/hadoop/bin/hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows 10000000 -f words.dat s:8:50:z:0
>>>>>>>>> ------------------
>>>>>>>>>
>>>>>>>>> The error I receive when trying to run it with the "-m 1" option (in
>>>>>>>>> cluster mode):
>>>>>>>>> Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>>>>>>>>>
>>>>>>>>> So in local mode it successfully picks up the jar file
>>>>>>>>> sdsuLibJKD14.jar, but when running in cluster mode the jar is not
>>>>>>>>> found on the classpath?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Rob Stewart
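
For anyone scripting this, the two argument orders the thread settles on can be sketched as a small shell helper. This is only an illustration: the function name datagen_cmd and its parameter order are hypothetical, it prints the command line rather than running Hadoop, and it assumes the 0.18/0.20 split reported in this thread holds. Hadoop 0.19 is not discussed here, so the sketch rejects it rather than guessing.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): prints the DataGenerator
# command line with -libjars placed where the given Hadoop version expects it,
# as reported in this thread.
# Usage: datagen_cmd <hadoop_version> <datagen_jar> <zipf_jar> <conf_file> [options] colspec...
datagen_cmd() {
    dg_version=$1
    dg_datagenjar=$2
    dg_zipfjar=$3
    dg_conf=$4
    shift 4
    dg_class=org.apache.pig.test.utils.datagen.DataGenerator
    case $dg_version in
        0.20|0.20.*)
            # Hadoop 0.20: -libjars comes after the job jar and main class
            printf '%s\n' "hadoop jar $dg_datagenjar $dg_class -libjars $dg_zipfjar -conf $dg_conf $*"
            ;;
        0.1[0-8]|0.1[0-8].*)
            # Hadoop 0.18 and earlier: -libjars comes directly after "hadoop jar"
            printf '%s\n' "hadoop jar -libjars $dg_zipfjar $dg_datagenjar $dg_class -conf $dg_conf $*"
            ;;
        *)
            # Versions the thread does not cover are rejected, not guessed
            printf '%s\n' "datagen_cmd: no known -libjars order for Hadoop $dg_version" >&2
            return 2
            ;;
    esac
}

# Example: print the 0.20 form of the command that worked above
datagen_cmd 0.20.0 "\$datagenjar" "\$zipfjar" "\$conf_file" -rows 10000000 -m 3 -f words.dat s:8:50:z:0
```

Printing the command instead of executing it makes it easy to check the ordering before submitting a job to the cluster.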
