Hello Dmitriy! I have it solved. It just took a bit of trial and error based on the Hive bug report/fix I found.
The report is indeed correct; the following works:

> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This puts the Pig wiki out of date for Hadoop 0.20, though it is still correct for Hadoop 0.18 and earlier. May I propose that you update the wiki as follows:

------------------------
DataGenerator Usage:

For 0.18.0
> hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...

For 0.20.0
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file [options] colspec...
--------------

Sound OK?

Rob Stewart

2010/1/14 Rob Stewart <[email protected]>

> Yeah, unfortunately your suggestion does not work, and neither does the
> order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
> usage:
>
> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output
>
> So I tried this:
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat -libjars $zipfjar s:8:50:z:0
>
> However, the DataGenerator does not accept it as one of its options:
> ---------
> Couldn't parse the command line arguments, Found unknown option (-libjars)
> at position 5
> ---------
>
> I'd be happy/surprised to hear from anyone who can use the format given on
> the Pig wiki for the DataGenerator, in cluster mode (using -m parameter).
>
> Any more suggestions, Dmitriy? And thanks for your help, it's mucho
> appreciated!
>
> Rob
>
> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>
>> Sorry if I am not reading carefully enough -- but the bug report you
>> cite seems to indicate you want
>>
>> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar $datagenjar -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>
>> (possibly separating zipfjar and datagenjar with commas if that patch
>> was applied to your version of 20)
>>
>> which I don't see in the list of things you tried?
>>
>> -D
>>
>> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
>> <[email protected]> wrote:
>> > Hi Dmitriy,
>> >
>> > No, I do think that there was a change in 0.20.0
>> >
>> > See the error I get:
>> > Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
>> >
>> > This is what I am trying to run:
>> > hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > The $zipfjar has only one jar file in this classpath. It seems that there
>> > was a change to hadoop 0.20.0, not allowing for the option -libjars
>> > immediately after "hadoop jar".
>> >
>> > This is the extract from the Hive bug report I was talking about:
>> > -------------
>> >
>> > In hadoop-20 - the -libjars has to come after the jar file/class
>> >
>> > Please try applying this patch to bin/ext/cli.sh
>> >
>> > --- cli.sh (revision 789726)
>> > +++ cli.sh (working copy)
>> > @@ -10,7 +10,7 @@
>> > exit 3;
>> > fi
>> >
>> > - exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
>> > + exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>> > }
>> >
>> > ----------------
>> >
>> > I have also tried:
>> > hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > This gives the same error.
>> >
>> > Rob
>> >
>> > 2010/1/14 Dmitriy Ryaboy <[email protected]>
>> >
>> >> I think the link you sent got malformatted, but try separating the
>> >> jars with a comma
>> >> http://issues.apache.org/jira/browse/HADOOP-4864
>> >>
>> >> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
>> >> <[email protected]> wrote:
>> >> > Hi Dmitriy,
>> >> >
>> >> > OK, well it seems that since 0.20.0 the order as specified on the Pig
>> >> > wiki is no longer relevant:
>> >> > hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
>> >> >
>> >> > See this patch over at Hive for 0.20.0:
>> >> > http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
>> >> >
>> >> > I have tried a few combinations, but I can't seem to fit in the
>> >> > "-libjars $zipfjar" in anywhere now.
>> >> >
>> >> > Any ideas?
>> >> >
>> >> > Thanks for your help.
>> >> >
>> >> > Rob
>> >> >
>> >> > 2010/1/14 Dmitriy Ryaboy <[email protected]>
>> >> >
>> >> >> Rob,
>> >> >> You need to tell Hadoop which jars you need it to ship to the worker
>> >> >> nodes. You include datagen.jar, etc, on the classpath, which makes
>> >> >> them discoverable locally, but you aren't telling Hadoop to ship them.
>> >> >> You want to list them, comma-separated, in the -libjars parameter.
>> >> >>
>> >> >> -D
>> >> >>
>> >> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>> >> >> <[email protected]> wrote:
>> >> >> > Hi there.
>> >> >> >
>> >> >> > I am well underway with comparing Pig, Hive, JAQL etc...
>> >> >> >
>> >> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
>> >> >> >
>> >> >> > I have one query. I am able to use it in local mode, no problem, and
>> >> >> > some experiments are complete.
>> >> >> >
>> >> >> > However, I cannot seem to use it in MapReduce mode on the cluster.
>> >> >> > This is my file "generateData" contents:
>> >> >> > ------------------
>> >> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>> >> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>> >> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>> >> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>> >> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>> >> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows 10000000 -f words.dat s:8:50:z:0
>> >> >> > ------------------
>> >> >> >
>> >> >> > The error I receive when trying to run it with "-m 1" option (in
>> >> >> > cluster mode):
>> >> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>> >> >> >
>> >> >> > So in local mode, it successfully picks up the jar file
>> >> >> > sdsuLibJKD14.jar, but when running it in cluster mode, this
>> >> >> > classpath is not found?
>> >> >> >
>> >> >> > thanks.
>> >> >> >
>> >> >> > Rob Stewart
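
For anyone finding this thread later, the ordering difference discussed above could be captured in a small shell helper. This is only a rough sketch: the function name datagen_cmd and its argument order are made up for illustration, it assumes 0.x-style version strings such as "0.18.0" or "0.20.0", and it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): print the DataGenerator
# command line with -libjars placed where the given Hadoop version expects it.
# Usage: datagen_cmd <hadoop_version> <datagen_jar> <lib_jars> <conf_file> [options] colspec...
datagen_cmd() {
    version=$1; datagenjar=$2; libjars=$3; conf=$4
    shift 4
    class=org.apache.pig.test.utils.datagen.DataGenerator
    # Assumption: version looks like "0.NN.x"; take the middle field, e.g. "20".
    minor=$(echo "$version" | cut -d. -f2)
    if [ "$minor" -ge 20 ]; then
        # Hadoop 0.20: -libjars comes after the job jar and the class name.
        echo "hadoop jar $datagenjar $class -libjars $libjars -conf $conf $*"
    else
        # Hadoop 0.18 and earlier: -libjars comes directly after "hadoop jar".
        echo "hadoop jar -libjars $libjars $datagenjar $class -conf $conf $*"
    fi
}
```

Something like `datagen_cmd 0.20.0 $datagenjar $zipfjar $conf_file -rows 10000000 -m 3 -f words.dat s:8:50:z:0` would then print the 0.20-style command, which can be inspected before being run by hand.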
