Re: Pig DataGenerator as a MR Job

Rob Stewart Thu, 14 Jan 2010 12:16:42 -0800

Hello Dmitriy!

I have it solved; it was just a bit of trial and error based on the Hive bug
report/fix I found.


The report is indeed correct; the following works:
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
-libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 -f
/scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This means the Pig wiki is out of date for Hadoop 0.20, although its
instructions still apply to Hadoop 0.18 and earlier.
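
To make the ordering difference concrete, here is a minimal sketch of a wrapper that only prints the command line for a given Hadoop version. The function name datagen_cmd is hypothetical (it is not part of Pig or Hadoop), and it assumes $datagenjar, $zipfjar and $conf_file are already exported as in my generateData script. Only the 0.18 and 0.20 orderings discussed in this thread are covered; anything else is rejected as untested.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): prints the DataGenerator
# command line with -libjars placed where the given Hadoop version expects it.
# Assumes $datagenjar, $zipfjar and $conf_file are set by the caller.
# Usage: datagen_cmd <hadoop_version> [DataGenerator options] colspec...
datagen_cmd() {
    version=$1
    shift
    class=org.apache.pig.test.utils.datagen.DataGenerator
    case "$version" in
        0.18*)
            # Wiki ordering: -libjars directly after "hadoop jar"
            echo "hadoop jar -libjars $zipfjar $datagenjar $class -conf $conf_file $*"
            ;;
        0.20*)
            # 0.20 ordering: -libjars after the jar file and class name
            echo "hadoop jar $datagenjar $class -libjars $zipfjar -conf $conf_file $*"
            ;;
        *)
            # Orderings for other versions were not tested in this thread
            echo "datagen_cmd: untested Hadoop version: $version" >&2
            return 1
            ;;
    esac
}
```

For example, "datagen_cmd 0.20.0 -rows 10000000 -m 3 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0" prints the same argument order as the command that worked for me above.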

May I propose that you update the wiki as follows:
------------------------
DataGenerator Usage:
For 0.18.0
> hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file
[options] colspec...

For 0.20.0
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
-libjars $zipfjar -conf $conf_file [options] colspec...
------------------------

Sound OK?


Rob Stewart


2010/1/14 Rob Stewart <[email protected]>

> Yeah, unfortunately your suggestion does not work, and neither does the
> order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
> usage:
>
> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
> mylib.jar input output
>
> So I tried this:
> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
> -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat
> -libjars $zipfjar s:8:50:z:0
>
> However, the DataGenerator does not accept it as one of its options:
> ---------
> Couldn't parse the command line arguments, Found unknown option (-libjars)
> at position 5
> ---------
>
> I'd be happy/surprised to hear from anyone who can use the format given on
> the Pig wiki for the DataGenerator, in cluster mode (using -m parameter).
>
> Any more suggestions, Dmitriy? Thanks for your help, it's much
> appreciated!
>
> Rob
>
>
>
> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>
>> Sorry if I am not reading carefully enough -- but the bug report you
>> cite seems to indicate you want
>>
>> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
>> $zipfjar $datagenjar -conf $conf_file -rows
>> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>
>> (possibly separating zipfjar and datagenjar with commas if that patch
>> was applied to your version of 20)
>>
>> which I don't see in the list of things you tried?
>>
>> -D
>>
>> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
>> <[email protected]> wrote:
>> > Hi Dmitriy,
>> >
>> > No, I do think that there was a change in 0.20.0
>> >
>> > See the error I get:
>> > Exception in thread "main" java.io.IOException: Error opening job jar:
>> > -libjars
>> >
>> > This is what I am trying to run:
>> > hadoop jar -libjars $zipfjar $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > The $zipfjar has only one jar file in this classpath. It seems that there
>> > was a change to hadoop 0.20.0, not allowing for the option -libjars
>> > immediately after "hadoop jar".
>> >
>> > This is the extract from the Hive bug report I was talking about:
>> > -------------
>> >
>> >
>> > In hadoop-20 - the -libjars has to come after the jar file/class
>> >
>> > Please try applying this patch to bin/ext/cli.sh
>> >
>> > --- cli.sh  (revision 789726)
>> > +++ cli.sh  (working copy)
>> > @@ -10,7 +10,7 @@
>> >     exit 3;
>> >   fi
>> >
>> > -  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
>> > +  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>> >  }
>> >
>> > ----------------
>> >
>> > I have also tried:
>> > hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
>> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>> >
>> > This gives the same error.
>> >
>> >
>> >
>> > Rob
>> >
>> > 2010/1/14 Dmitriy Ryaboy <[email protected]>
>> >
>> >> I think the link you sent got malformatted, but try separating the
>> >> jars with a comma
>> >> http://issues.apache.org/jira/browse/HADOOP-4864
>> >>
>> >> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
>> >> <[email protected]> wrote:
>> >> > Hi Dmitriy,
>> >> >
>> >> > OK, well it seems that since 0.20.0 the order as specified on the Pig wiki
>> >> > is no longer relevant:
>> >> > hadoop jar -libjars $zipfjar $datagenjar
>> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file
>> >> > [options] colspec...
>> >> >
>> >> > See this patch over at Hive for 0.20.0:
>> >> >
>> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<
>> >> > dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
>> >> >
>> >> > I have tried a few combinations, but I can't seem to fit "-libjars
>> >> > $zipfjar" in anywhere now.
>> >> >
>> >> > Any ideas?
>> >> >
>> >> > Thanks for your help.
>> >> >
>> >> > Rob
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2010/1/14 Dmitriy Ryaboy <[email protected]>
>> >> >
>> >> >> Rob,
>> >> >> You need to tell Hadoop which jars you need it to ship to the worker
>> >> >> nodes. You include datagen.jar, etc, on the classpath, which makes
>> >> >> them discoverable locally, but you aren't telling Hadoop to ship them.
>> >> >> You want to list them, comma-separated, in the -libjars parameter.
>> >> >>
>> >> >> -D
>> >> >>
>> >> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>> >> >> <[email protected]> wrote:
>> >> >> > Hi there.
>> >> >> >
>> >> >> > I am well underway with comparing Pig, Hive, JAQL etc...
>> >> >> >
>> >> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
>> >> >> >
>> >> >> > I have one query. I am able to use it in local mode, no problem, and some
>> >> >> > experiments are complete.
>> >> >> >
>> >> >> > However, I cannot seem to use it in MapReduce mode on the cluster. This is
>> >> >> > my file "generateData" contents:
>> >> >> > ------------------
>> >> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>> >> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>> >> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>> >> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>> >> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>> >> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar
>> >> >> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows
>> >> >> > 10000000 -f words.dat s:8:50:z:0
>> >> >> > ------------------
>> >> >> >
>> >> >> > The error I receive when trying to run it with "-m 1" option (in cluster
>> >> >> > mode):
>> >> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
>> >> >> >
>> >> >> > So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar,
>> >> >> > but when running it in cluster mode, this classpath is not found?
>> >> >> >
>> >> >> >
>> >> >> > thanks.
>> >> >> >
>> >> >> > Rob Stewart
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
>
