Re: Pig DataGenerator as a MR Job

Rob Stewart Thu, 14 Jan 2010 16:17:41 -0800

Cheers Alan,

Done.


Rob.


2010/1/14 Alan Gates <[email protected]>

> Rob,
>
> Feel free to update the wiki with your findings.  You don't have to be a
> committer to change the wiki.
>
> Alan.
>
>
> On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote:
>
>> Hello Dmitriy!
>>
>> I have it solved; it was just a bit of trial and error based on the Hive
>> bug report/fix I found.
>>
>> The report is indeed correct, the following works:
>>
>> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
>> -libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 -f
>> /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>
>> This puts the Pig wiki out of date for Hadoop 0.20, but it is still
>> relevant for Hadoop 0.18 and earlier.
>>
>> May I propose that you update the wiki as such:
>> ------------------------
>> DataGenerator Usage:
>> For 0.18.0
>>
>> hadoop jar -libjars $zipfjar $datagenjar
>> org.apache.pig.test.utils.datagen.DataGenerator -conf
>> $conf_file [options] colspec...
>>
>> For 0.20.0
>>
>> hadoop jar $datagenjar
>> org.apache.pig.test.utils.datagen.DataGenerator -libjars
>> $zipfjar -conf $conf_file [options] colspec...
>> --------------
>>
>> Sound OK ?
>>
>>
>> Rob Stewart
>>
>>
>> 2010/1/14 Rob Stewart <[email protected]>
>>
>>> Yeah, unfortunately your suggestion does not work, and neither does the
>>> order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
>>> usage:
>>>
>>> hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars
>>> mylib.jar input output
>>>
>>> So I tried this:
>>> hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
>>> -conf $conf_file -rows 10000000 -f
>>> /scratch/tmpHDFS_files/wordsx1_skewed.dat
>>> -libjars $zipfjar s:8:50:z:0
>>>
>>> However, the DataGenerator does not accept it as one of its options:
>>> ---------
>>> Couldn't parse the command line arguments, Found unknown option
>>> (-libjars) at position 5
>>> ---------
>>>
>>> I'd be happy/surprised to hear from anyone who can use the format given
>>> on the Pig wiki for the DataGenerator in cluster mode (using the -m
>>> parameter).
>>>
>>> Any more suggestions, Dmitriy? Thanks for your help, it's mucho
>>> appreciated!
>>>
>>> Rob
>>>
>>>
>>>
>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>
>>>> Sorry if I am not reading carefully enough -- but the bug report you
>>>> cite seems to indicate you want
>>>>
>>>> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
>>>> $zipfjar $datagenjar -conf $conf_file -rows
>>>> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>
>>>> (possibly separating zipfjar and datagenjar with commas if that patch
>>>> was applied to your version of 20)
>>>>
>>>> which I don't see in the list of things you tried?
>>>>
>>>> -D
>>>>
>>>> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Dmitriy,
>>>>>
>>>>> No, I do think that there was a change in 0.20.0
>>>>>
>>>>> See the error I get:
>>>>> Exception in thread "main" java.io.IOException: Error opening job jar:
>>>>> -libjars
>>>>>
>>>>> This is what I am trying to run:
>>>>> hadoop jar -libjars $zipfjar $datagenjar
>>>>> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>>>>> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>>
>>>>> The $zipfjar has only one jar file in this classpath. It seems that
>>>>> there was a change in Hadoop 0.20.0 that no longer allows the -libjars
>>>>> option immediately after "hadoop jar".
>>>>>
>>>>> This is the extract from the Hive bug report I was talking about:
>>>>> -------------
>>>>>
>>>>>
>>>>> In hadoop-20 - the -libjars has to come after the jar file/class
>>>>>
>>>>> Please try applying this patch to bin/ext/cli.sh
>>>>>
>>>>> --- cli.sh  (revision 789726)
>>>>> +++ cli.sh  (working copy)
>>>>> @@ -10,7 +10,7 @@
>>>>>   exit 3;
>>>>>  fi
>>>>>
>>>>> -  exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
>>>>> +  exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
>>>>> }
>>>>>
>>>>> ----------------
>>>>>
>>>>> I have also tried:
>>>>> hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
>>>>> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
>>>>> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>>>>>
>>>>> This gives the same error.
>>>>>
>>>>>
>>>>>
>>>>> Rob
>>>>>
>>>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>>>
>>>>>> I think the link you sent got malformatted, but try separating the
>>>>>> jars with a comma
>>>>>> http://issues.apache.org/jira/browse/HADOOP-4864
>>>>>>
>>>>>> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Dmitriy,
>>>>>>>
>>>>>>> OK, well it seems that since 0.20.0 the order as specified on the Pig
>>>>>>> wiki is no longer relevant:
>>>>>>> hadoop jar -libjars $zipfjar $datagenjar
>>>>>>> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file
>>>>>>> [options] colspec...
>>>>>>>
>>>>>>> See this patch over at Hive for 0.20.0:
>>>>>>>
>>>>>>> http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
>>>>>>>
>>>>>>> I have tried a few combinations, but I can't seem to fit the "-libjars
>>>>>>> $zipfjar" in anywhere now.
>>>>>>>
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>>
>>>>>>> Rob
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2010/1/14 Dmitriy Ryaboy <[email protected]>
>>>>>>>
>>>>>>>> Rob,
>>>>>>>> You need to tell Hadoop which jars you need it to ship to the worker
>>>>>>>> nodes. You include datagen.jar, etc, on the classpath, which makes
>>>>>>>> them discoverable locally, but you aren't telling Hadoop to ship
>>>>>>>> them. You want to list them, comma-separated, in the -libjars
>>>>>>>> parameter.
>>>>>>>>
>>>>>>>> -D
>>>>>>>>
>>>>>>>> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi there.
>>>>>>>>>
>>>>>>>>> I am well underway with comparing Pig, Hive, JAQL etc...
>>>>>>>>>
>>>>>>>>> The DataGenerator is proving a valuable tool for me. Thanks for
>>>>>>>>> that.
>>>>>>>>>
>>>>>>>>> I have one query. I am able to use it in local mode, no problem,
>>>>>>>>> and some experiments are complete.
>>>>>>>>>
>>>>>>>>> However, I cannot seem to use it in MapReduce mode on the cluster.
>>>>>>>>> These are the contents of my file "generateData":
>>>>>>>>> ------------------
>>>>>>>>> export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
>>>>>>>>> export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
>>>>>>>>> export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
>>>>>>>>> export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
>>>>>>>>> export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
>>>>>>>>> /usr/lib/hadoop/bin/hadoop jar $datagenjar
>>>>>>>>> org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file
>>>>>>>>> -m 1 -rows 10000000 -f words.dat s:8:50:z:0
>>>>>>>>> ------------------
>>>>>>>>>
>>>>>>>>> The error I receive when trying to run it with the "-m 1" option (in
>>>>>>>>> cluster mode):
>>>>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>>>>> sdsu.algorithms.data.Zipf
>>>>>>>>>
>>>>>>>>> So in local mode, it successfully picks up the jar file
>>>>>>>>> sdsuLibJKD14.jar, but when running it in cluster mode, this
>>>>>>>>> classpath is not found?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks.
>>>>>>>>>
>>>>>>>>> Rob Stewart
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>
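
A minimal sketch of the two argument orders reported in this thread, written
as a hypothetical shell helper named datagen_cmd (it is not part of Pig or
Hadoop). It only prints the command line rather than running it, and the jar
and file names in the example call are placeholders. It assumes, per the
thread, that Hadoop 0.18 takes -libjars directly after "hadoop jar", while
Hadoop 0.20 needs it after the jar and class name but before the
DataGenerator's own options. Other versions are rejected because the thread
does not cover them.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): prints, without running,
# the DataGenerator command line in the order the thread reports per version.
# Arguments: hadoop version, datagen jar, zipf jar, conf file,
# followed by the DataGenerator options and colspecs.
datagen_cmd() {
    version=$1; datagenjar=$2; zipfjar=$3; conf_file=$4
    shift 4
    class=org.apache.pig.test.utils.datagen.DataGenerator
    case "$version" in
        0.18*)
            # Hadoop 0.18: -libjars directly after "hadoop jar".
            printf '%s\n' "hadoop jar -libjars $zipfjar $datagenjar $class -conf $conf_file $*"
            ;;
        0.20*)
            # Hadoop 0.20: -libjars after the jar and class, before other options.
            printf '%s\n' "hadoop jar $datagenjar $class -libjars $zipfjar -conf $conf_file $*"
            ;;
        *)
            # Versions not discussed in the thread are not guessed at.
            printf '%s\n' "unsupported version: $version" >&2
            return 1
            ;;
    esac
}

# Example: print the 0.20 form with placeholder file names.
datagen_cmd 0.20 MyPig.jar sdsuLibJKD14.jar hadoop-site.xml \
    -rows 10000000 -m 3 -f words.dat s:8:50:z:0
```

Printing instead of executing keeps the ordering easy to inspect before
anything is submitted to a cluster.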
