Rob,

Feel free to update the wiki with your findings. You don't have to be a committer to change the wiki.

Alan.

On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote:

Hello Dmitriy!

I have it solved; it was just a bit of trial and error based on the Hive bug
report/fix I found.

The report is indeed correct; the following works:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator
-libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 -f
/scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This makes the Pig wiki out of date for Hadoop 0.20, though it is still
correct for Hadoop 0.18 and earlier.
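
In case it helps anyone scripting this, the two orderings could be wrapped roughly as below. This is only a sketch: datagen_cmd is a made-up helper name, it prints the command line rather than running it, and it deliberately refuses versions other than the two cases discussed in this thread, since those are untested here.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): prints the DataGenerator
# command line with -libjars placed where the given Hadoop version expects it.
# Usage: datagen_cmd <hadoop_version> <datagen_jar> <zipf_jar> [args...]
datagen_cmd() {
    version=$1
    datagenjar=$2
    zipfjar=$3
    shift 3
    class=org.apache.pig.test.utils.datagen.DataGenerator
    case "$version" in
        0.1[0-8]|0.1[0-8].*)
            # 0.18 and earlier: -libjars directly after "hadoop jar"
            echo "hadoop jar -libjars $zipfjar $datagenjar $class $*"
            ;;
        0.20|0.20.*)
            # 0.20: -libjars after the jar file and the main class
            echo "hadoop jar $datagenjar $class -libjars $zipfjar $*"
            ;;
        *)
            # Anything else was not tried in this thread, so do not guess.
            echo "untested Hadoop version: $version" >&2
            return 1
            ;;
    esac
}

# Example call (prints the command only):
#   datagen_cmd 0.20.0 "$datagenjar" "$zipfjar" -conf "$conf_file" \
#       -rows 10000000 -m 3 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
```

Printing instead of executing makes it easy to eyeball the argument order before submitting a long job.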

May I propose that you update the wiki as follows:
------------------------
DataGenerator Usage:
For 0.18.0
hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf
$conf_file [options] colspec...

For 0.20.0
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars
$zipfjar -conf $conf_file [options] colspec...
--------------

Sound OK?


Rob Stewart


2010/1/14 Rob Stewart <[email protected]>

Yeah, unfortunately your suggestion does not work, and neither does the order
given on the Pig wiki. Instead, see the Hadoop wiki for -libjars usage:

hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output

So I tried this:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat
-libjars $zipfjar s:8:50:z:0

However, the DataGenerator does not accept it as one of its options:
---------
Couldn't parse the command line arguments, Found unknown option (-libjars)
at position 5
---------

I'd be happy/surprised to hear from anyone who can use the format given on the Pig wiki for the DataGenerator, in cluster mode (using -m parameter).

Any more suggestions, Dmitriy? And thanks for your help, it's mucho
appreciated!

Rob



2010/1/14 Dmitriy Ryaboy <[email protected]>

Sorry if I am not reading carefully enough -- but the bug report you
cite seems to indicate you want

hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
$zipfjar $datagenjar -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

(possibly separating zipfjar and datagenjar with commas if that patch
was applied to your version of 20)

which I don't see in the list of things you tried?

-D

On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
<[email protected]> wrote:
Hi Dmitriy,

No, I do think that there was a change in 0.20.0

See the error I get:
Exception in thread "main" java.io.IOException: Error opening job jar:
-libjars

This is what I am trying to run:
hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

The $zipfjar variable holds only one jar file. It seems that there was a
change in Hadoop 0.20.0 that no longer allows the -libjars option
immediately after "hadoop jar".
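
The error text fits that reading: "Error opening job jar: -libjars" suggests hadoop took the first argument after "jar" as the job jar. A wrapper script could catch this before calling hadoop. A small sketch, where check_job_jar is a made-up name and the check is only a heuristic:

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Hadoop): the first argument
# after "hadoop jar" should be the job jar, so reject anything that
# looks like an option.
check_job_jar() {
    case "$1" in
        -*)
            echo "expected a jar file first, got option: $1" >&2
            return 1
            ;;
        *)
            return 0
            ;;
    esac
}

# Example: check_job_jar "$datagenjar" && echo "argument order looks plausible"
```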

This is the extract from the Hive bug report I was talking about:
-------------


In hadoop-20 - the -libjars has to come after the jar file/class

Please try applying this patch to bin/ext/cli.sh

--- cli.sh  (revision 789726)
+++ cli.sh  (working copy)
@@ -10,7 +10,7 @@
   exit 3;
 fi

- exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
+ exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
}

----------------

I have also tried:
hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0

This gives the same error.



Rob

2010/1/14 Dmitriy Ryaboy <[email protected]>

I think the link you sent got malformatted, but try separating the
jars with a comma
http://issues.apache.org/jira/browse/HADOOP-4864

On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
<[email protected]> wrote:
Hi Dmitriy,

OK, well it seems that since 0.20.0 the order as specified on the Pig wiki
is no longer relevant:
hadoop jar -libjars $zipfjar $datagenjar
org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options]
colspec...

See this patch over at Hive for 0.20.0:

http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/ <
dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com >

I have tried a few combinations, but I can't seem to fit
"-libjars $zipfjar" in anywhere now.

Any ideas?

Thanks for your help.

Rob




2010/1/14 Dmitriy Ryaboy <[email protected]>

Rob,
You need to tell Hadoop which jars you need it to ship to the worker nodes.
You include datagen.jar, etc., on the classpath, which makes them
discoverable locally, but you aren't telling Hadoop to ship them.
You want to list them, comma-separated, in the -libjars parameter.
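
For example, if the jars are already in a colon-separated classpath variable, the list for -libjars could be derived from it. A rough sketch only: to_libjars is a made-up helper and the paths are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: turn a colon-separated classpath into the
# comma-separated list that -libjars expects.
to_libjars() {
    printf '%s\n' "$1" | tr ':' ','
}

# Placeholder paths, for illustration only.
classpath=/opt/pig/pig-core.jar:/opt/pig/sdsuLibJKD14.jar
echo "-libjars $(to_libjars "$classpath")"
```

This keeps a single list of jars as the source for both the local classpath and the jars shipped to the workers.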

-D

On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
<[email protected]> wrote:
Hi there.

I am well underway with comparing Pig, Hive, JAQL etc...

The DataGenerator is proving a valuable tool for me. Thanks for that.

I have one query. I am able to use it in local mode, no problem, and some
experiments are complete.

However, I cannot seem to use it in MapReduce mode on the cluster. These are
the contents of my file "generateData":
------------------
export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
/usr/lib/hadoop/bin/hadoop jar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file \
  -m 1 -rows 10000000 -f words.dat s:8:50:z:0
------------------

The error I receive when trying to run it with the "-m 1" option (in
cluster mode):
Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf

So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar,
but when running in cluster mode, the jar is not found on the classpath?


thanks.

Rob Stewart








