Rob,
Feel free to update the wiki with your findings. You don't have to be
a committer to change the wiki.
Alan.
On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote:
Hello Dmitriy!
I have it solved; it just took a bit of trial and error based on the
Hive bug report/fix I found.
The report is indeed correct, the following works:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator \
  -libjars $zipfjar -conf $conf_file -rows 10000000 -m 3 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
This puts the Pig wiki out of date for Hadoop 0.20, although the order
it gives is still relevant for Hadoop 0.18 and earlier.
May I propose that you update the wiki as such:
------------------------
DataGenerator Usage:
For 0.18.0
hadoop jar -libjars $zipfjar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file [options] colspec...
For 0.20.0
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator \
  -libjars $zipfjar -conf $conf_file [options] colspec...
--------------
Sound OK?
Rob Stewart
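The two argument orders proposed above could be wrapped in a small shell
function. This is only a sketch: datagen_cmd is a hypothetical helper, not
part of Pig or Hadoop. It assumes $zipfjar, $datagenjar and $conf_file are
set as in the thread, and it only prints the command instead of running it.
Versions other than 0.18.x and 0.20.x are rejected because the thread does
not say which order they need.

```shell
# Hypothetical helper (not part of Pig or Hadoop): print the DataGenerator
# command line in the argument order reported to work for a Hadoop version.
# Assumes $zipfjar, $datagenjar and $conf_file are already set.
# Usage: datagen_cmd VERSION [options] colspec...
datagen_cmd() {
  local version="$1"
  shift
  local main=org.apache.pig.test.utils.datagen.DataGenerator
  case "$version" in
    0.18*)
      # Older layout: -libjars directly after "hadoop jar".
      echo "hadoop jar -libjars $zipfjar $datagenjar $main -conf $conf_file $*"
      ;;
    0.20*)
      # Newer layout: -libjars after the job jar and the main class.
      echo "hadoop jar $datagenjar $main -libjars $zipfjar -conf $conf_file $*"
      ;;
    *)
      echo "datagen_cmd: no known argument order for Hadoop $version" >&2
      return 1
      ;;
  esac
}
```

For example, datagen_cmd 0.20.0 -rows 10000000 -m 3 -f words.dat s:8:50:z:0
prints a line that can be checked by eye and then run by hand.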
2010/1/14 Rob Stewart <[email protected]>
Yeah, unfortunately your suggestion does not work, and neither does the
order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars
usage:
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt \
  -libjars mylib.jar input output
So I tried this:
hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file -rows 10000000 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat \
  -libjars $zipfjar s:8:50:z:0
However, the DataGenerator does not accept it as one of its options:
---------
Couldn't parse the command line arguments, Found unknown option (-libjars) at position 5
---------
I'd be happy (and surprised) to hear from anyone who can use the format
given on the Pig wiki for the DataGenerator in cluster mode (using the
-m parameter).
Any more suggestions, Dmitriy? Thanks for your help, it's much
appreciated!
Rob
2010/1/14 Dmitriy Ryaboy <[email protected]>
Sorry if I am not reading carefully enough -- but the bug report you
cite seems to indicate you want
hadoop jar org.apache.pig.test.utils.datagen.DataGenerator \
  -libjars $zipfjar $datagenjar -conf $conf_file -rows 10000000 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
(possibly separating zipfjar and datagenjar with commas if that patch
was applied to your version of 0.20),
which I don't see in the list of things you tried?
-D
On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
<[email protected]> wrote:
Hi Dmitriy,
No, I do think that there was a change in 0.20.0.
See the error I get:
Exception in thread "main" java.io.IOException: Error opening job jar: -libjars
This is what I am trying to run:
hadoop jar -libjars $zipfjar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file -rows 10000000 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
Note that $zipfjar contains only one jar file. It seems that Hadoop
0.20.0 changed so that the -libjars option is no longer allowed
immediately after "hadoop jar".
This is the extract from the Hive bug report I was talking about:
-------------
In hadoop-20 - the -libjars has to come after the jar file/class
Please try applying this patch to bin/ext/cli.sh
--- cli.sh (revision 789726)
+++ cli.sh (working copy)
@@ -10,7 +10,7 @@
exit 3;
fi
- exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
+ exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
}
----------------
I have also tried:
hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file -rows 10000000 \
  -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
This gives the same error.
Rob
2010/1/14 Dmitriy Ryaboy <[email protected]>
I think the link you sent got malformatted, but try separating the
jars with a comma
http://issues.apache.org/jira/browse/HADOOP-4864
On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
<[email protected]> wrote:
Hi Dmitriy,
OK, well it seems that since 0.20.0 the order as specified on the Pig
wiki is no longer relevant:
hadoop jar -libjars $zipfjar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file [options] colspec...
See this patch over at Hive for 0.20.0:
http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
I have tried a few combinations, but I can't seem to fit
"-libjars $zipfjar" in anywhere now.
Any ideas?
Thanks for your help.
Rob
2010/1/14 Dmitriy Ryaboy <[email protected]>
Rob,
You need to tell Hadoop which jars you need it to ship to the worker
nodes. You include datagen.jar, etc., on the classpath, which makes
them discoverable locally, but you aren't telling Hadoop to ship them.
You want to list them, comma-separated, in the -libjars parameter.
-D
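Since HADOOP_CLASSPATH is colon-separated while -libjars takes a
comma-separated list, the conversion can be sketched as below.
classpath_to_libjars is a hypothetical helper name, not a Hadoop command,
and the sketch assumes no jar path itself contains a colon or a comma.

```shell
# Hypothetical helper (not a Hadoop command): turn a colon-separated
# classpath into the comma-separated list that -libjars expects.
# Assumes no individual path contains ':' or ','.
classpath_to_libjars() {
  printf '%s\n' "$1" | tr ':' ','
}
```

For example, classpath_to_libjars "$pigjar:$zipfjar" yields a value that
can be passed as the argument to -libjars.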
On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
<[email protected]> wrote:
Hi there.
I am well underway with comparing Pig, Hive, JAQL, etc.
The DataGenerator is proving a valuable tool for me. Thanks for that.
I have one query. I am able to use it in local mode with no problem,
and some experiments are complete.
However, I cannot seem to use it in MapReduce mode on the cluster.
These are the contents of my "generateData" file:
------------------
export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
/usr/lib/hadoop/bin/hadoop jar $datagenjar \
  org.apache.pig.test.utils.datagen.DataGenerator \
  -conf $conf_file -m 1 -rows 10000000 -f words.dat s:8:50:z:0
------------------
The error I receive when trying to run it with the "-m 1" option (in
cluster mode):
Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
So in local mode it successfully picks up the jar file
sdsuLibJKD14.jar, but when running in cluster mode the class from that
jar is not found?
Thanks.
Rob Stewart