Thank you! Pig is now successfully finding the LZO libraries.

I created a pig-env.sh file in $PIG_HOME/conf (it didn't already exist), then added the line:

export PIG_OPTS="$PIG_OPTS -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64"

~Ed

On Wed, Sep 22, 2010 at 6:36 PM, Gerrit van Vuuren <gvanvuu...@specificmedia.com> wrote:

> Hi,
>
> You also need to add the java.library.path to pig opts in the following file:
> $PIG_HOME/bin/pig
>
> E.g.:
> PIG_OPTS="$PIG_OPTS -Djava.library.path=/opt/hadoop/lib/native/Linux-amd64"
>
> cheers.
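For completeness: rather than editing $PIG_HOME/bin/pig itself as Gerrit suggests, I put that line in a new $PIG_HOME/conf/pig-env.sh, which my copy of bin/pig picks up at startup. The whole file is just the one line; a minimal sketch, assuming your native libraries live where mine do (adjust the path to wherever libgplcompression.so actually sits on your machines):

    # $PIG_HOME/conf/pig-env.sh
    # Point the JVM at the directory containing libgplcompression.so and
    # libhadoop.so; Linux-amd64 is the directory name on my 64-bit install.
    export PIG_OPTS="$PIG_OPTS -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64"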
>
> ----- Original Message -----
> From: pig <hadoopn...@gmail.com>
> To: pig-user@hadoop.apache.org <pig-user@hadoop.apache.org>
> Sent: Wed Sep 22 23:25:58 2010
> Subject: Re: Does Pig 0.7 support indexed LZO files? If not, does elephant-pig work with 0.7?
>
> Hi Dmitriy,
>
> Using the REGISTER pig keyword got rid of the missing class error. Thank you!
>
> I still have the error regarding the missing lzo codec.
>
> I followed all the steps outlined by Gerrit, and LZO works without any problems when I'm using it in Java-based map-reduce programs (including outputting compressed lzo files). However, for some reason I still have the problem with Pig. I added the hadoop-kevinweil-gpl-compression.jar to my $PIG_HOME/lib directory on all nodes and on the machine I'm running pig from. The native libraries are also in the correct location in the hadoop/lib/native/Linux-amd64 folder (libgplcompression.so and libhadoop.so.1.0.0).
>
> I'm assuming that pig will pick up the JAVA_LIBRARY_PATH variable set in hadoop-env.sh. Is that correct? Thank you!
>
> ~Ed
>
> On Wed, Sep 22, 2010 at 5:44 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
>
> > By register I mean the pig register keyword.
> >
> > So, in addition to
> >
> > REGISTER elephant-bird-1.0.jar
> >
> > you should also
> >
> > REGISTER /usr/lib/elephant-pig/lib/google-collections-1.0.jar
> >
> > and possibly the rest of the jars in that directory. Might be simpler to jar them up together and just register a single jar.
> >
> > -D
> >
> > On Wed, Sep 22, 2010 at 1:47 PM, pig <hadoopn...@gmail.com> wrote:
> >
> > > I added the jars to all my nodes in /usr/lib/elephant-pig/lib
> > >
> > > I then modified hadoop-env.sh on all nodes so that it includes the entry
> > >
> > > export PIG_CLASSPATH=/usr/lib/elephant-pig/lib/*:$PIG_CLASSPATH
> > >
> > > I start up the grunt shell and first paste the line:
> > >
> > > REGISTER elephant-bird-1.0.jar
> > >
> > > This has no problems. Then I add the line:
> > >
> > > A = LOAD '/user/foo/input' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('|');
> > >
> > > At this point the following error prints to screen:
> > >
> > > --------------------
> > > [main] ERROR com.hadoop.compression.lzo.GPLNativeCodeLoader - Could not load native gpl library
> > > java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
> > > ...
> > > [main] ERROR com.hadoop.compression.lzo.LzoCodec - Cannot load native-lzo without native-hadoop
> > > --------------------
> > >
> > > No log entry is generated and the grunt shell continues to work. (LZO works fine when I run Java-based map-reduce programs.) I then add the final 2 lines of the pig script:
> > >
> > > B = LIMIT A 100;
> > > DUMP B;
> > >
> > > The program starts to execute and fails. The nodes running the mappers give the error java.lang.ClassNotFoundException: com.google.common.collect.Maps and fail. (This was the same error I was getting before in my pig log files.) The class not found exception no longer shows up in my pig log file; in its place is a more generic RuntimeException.
> > >
> > > On all nodes I also tried
> > >
> > > export PIG_CLASSPATH=/usr/lib/elephant-pig/lib:$PIG_CLASSPATH
> > >
> > > (without the *), and I also tried modifying JAVA_LIBRARY_PATH to include the location of the elephant-pig jar files.
> > >
> > > I'm using the Cloudera distro of Hadoop 0.20.2, if that might somehow be causing problems. When you said I might need to "register" the jar files, what does that mean exactly? Thanks again for all your assistance and prompt responses.
> > >
> > > ~Ed
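To save the next reader some trial and error, here is the shape of the script that got past the ClassNotFoundException for me once the lib jars were registered explicitly. This is a sketch: the '|' delimiter and the paths are from my setup, and google-collections may not be the only lib/ jar your job needs.

    REGISTER elephant-bird-1.0.jar;
    REGISTER /usr/lib/elephant-pig/lib/google-collections-1.0.jar;
    -- add REGISTER lines for the remaining jars in elephant-bird's lib/
    -- if the mappers throw further ClassNotFoundExceptions
    A = LOAD '/user/foo/input' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('|');
    B = LIMIT A 100;
    DUMP B;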
> > > On Wed, Sep 22, 2010 at 3:46 PM, pig <hadoopn...@gmail.com> wrote:
> > >
> > > > Ah, I didn't realize I needed to put the jars on all the nodes, since the error is being thrown before the pig script actually executes (it's throwing the error in the parsing stage). I assumed that since the pig script hadn't executed yet, it wasn't doing anything with the Hadoop nodes.
> > > >
> > > > I will try adding PIG_CLASSPATH to my hadoop-env.sh and will then put the jar files on all the slave nodes. Hopefully that will solve the problem.
> > > >
> > > > ~Ed
> > > >
> > > > On Wed, Sep 22, 2010 at 3:28 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > >
> > > > > try PIG_CLASSPATH
> > > > >
> > > > > Oh, and you might need to explicitly register them.. sorry, forgot. We just have them on the hadoop classpath on the nodes themselves, so we don't have to do that, but you might if you are starting fresh.
> > > > >
> > > > > -D
> > > > >
> > > > > On Wed, Sep 22, 2010 at 12:01 PM, pig <hadoopn...@gmail.com> wrote:
> > > > >
> > > > > > [foo]$ echo $CLASSPATH
> > > > > > :/usr/lib/elephant-bird/lib/*
> > > > > >
> > > > > > This has been set for both user foo and hadoop, but I still get the same error. Is this the correct environment variable to be setting?
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > > > ~Ed
> > > > > >
> > > > > > On Wed, Sep 22, 2010 at 2:46 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > > >
> > > > > > > elephant-bird/lib/* (the * is important)
> > > > > > >
> > > > > > > On Wed, Sep 22, 2010 at 11:42 AM, pig <hadoopn...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Well, I thought that would be a simple enough fix, but no luck so far.
> > > > > > > >
> > > > > > > > I've added the elephant-bird/lib directory (which I made world readable and executable) to the CLASSPATH, JAVA_LIBRARY_PATH and HADOOP_CLASSPATH as both the user running grunt and the hadoop user (sort of a shotgun approach).
> > > > > > > >
> > > > > > > > I still get the error where it complains about no gplcompression, and in the log it has an error where it can't find com.google.common.collect.Maps.
> > > > > > > >
> > > > > > > > Are these two separate problems, or is it one problem that is causing two different errors? Thank you for the help!
> > > > > > > >
> > > > > > > > ~Ed
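In hindsight, the answer to that question is yes: they were two separate problems. The com.google.common.collect.Maps error is a jar missing from the classpath (fixed by registering elephant-bird's lib/ jars), while the gplcompression error is a missing native library (fixed by the java.library.path setting above). A quick sanity check for the native half, using the paths from my cluster:

    # The directory passed via -Djava.library.path must contain both
    # libraries (your copy may hold more files than these two).
    ls /usr/lib/hadoop/lib/native/Linux-amd64
    # libgplcompression.so  libhadoop.so.1.0.0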
> > > > > > > > On Wed, Sep 22, 2010 at 1:57 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > You need the jars in elephant-bird's lib/ on your classpath to run Elephant-Bird.
> > > > > > > > >
> > > > > > > > > On Wed, Sep 22, 2010 at 10:35 AM, pig <hadoopn...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Thank you for pointing out the 0.7 branch. I'm giving the 0.7 branch a shot and have run into a problem when trying to run the following test pig script:
> > > > > > > > > >
> > > > > > > > > > REGISTER elephant-bird-1.0.jar
> > > > > > > > > > A = LOAD '/user/foo/input' USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
> > > > > > > > > > B = LIMIT A 100;
> > > > > > > > > > DUMP B;
> > > > > > > > > >
> > > > > > > > > > When I try to run this I get the following error:
> > > > > > > > > >
> > > > > > > > > > java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
> > > > > > > > > > ....
> > > > > > > > > > ERROR com.hadoop.compression.lzo.LzoCodec - Cannot load native-lzo without native-hadoop
> > > > > > > > > > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. could not instantiate 'com.twitter.elephantbird.pig.load.LzoTokenizedLoader' with arguments '[ ]'
> > > > > > > > > >
> > > > > > > > > > Looking at the log file, it gives the following:
> > > > > > > > > >
> > > > > > > > > > java.lang.RuntimeException: could not instantiate 'com.twitter.elephantbird.pig.load.LzoTokenizedLoader' with arguments '[ ]'
> > > > > > > > > > ...
> > > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > > > ...
> > > > > > > > > > Caused by: java.lang.NoClassDefFoundError: com/google/common/collect/Maps
> > > > > > > > > > ...
> > > > > > > > > > Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
> > > > > > > > > >
> > > > > > > > > > What is confusing me is that LZO compression and decompression work fine when I'm running a normal Java-based map-reduce program, so I feel as though the libraries have to be in the right place with the right settings for java.library.path. Otherwise, how would normal Java map-reduce work? Is there some other location where I need to set JAVA_LIBRARY_PATH for pig to pick it up? My understanding was that it would get this from hadoop-env.sh. Is the missing com.google.common.collect.Maps class the real problem here? Thank you for any help!
> > > > > > > > > >
> > > > > > > > > > ~Ed
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 21, 2010 at 5:43 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Ed,
> > > > > > > > > > > Elephant-bird only works with 0.6 at the moment. There's a branch for 0.7 that I haven't tested: http://github.com/hirohanin/elephant-bird/
> > > > > > > > > > > Try it, let me know if it works.
> > > > > > > > > > >
> > > > > > > > > > > -D
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 21, 2010 at 2:22 PM, pig <hadoopn...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > I have a small cluster up and running with LZO compressed files in it. I'm using the lzo compression libraries available at http://github.com/kevinweil/hadoop-lzo (thank you for maintaining this!)
> > > > > > > > > > > >
> > > > > > > > > > > > So far everything works fine when I write regular map-reduce jobs. I can read in lzo files and write out lzo files without any problem.
> > > > > > > > > > > >
> > > > > > > > > > > > I'm also using Pig 0.7, and it appears to be able to read LZO files out of the box using the default LoadFunc (PigStorage). However, I am currently testing a large LZO file (20GB) which I indexed using the LzoIndexer, and Pig does not appear to be making use of the indexes. The pig scripts that I've run so far only get 3 mappers when processing the 20GB file. My understanding was that there should be 1 map for each block (256MB blocks), so about 80 mappers for the 20GB lzo file. Does Pig 0.7 support indexed lzo files with the default load function?
> > > > > > > > > > > >
> > > > > > > > > > > > If not, I was looking at elephant-bird and noticed it is only compatible with Pig 0.6 and not 0.7+. Is that accurate? What would be the recommended solution for processing indexed lzo files using Pig 0.7?
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for any assistance!
> > > > > > > > > > > >
> > > > > > > > > > > > ~Ed
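P.S. I haven't tried it myself, but Dmitriy's earlier suggestion of jarring the dependencies up together, so that a single REGISTER covers everything, would look roughly like this (untested sketch; the lib path is from my setup and the /tmp locations are arbitrary):

    # Unpack every jar from elephant-bird's lib/ into one directory, then
    # repack the tree as a single jar. The output jar is written one level
    # up so it doesn't get swept into itself.
    mkdir /tmp/eb-combined && cd /tmp/eb-combined
    for j in /usr/lib/elephant-pig/lib/*.jar; do jar xf "$j"; done
    jar cf ../elephant-bird-with-deps.jar .

After that, a single REGISTER /tmp/elephant-bird-with-deps.jar; in grunt should pull in all the dependencies at once. (Unpacking merges the jars' META-INF directories, which is usually harmless for plain library jars.)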