Hi,

The LZO codec needs the native libraries installed on all Hadoop nodes, plus 
the Java-to-C bindings.

To make LZO work, do the following steps:
1. Download Kevin's branch of hadoop-lzo from GitHub.
2. Compile it to get a hadoop-lzo*.jar and the compiled native libraries. 
You'll need to compile this on a 64-bit machine if your cluster uses 64-bit.
3. Copy the native libs to $HADOOP_HOME/lib/native/Linux-amd64 on all your 
Hadoop nodes, plus the client that you're running Pig from.
4. Copy the hadoop-lzo jar to $HADOOP_HOME/lib and $PIG_HOME/lib on all nodes 
and the client.
5. Configure Hadoop (and Pig) on all nodes plus the client so that the 
java.library.path property points to the LZO native libraries, e.g.
-Djava.library.path=/opt/hadoop/lib/native/Linux-amd64
6. Add the LZO codec class name to the codecs property in the Hadoop 
configuration.
7. Install the lzo system library on all Hadoop nodes, e.g. on CentOS: 
yum install lzo
8. Restart the cluster.
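As a rough sketch, the build-and-copy part of the steps above might look like this on a CentOS box (the ant targets and build paths are taken from the hadoop-lzo README; adjust for your own layout):

```shell
# Install the lzo system library first (needed to build the native bindings)
yum install lzo lzo-devel

# Fetch and build hadoop-lzo; produces the jar plus the native libraries.
# Build on a 64-bit machine if the cluster is 64-bit.
git clone http://github.com/kevinweil/hadoop-lzo
cd hadoop-lzo
ant compile-native tar

# Copy the native libs and the jar to every node and the Pig client
cp build/native/Linux-amd64-64/lib/* $HADOOP_HOME/lib/native/Linux-amd64/
cp build/hadoop-lzo-*.jar $HADOOP_HOME/lib/
cp build/hadoop-lzo-*.jar $PIG_HOME/lib/
```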

The above steps give you the LzoCodec fully configured and available on 
Hadoop, with Java knowing where to find the LZO native libs.
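For steps 5 and 6, the configuration might look roughly like this (property names as documented in the hadoop-lzo README for Hadoop 0.20; treat the paths as examples):

```shell
# hadoop-env.sh (all nodes + client): let the JVM find libgplcompression
export JAVA_LIBRARY_PATH=/opt/hadoop/lib/native/Linux-amd64
# or pass it via mapred.child.java.opts / HADOOP_OPTS:
#   -Djava.library.path=/opt/hadoop/lib/native/Linux-amd64
```

```xml
<!-- core-site.xml: register the LZO codec classes -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```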

Hope this helps,



----- Original Message -----
From: pig <hadoopn...@gmail.com>
To: pig-user@hadoop.apache.org <pig-user@hadoop.apache.org>
Sent: Wed Sep 22 21:47:23 2010
Subject: Re: Does Pig 0.7 support indexed LZO files? If not, does elephant-pig 
work with 0.7?

I added the jars to all my nodes in /usr/lib/elephant-pig/lib

I then modified hadoop-env.sh for all nodes so that it includes the entry

     export PIG_CLASSPATH=/usr/lib/elephant-pig/lib/*:$PIG_CLASSPATH

I start up the grunt shell and first paste the line:

     REGISTER elephant-bird-1.0.jar

This has no problems.  Then I add the line:

     A = LOAD '/user/foo/input' USING
com.twitter.elephantbird.pig.load.LzoTokenizedLoader('|');

At this point the following error prints to screen:

--------------------
[main] ERROR com.hadoop.compression.lzo.GPLNativeCodeLoader - Could not load
native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
...
[main] ERROR com.hadoop.compression.lzo.LzoCodec - Cannot load native-lzo
without native-hadoop
--------------------

No log entry is generated and the grunt shell continues to work.  (LZO works
fine when I run Java-based map-reduce programs.)  I then add the final two
lines of the pig script:

     B=LIMIT A 100;
     DUMP B;

The program starts to execute and then fails.  The nodes running the mappers
give the error java.lang.ClassNotFoundException: com.google.common.collect.Maps.
(This was the same error I was getting before in my pig log files.)  The
class-not-found exception no longer shows up in my pig log file; in its place
is a more generic RuntimeException.

On all nodes I also tried

     export PIG_CLASSPATH=/usr/lib/elephant-pig/lib:$PIG_CLASSPATH

(without the *)

and I also tried modifying JAVA_LIBRARY_PATH to include the location of the
elephant-pig jar files.

I'm using the Cloudera distro of Hadoop 0.20.2, if that might somehow be
causing problems.  When you said I might need to "register" the jar files,
what does that mean exactly?  Thanks again for all your assistance and prompt
responses.

~Ed

On Wed, Sep 22, 2010 at 3:46 PM, pig <hadoopn...@gmail.com> wrote:

> Ah,
>
> I didn't realize I needed to put the jars on all the nodes since the error is
> being thrown before the pig script actually executes (it's throwing the
> error in the parsing stage).  I assumed since the pig script hasn't executed
> yet it wasn't doing anything with the Hadoop nodes.
>
> I will try adding PIG_CLASSPATH to my hadoop-env.sh and will then put the
> jar files on all the slave nodes.  Hopefully that will solve the problem.
>
> ~Ed
>
>
> On Wed, Sep 22, 2010 at 3:28 PM, Dmitriy Ryaboy <dvrya...@gmail.com>wrote:
>
>> try PIG_CLASSPATH
>>
>> Oh and you might need to explicitly register them.. sorry, forgot. We just
>> have them on the hadoop classpath on the nodes themselves, so we don't
>> have
>> to do that, but you might if you are starting fresh.
>>
>> -D
>>
>> On Wed, Sep 22, 2010 at 12:01 PM, pig <hadoopn...@gmail.com> wrote:
>>
>> > [foo]$ echo $CLASSPATH
>> > :/usr/lib/elephant-bird/lib/*
>> >
>> > This has been set for both user foo and hadoop but I still get the same
>> > error.  Is this the correct environment variable to be setting?
>> >
>> > Thank you!
>> >
>> > ~Ed
>> >
>> >
>> > On Wed, Sep 22, 2010 at 2:46 PM, Dmitriy Ryaboy <dvrya...@gmail.com>
>> > wrote:
>> >
>> > > elephant-bird/lib/* (the * is important)
>> > >
>> > > On Wed, Sep 22, 2010 at 11:42 AM, pig <hadoopn...@gmail.com> wrote:
>> > >
>> > > > Well I thought that would be a simple enough fix but no luck so far.
>> > > >
>> > > > I've added the elephant-bird/lib directory (which I made world
>> readable
>> > > and
>> > > > executable) to the CLASSPATH, JAVA_LIBRARY_PATH and HADOOP_CLASSPATH
>> as
>> > > > both
>> > > > the user running grunt and the hadoop user. (sort of a shotgun
>> > approach)
>> > > >
>> > > > I still get the error where it complains about no gplcompression and
>> in
>> > > the
>> > > > log it has an error where it can't find
>> com.google.common.collect.Maps
>> > > >
>> > > > Are these two separate problems or is it one problem that is causing
>> > two
>> > > > different errors?  Thank you for the help!
>> > > >
>> > > > ~Ed
>> > > >
>> > > > On Wed, Sep 22, 2010 at 1:57 PM, Dmitriy Ryaboy <dvrya...@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > You need the jars in elephant-bird's lib/ on your classpath to run
>> > > > > Elephant-Bird.
>> > > > >
>> > > > >
>> > > > > On Wed, Sep 22, 2010 at 10:35 AM, pig <hadoopn...@gmail.com>
>> wrote:
>> > > > >
>> > > > > > Thank you for pointing out the 0.7 branch.   I'm giving the 0.7
>> > > branch
>> > > > a
>> > > > > > shot and have run into a problem when trying to run the
>> following
>> > > test
>> > > > > pig
>> > > > > > script:
>> > > > > >
>> > > > > > REGISTER elephant-bird-1.0.jar
>> > > > > > A = LOAD '/user/foo/input' USING
>> > > > > > com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
>> > > > > > B = LIMIT A 100;
>> > > > > > DUMP B;
>> > > > > >
>> > > > > > When I try to run this I get the following error:
>> > > > > >
>> > > > > > java.lang.UnsatisfiedLinkError: no gplcompression in
>> > > java.library.path
>> > > > > >  ....
>> > > > > > ERROR com.hadoop.compression.lzo.LzoCodec - Cannot load
>> native-lzo
>> > > > > without
>> > > > > > native-hadoop
>> > > > > > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected
>> > > internal
>> > > > > > error.  could not instantiate
>> > > > > > 'com.twitter.elephantbird.pig.load.LzoTokenizedLoader' with
>> > arguments
>> > > > '[
>> > > > > > ]'
>> > > > > >
>> > > > > > Looking at the log file it gives the following:
>> > > > > >
>> > > > > > java.lang.RuntimeException: could not instantiate
>> > > > > > 'com.twitter.elephantbird.pig.load.LzoTokenizedLoader' with
>> > arguments
>> > > > '[
>> > > > > > ]'
>> > > > > > ...
>> > > > > > Caused by: java.lang.reflect.InvocationTargetException
>> > > > > > ...
>> > > > > > Caused by: java.lang.NoClassDefFoundError:
>> > > > com/google/common/collect/Maps
>> > > > > > ...
>> > > > > > Caused by: java.lang.ClassNotFoundException:
>> > > > > com.google.common.collect.Maps
>> > > > > >
>> > > > > > What is confusing me is that LZO compression and decompression
>> > works
>> > > > fine
>> > > > > > when I'm running a normal java based map reduce program so I
>> feel
>> > as
>> > > > > though
>> > > > > > the libraries have to be in the right place with the right
>> settings
>> > > for
>> > > > > > java.library.path.  Otherwise how would normal java map-reduce
>> > work?
>> > > >  Is
>> > > > > > there some other location I need to set JAVA_LIBRARY_PATH for
>> pig
>> > to
>> > > > pick
>> > > > > > it
>> > > > > > up?  My understanding was that it would get this from
>> > hadoop-env.sh.
>> > > >  Are
>> > > > > > the missing com.google.common.collect.Maps the real problem
>> here?
>> > > >  Thank
>> > > > > > you
>> > > > > > for any help!
>> > > > > >
>> > > > > > ~Ed
>> > > > > >
>> > > > > > On Tue, Sep 21, 2010 at 5:43 PM, Dmitriy Ryaboy <
>> > dvrya...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Ed,
>> > > > > > > Elephant-bird only works with 0.6 at the moment. There's a
>> branch
>> > > for
>> > > > > 0.7
>> > > > > > > that I haven't tested:
>> > http://github.com/hirohanin/elephant-bird/
>> > > > > > > Try it, let me know if it works.
>> > > > > > >
>> > > > > > > -D
>> > > > > > >
>> > > > > > > On Tue, Sep 21, 2010 at 2:22 PM, pig <hadoopn...@gmail.com>
>> > wrote:
>> > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > I have a small cluster up and running with LZO compressed
>> files
>> > > in
>> > > > > it.
>> > > > > > >  I'm
>> > > > > > > > using the lzo compression libraries available at
>> > > > > > > > http://github.com/kevinweil/hadoop-lzo (thank you for
>> > > maintaining
>> > > > > > this!)
>> > > > > > > >
>> > > > > > > > So far everything works fine when I write regular map-reduce
>> > > jobs.
>> > > >  I
>> > > > > > can
>> > > > > > > > read in lzo files and write out lzo files without any
>> problem.
>> > > > > > > >
>> > > > > > > > I'm also using Pig 0.7 and it appears to be able to read LZO
>> > > files
>> > > > > out
>> > > > > > of
>> > > > > > > > the box using the default LoadFunc (PigStorage).  However, I
>> am
>> > > > > > currently
>> > > > > > > > testing a large LZO file (20GB) which I indexed using the
>> > > > LzoIndexer
>> > > > > > and
>> > > > > > > > Pig
>> > > > > > > > does not appear to be making use of the indexes.  The pig
>> > scripts
>> > > > > that
>> > > > > > > I've
>> > > > > > > > run so far only have 3 mappers when processing the 20GB
>> file.
>> >  My
>> > > > > > > > understanding was that there should be 1 map for each block
>> > > (256MB
>> > > > > > > blocks)
>> > > > > > > > so about 80 mappers when processing the 20GB lzo file.  Does
>> > Pig
>> > > > 0.7
>> > > > > > > > support
>> > > > > > > > indexed lzo files with the default load function?
>> > > > > > > >
>> > > > > > > > If not, I was looking at elephant-bird and noticed it is
>> only
>> > > > > > compatible
>> > > > > > > > with Pig 0.6 and not 0.7+  Is that accurate?  What would be
>> the
>> > > > > > > recommended
>> > > > > > > > solution for processing indexed lzo files using Pig 0.7?
>> > > > > > > >
>> > > > > > > > Thank you for any assistance!
>> > > > > > > >
>> > > > > > > > ~Ed
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
