Problem with LzoTokenizedLoader with elephant-bird branch for Pig 0.7

pig Thu, 23 Sep 2010 06:50:02 -0700

Hello,

After getting all the errors to go away with LZO libraries not being found
and missing jar files for elephant-bird I've run into a new problem when
using the elephant-bird branch for pig 0.7


The following simple pig script works as expected

     REGISTER elephant-bird-1.0.jar
     REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
     A = load '/usr/foo/input/test_input_chars.txt';
     DUMP A;

This just dumps out the contents of the test_input_chars.txt file which is
tab delimited. The output looks like:

     (1,a,a,a,a,a,a)
     (2,b,b,b,b,b,b)
     (3,c,c,c,c,c,c)
     (4,d,d,d,d,d,d)
     (5,e,e,e,e,e,e)

I then lzop the test file to get test_input_chars.txt.lzo (I decompressed
this with lzop -d to make sure the compression worked fine and everything
looks good).
If I run the exact same script provided above on the lzo file it works
fine.  However, this file is really small and doesn't need to use indexes.
As a result, I wanted to
have LZO support that worked with indexes.  Based on this I decided to try
out the elephant-bird branch for pig 0.7 located here (
http://github.com/hirohanin/elephant-bird/) as
recommended by Dimitriy.

I created the following pig script that mirrors the above script but should
hopefully work on LZO files (including indexed ones)

     REGISTER elephant-bird-1.0.jar
     REGISTER /usr/lib/elephant-bird/lib/google-collect-1.0.jar
     A = load '/usr/foo/input/test_input_chars.txt.lzo' USING
com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t');
     DUMP A;

When I run this script which uses the LzoTokenizedLoader there is no
output.  The script appears to run without errors but there are zero Records
Written and 0 Bytes Written.

Here is the exact output:

grunt > DUMP A;
[main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader -
LzoTokenizedLoader with given delimited [     ]
[main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader -
LzoTokenizedLoader with given delimited [     ]
[main] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader -
LzoTokenizedLoader with given delimited [     ]
[main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
(Name:
Store(hdfs://master:9000/tmp/temp-2052828736/tmp-1533645117:org.apache.pig.builtin.BinStorage)
- 1-4 Operator Key: 1-4
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size before optimization: 1
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
- MR plan size after optimization: 1
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler
- Setting up single store job
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 1 map-reduce job(s) waiting for submission.
[Thread-12] WARN org.apache.hadoop.mapred.JobClient - Use
GenericOptionsParser for parsing the arguments.  Applications should
implement Tool for the same.
[Thread-12] INFO com.twitter.elephantbird.pig.load.LzoTokenizedLoader -
LzoTokenizedLoader with given delimiter [     ]
[Thread-12] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat -
Total input paths to process : 1
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 0% complete
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- HadoopJobId: job_201009101108_0151
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- More information at
http://master:50030/jobdetails.jsp?jobid=job_201009101108_0151
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 50% complete
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- 100% complete
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Succesfully stored result in
"hdfs://amb-hadoop-01:9000/tmp/temp-2052828736/tmp-1533645117
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Records written: 0
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Bytes written: 0
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Spillable Memory Manager spill count : 0
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Proactive spill count : 0
[main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
- Success!
[main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total
input paths to process: 1
[main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil -
Total input paths to process: 1
grunt >

I'm not sure if I'm doing something wrong in my use of LzoTokenizedLoader or
if there is a problem with the class itself (most likely the problem is with
my code heh)  Thank you for any help!

~Ed

Problem with LzoTokenizedLoader with elephant-bird branch for Pig 0.7

Reply via email to