Bennie Schut
Tue, 17 Nov 2009 23:59:43 -0800
Hi all,

I still can't get Pig to use multiple mappers when using Zebra. I tried LZO hoping it would help, but sadly no. The input file is 14G of tab-delimited plain text; stored as Zebra it is 7G with gz and 10G with lzo. When I load the tab-delimited file I get 216 mappers, but with Zebra I get just 2 mappers, of which one finishes almost instantly and the other runs for hours. Any idea why it's not using more mappers?
As an example of what I'm trying to do:
dim1258375560540 = load '/user/dwh/screenname2.zebra'
    using org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
fact1258375560540 = load '/user/bennies/newvalues//chatsessions_1238624404177_small.csv'
    using PigStorage('\t')
    as (session_hash: chararray, email: chararray, screenname: chararray);
tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
    dim1258375560540 by code outer PARALLEL 4;
dump tmp1258375560540;
Thanks,
Bennie
Bennie Schut wrote:
> Another zebra related question.
>
> I couldn't find a lot of documentation on zebra but I figured you can
> change compression codec with a syntax like this:
> store outfile into '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>
> And in theory disable compression like this:
> store outfile into '/user/dwh/screenname3.zebra' using
> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>
> But it doesn't seem to accept "none" as a compressor:
> java.io.IOException: ColumnGroup.Writer constructor failed : Partition constructor failed :Encountered " <IDENTIFIER> "none "" at line 1, column 13.
> Was expecting:
>     <COMPRESSOR> ...
>
>     at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>     at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>     at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>     at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>     at java.lang.Thread.run(Thread.java:619)
>
> I actually tried this because when I use the Zebra result in further
> processing, it only uses 2 mappers instead of the 230 mappers on the
> original file. I remember Hadoop cannot split gz files, so I figured
> compression might be why it uses so few mappers. Does anyone know a
> different approach to this?
>
> Thanks,
> Bennie.