Just for anyone else who might run into this problem in the future:
I found a bit of a workaround. When you generate the zebra file/dir, make
sure you use something like "order x by y parallel 20;" before you do the
store. The zebra structure will then contain 20 files, and any jobs
reading it can use at least 20 mappers (see the sketch below).
It's not perfect, so if someone finds a better way please let me know.
Using zebra this way gives me a 27% speed improvement over plain
tab-delimited text files, so even with the hack I'm happy :)
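
Something along these lines (just a sketch: the input path, relation
names, and schema are made up, not my real ones; only the output path
and the TableStorer call match what I posted below):

-- hypothetical input; the important part is the "parallel 20" on the order
raw = load '/user/dwh/screenname_raw' using PigStorage('\t')
    as (screenname_id: long, code: chararray);
-- "order ... parallel 20" runs 20 reducers, so the zebra table is
-- written as 20 files and later jobs can use at least 20 mappers
sorted = order raw by screenname_id parallel 20;
store sorted into '/user/dwh/screenname2.zebra' using
org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');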

Bennie Schut wrote:
> Hi all,
>
> I still can't get Pig to use multiple mappers when using zebra. I tried
> using lzo hoping it would help, but sadly no. The file is 14G as
> tab-delimited plain text; as zebra it is 7G with gz and 10G with lzo.
> When I use the tab-delimited file I get 216 mappers, but with zebra
> just 2, of which 1 finishes almost instantly while the other runs for
> hours. Any idea why it's not using more mappers?
>
> As an example of what I'm trying to do:
> dim1258375560540 = load '/user/dwh/screenname2.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/bennies/newvalues//chatsessions_1238624404177_small.csv' using
> PigStorage('\t') as (session_hash: chararray, email: chararray,
> screenname: chararray);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 4;
> dump tmp1258375560540;
>
>
> Thanks,
> Bennie
>
> Bennie Schut wrote:
>   
>> Another zebra-related question.
>>
>> I couldn't find a lot of documentation on zebra, but I figured you can
>> change the compression codec with syntax like this:
>> store outfile into '/user/dwh/screenname2.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by lzo');
>>
>> And, in theory, disable compression like this:
>> store outfile into '/user/dwh/screenname3.zebra' using
>> org.apache.hadoop.zebra.pig.TableStorer('compress by none');
>>
>> But it doesn't seem to accept "none" as a compressor:
>> java.io.IOException: ColumnGroup.Writer constructor failed : Partition
>> constructor failed :Encountered " <IDENTIFIER> "none "" at line 1,
>> column 13. Was expecting:
>>     <COMPRESSOR> ...
>>         at org.apache.hadoop.zebra.io.BasicTable$Writer.<init>(BasicTable.java:1116)
>>         at org.apache.hadoop.zebra.pig.TableOutputFormat.checkOutputSpecs(TableStorer.java:154)
>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>         at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
>>         at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
>>         at java.lang.Thread.run(Thread.java:619)
>>
>> I actually tried this because when I use the zebra result in further
>> processing it only uses 2 mappers instead of the 230 mappers on the
>> original file. I remember Hadoop cannot split gz files, so I figured
>> the compression might be why it uses so few mappers. Does anyone know
>> a different approach to this?
>>
>> Thanks,
>> Bennie.
>>
