pig-user  

Re: zebra: unknown compressor none

Ashutosh Chauhan
Wed, 18 Nov 2009 18:13:37 -0800

Hi Bennie,

So you are using Zebra for its out-of-the-box serialization and
compression support. Thanks for the explanation.

Ashutosh
On Wed, Nov 18, 2009 at 10:43, Bennie Schut <bsc...@ebuddy.com> wrote:
> Hi Ashutosh,
>
> There are only 2 columns in both the original file and the zebra file,
> and this is how I use it:
>
> The screenname file contains 2 fields, a number and a string, and is 14G
> in size. After transforming it into zebra it is 10G, internally split
> into 80 files.
> The chatsession file contains many fields, both numeric and string, and
> is 155M in size.
>
> register zebra-0.6.0-dev.jar;
> dim1258375560540 = load '/user/dwh/screenname_lzo_80.zebra' using
> org.apache.hadoop.zebra.pig.TableLoader('screenname_id, code');
> fact1258375560540 = load
> '/user/dwh/chatsessions/chatsessionsmap/output/chatsessions_1258534806969_0.csv'
> using PigStorage('\t') as (session_hash: chararray, email: chararray,
> refer_url: chararray, version: chararray, protocol: chararray,
> logintype: chararray, frontendversion: chararray, remote_ip: chararray,
> country: chararray, server_id, login_date, login_time, success,
> end_date, end_time, msg_sent, avg_msg_sent_size, msg_rcv,
> avg_msg_rcv_size, num_contacts, num_groups, num_sessions, secure_login,
> timeout, has_picture, screenname: chararray, useragent :chararray,
> error_code, masterlogin: chararray, unused :int);
> tmp1258375560540 = cogroup fact1258375560540 by screenname inner,
> dim1258375560540 by code outer PARALLEL 10;
> tmp12583755605401 = filter tmp1258375560540 by IsEmpty( dim1258375560540);
> tmp12583755605402 = foreach tmp12583755605401 generate
> flatten(fact1258375560540.screenname);
> tmp12583755605403 = distinct tmp12583755605402 PARALLEL 4;
> dump tmp12583755605403;
>
>
> It's basically trying to see if there are new values for screenname in
> the chatsessions file which are not in the screenname file.
> In SQL it would be something like:
> select l.screenname
> from etl.chatsessions l
>  left join etl.screenname sn on (sn.screenname = l.screenname)
> where sn.screenname is null;
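
For readers who don't know Pig, here is a minimal Python sketch of the same cogroup-then-filter pattern. The sample values and function names are hypothetical, and it is only an in-memory illustration of the logic, not how Pig executes it.

```python
from collections import defaultdict

def cogroup(fact, dim):
    """Group both inputs by key; each key maps to (fact_bag, dim_bag)."""
    groups = defaultdict(lambda: ([], []))
    for name in fact:
        groups[name][0].append(name)
    for code in dim:
        groups[code][1].append(code)
    return groups

def new_screennames(fact, dim):
    """Distinct fact keys whose dim bag is empty (a left anti-join)."""
    groups = cogroup(fact, dim)
    # "inner" on the fact side: drop keys that have no fact rows.
    # IsEmpty(dim): keep only keys that have no matching dim rows.
    return sorted(k for k, (f, d) in groups.items() if f and not d)

# Hypothetical sample data standing in for the two files.
fact_rows = ["alice", "bob", "carol", "bob"]
dim_rows = ["alice", "bob", "dave"]
print(new_screennames(fact_rows, dim_rows))
```

Returning the group keys directly gives the same result as the flatten followed by distinct in the script above.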
>
> In SQL the screenname_id field is numeric, so it's only a couple of
> bytes per record, but in the plain text file it's many bytes per
> record. I guess that's what the whole types branch was trying to solve,
> at least internally; however, on hdfs the input and output are still
> many bytes.
> I was looking for a way to serialize the text file so these numbers
> would only be a few bytes, and then found zebra, which pretty much does
> this for you.
> My hunch was that reducing the size would gain a little performance,
> simply because of copy speed.
> You would probably get a similar result by manually using
> serialization+compression, but that's a lot of work.
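
The size argument above can be shown with a small Python sketch. The id value is made up, and the fixed-width 4-byte integer is an assumption used only for illustration; it says nothing about Zebra's actual on-disk encoding.

```python
import struct

def text_size(n):
    """Bytes needed to store n as ASCII decimal digits (no delimiter)."""
    return len(str(n).encode("ascii"))

def binary_size(n):
    """Bytes needed to store n as a big-endian 32-bit signed integer."""
    return len(struct.pack(">i", n))

# Hypothetical id value, not taken from the real screenname file.
screenname_id = 1258375560
print(text_size(screenname_id), binary_size(screenname_id))
```

The gap grows with the number of digits, and a text file also pays for a field delimiter on every record.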
>
> I'm still going to try to produce a zebra file that results in the same
> number of mappers as the original text file, to make sure the speed
> difference isn't caused by more work being done per mapper.
>