pig-user  

Newbie Questions - Enlighten me please

Andre Philippi
Mon, 14 Jul 2008 13:47:50 -0700

Hi,

I started learning and working with Pig 3 days ago and I have been able to
make it work with my EC2/HDFS environment and parse some files. Yay! :)

Now, after reading all of the docs on the wiki, I still have a few questions
that I was hoping someone here could help me with. I was also thinking that
the answers to these basic questions could be reused in a HOWTO doc for new
Pig users.

1) Is there a way to know the number of positional fields returned after a
"foo = load 'bar' using PigStorage();" ?

Ex.: $#


2) Is there a way to test the existence of a particular positional field?

Ex.: "SPLIT A INTO X IF $3; Y if !$3" or something like that :)


3) How can I further split a field after a load?

Ex.:

foo = load 'apache_logs' using PigStorage(' ') as (timestamp, ip, url, query_string);
bar = split foo using ?????('&') as (anon_id, user_action, age, sex, location);

P.S.: So far I'm hacking this by doing the following to accomplish what I
want, but I'm sure (hope) there's a better way:

log = load '/data/logs/apache.log' using PigStorage(',') as
(time,ip,url,query);
queries = foreach log generate query;
store queries into '/data/out/pig/queries' using PigStorage();
split_queries = load '/data/out/pig/queries/*' using PigStorage('&');
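
For illustration only (this is Python, not Pig): a minimal sketch of the per-record result question 3 is after, namely the query field broken into positional sub-fields on '&' without a round trip through storage. The function name `split_query` and the sample record are hypothetical.

```python
def split_query(query, sep="&"):
    """Split one query-string field into its positional sub-fields."""
    return query.split(sep)


# Hypothetical record with the (time, ip, url, query) layout used above.
record = ("t1", "10.0.0.1", "/index.html", "a1&click&30&m&nyc")
timestamp, ip, url, query = record

# Sub-fields in order: anon_id, user_action, age, sex, location.
fields = split_query(query)
print(fields)
```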


4) Apache Logs:

I saw the expandQuery UDF referenced in sigmod08.pdf, but since I haven't
been able to compile the piggybank, I'm not sure if it's really
available.

So, my question is... Has anyone written any UDFs for parsing apache logs
yet?

I'm sure a lot of Pig users would be interested in such functions...


5) Pig Streaming:

I saw the following code snippet in the PigPerformance doc:

IP = load '/pig/in';
FILTERED_DATA = filter IP by $1 > '0';
OP = stream IP through `perl -ne 'print $_;'`;
store OP into '/pig/out';

However, I'm still a bit puzzled by the usage of the stream feature.

Could someone be kind enough to give a better (more practical) example (or
share some working code) of using the streaming functionality?

Ex.:

OP = stream IP through `perl -ne 'print $_ if /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/;'`;

5.1) Would something like that work?
5.2) What would be the pros/cons of "transferring" some of the regex parsing
to perl (similar to the code above), taking into consideration Pig/MapReduce
and the distributed architecture?
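
To make the intent of the Perl one-liner concrete, here is the same line filter as a minimal Python sketch: it keeps only lines that contain something shaped like a dotted-quad IP address. It is not tested against Pig; the names `keep_line`, `filter_lines` and `main` are hypothetical, and the pattern assumes 1 to 3 digits per octet with literal dots.

```python
import re
import sys

# Assumption: an "IP" is four groups of 1-3 digits separated by literal dots.
# This does not check that each octet is <= 255.
IP_PATTERN = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")


def keep_line(line):
    """Return True if the line contains something shaped like an IP address."""
    return IP_PATTERN.search(line) is not None


def filter_lines(lines):
    """Keep only the lines that pass keep_line, preserving order."""
    return [line for line in lines if keep_line(line)]


def main(stream_in, stream_out):
    """Copy matching lines from stream_in to stream_out, one line at a time."""
    for line in stream_in:
        if keep_line(line):
            stream_out.write(line)

# As a standalone filter script, this would end with:
#     main(sys.stdin, sys.stdout)
```

Like the Perl version, such a script reads records on stdin and writes the surviving ones to stdout, which is the shape a streamed command needs to have.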

Thank you in advance for any help on the issues above,

Best Regards,

Andre Philippi