Re: Using PIG for processing Chukwa files

Vincent Barat Fri, 05 Feb 2010 00:15:03 -0800

Thank you very much for your detailed answer. I'm going to investigate this today. It seems that it could fit my needs:

I am currently developing an analytics system that gathers logs from cellphones and runs computations on them using PIG. Currently, we use HBase to:

- store our logs in big tables in a structured way (an equivalent to the ChukwaRecord)

- allow several log writers to write to the same table at the same time (an equivalent to Chukwa's agents and collectors)

- remove potentially duplicated logs (by using HBase key)

We face several issues with HBase:

- we are unable to set up a reliable HBase cluster (HBase fails very often, and we have to restart all our region servers each time it happens)

- HBase consumes too much memory compared to a simpler Hadoop/HDFS-only solution (this requires very expensive machines for our cluster nodes)

- the HBase loader for PIG is far too slow (10x slower) compared to other PIG loaders (BinStorage or PigStorage). This forces us to first load the data from HBase and write it to regular HDFS files using PIG before computing the statistics.

So I am currently investigating alternatives to HBase that fit our needs.


On 04/02/10 19:31, Corbin Hoenes wrote:

Vincent,

Yes, there is Pig support. I am just learning how to use it, but with some help
from people on this list I have been successful in using Pig to analyze
Chukwa-collected logs.

In ${CHUKWA_HOME}/contrib/chukwa-pig/ you'll find a chukwa-pig.jar, which
contains the ChukwaStorage loader.

Once you have that you can use it like this:

register /[your chukwa path here]/chukwa-core-0.3.0.jar
register /[your udf path here]/lib/chukwa-pig.jar

records = LOAD '$in_file' using org.apache.hadoop.chukwa.ChukwaStorage() as (ts:long, fields);
named_records = FOREACH records GENERATE fields#'URI' as uri, fields#'RECORD_TYPE' as type, fields#'CLIENT_IP_ADDRESS' as ip;
dump named_records;

Chukwa files are in sequence file format and store "ChukwaRecord"s, which hold
key/value pairs.  You can organize your data in the ChukwaRecords in a custom format if
needed by using a custom processor for your data type.  The example above shows a few
custom fields, such as URI, that were parsed out of the log files by a processor.  This can
make it a bit easier for your Pig scripts to get data out.
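
Once the fields are named, the usual Pig operators can be applied to them. A minimal sketch, assuming the named_records relation from the script above and that a RECORD_TYPE field is present in your records; the aliases by_type and type_counts are only illustrative:

```pig
-- Assumes named_records (uri, type, ip) as defined in the script above.
-- by_type and type_counts are illustrative aliases, not part of Chukwa.

-- Group the records by their record type.
by_type = GROUP named_records BY type;

-- Count how many records fall into each type.
type_counts = FOREACH by_type GENERATE group AS record_type, COUNT(named_records) AS n;

dump type_counts;
```

The same pattern should work for any other field your processor puts into the ChukwaRecord.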


On Feb 4, 2010, at 7:24 AM, Vincent Barat wrote:

Hello,

I'm currently evaluating Chukwa and I wonder if there is a way to use PIG to
map/reduce the files produced by Chukwa?

If yes, is there a special PIG loader to use?

What is the format of Chukwa files? Is it just a concatenation of all logs
sent by the agents?

Thanks for your help.