Thank you very much for your detailed answer. I'm going to investigate this today. It seems that it can fit my needs:

I am currently developing an analytics system that gathers logs from cellphones and runs some computations on them using Pig. At the moment, we use HBase to:

- store our logs in big tables in a structured way (an equivalent to the ChukwaRecord)
- allow several log writers to write to the same table at the same time (an equivalent to Chukwa's agents and collectors)
- remove potentially duplicated logs (by using the HBase key)

We face several issues with HBase:

- we are unable to set up a reliable HBase cluster (HBase fails very often, and we have to restart all our region servers each time it happens)
- HBase consumes too much memory compared to a simpler Hadoop/HDFS-only solution (this requires very expensive machines for our cluster nodes)
- the HBase loader for Pig is far too slow (10x slower) compared to other Pig loaders (BinStorage or PigStorage). This forces us to first load the data from HBase and write it to regular HDFS files using Pig before computing the statistics.

So I am currently investigating alternatives to HBase that fit our needs.

On 04/02/10 19:31, Corbin Hoenes wrote:
Vincent,

Yes, there is Pig support.  I am just learning how to use it, but with some help 
from people on this list I have been successful in using Pig to analyze 
Chukwa-collected logs.

In  ${CHUKWA_HOME}/contrib/chukwa-pig/ you'll have a chukwa-pig.jar which 
contains the ChukwaStorage loader.

Once you have that you can use it like this:

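-- Register the Chukwa core jar and the jar that provides the ChukwaStorage loader
register /[your chukwa path here]/chukwa-core-0.3.0.jar
register /[your udf path here]/lib/chukwa-pig.jar

-- Each record is a timestamp plus a map of named fields
records = LOAD '$in_file' using org.apache.hadoop.chukwa.ChukwaStorage() as (ts:long, fields);
-- Pull individual values out of the map with the # operator
named_records = FOREACH records GENERATE fields#'URI' as uri, fields#'RECORD_TYPE' as type, fields#'CLIENT_IP_ADDRESS' as ip;
dump named_records;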

Chukwa files are in the sequence file format and store "ChukwaRecord" entries, which 
hold key/value pairs.  You can organize your data in the ChukwaRecords in a custom format if 
needed by using a custom Processor for your data type.  The example above shows a bunch of 
custom fields, like URI, that were parsed out of the log files via a processor.  This can 
make it a bit easier for your Pig scripts to get data out.
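
As a rough illustration of what those named fields make possible, here is a minimal sketch that counts records per URI. It assumes the named_records relation from the script above; the relation names typed, by_uri, hits and sorted are only placeholders, and the casts assume the field values are plain strings.

```pig
-- Assumes named_records was built as in the script above.
-- Values looked up from an untyped map are bytearrays, so cast them before grouping.
typed = FOREACH named_records GENERATE (chararray)uri AS uri, (chararray)type AS type, (chararray)ip AS ip;

-- Group by URI and count how many records each one has
by_uri = GROUP typed BY uri;
hits = FOREACH by_uri GENERATE group AS uri, COUNT(typed) AS n;

-- Most frequently seen URIs first
sorted = ORDER hits BY n DESC;
dump sorted;
```

The same pattern should work for any other field your processor extracts, such as grouping by ip instead of uri.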


On Feb 4, 2010, at 7:24 AM, Vincent Barat wrote:

Hello,

I'm currently evaluating Chukwa and I wonder if there is a way to use Pig to 
map/reduce the files produced by Chukwa?

If yes, is there a special PIG loader to use?

What is the format of Chukwa files? Is it just a concatenation of all logs 
sent by the agents?

Thanks for your help.