One challenge you'd face when trying to have Hive read Chukwa sequence files is that the locations where Chukwa generates files change over time.

Chukwa creates 5-minute roll-ups in directories named after the date/hour/5-minute of the time interval. At the end of an hour, Chukwa combines all of the 5-minute chunks into an hourly chunk under date/hour, and at the end of the day they'd be under date/. Hive has the ability to use the files in a given directory as an external table, but I don't think it can handle a directory with a tree of nodes beneath it that come and go over time, with each node being a different combination of day/hour/minute partitions. *I think* the Hive metastore needs to be updated whenever partitions are added or removed. This type of functionality would be very cool though, since you could query Hive without worrying about where your Chukwa data is in the lifecycle of its roll-up. I'd be curious to hear from someone with a deeper understanding of Hive, to see if my assessment of the limitations is correct.
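Just to make that concrete, I'd expect something like the sketch below to have to run every time a roll-up directory appears or moves. It is untested, and the chukwa_records table, the (day, hour) partition columns, the JDBC URL, and the driver class are all placeholders/assumptions on my part, not anything Chukwa or Hive set up for you:

// Untested sketch: point a Hive partition at the directory Chukwa just
// rolled data into. Assumes a partitioned external table already exists
// and that the Hive JDBC driver is on the classpath; all names are made up.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RegisterRollupPartition {
  public static void main(String[] args) throws Exception {
    String day = args[0];       // e.g. "20100223"
    String hour = args[1];      // e.g. "14"
    String location = args[2];  // HDFS dir the roll-up just landed in

    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // The old Hive JDBC examples run statements through executeQuery();
    // re-pointing (or dropping and re-adding) this partition is what every
    // hourly and daily roll-up would force you to do again.
    stmt.executeQuery("ALTER TABLE chukwa_records ADD PARTITION (day='" + day
        + "', hour='" + hour + "') LOCATION '" + location + "'");
    con.close();
  }
}

And when the 5-minute chunks get merged into the hourly file, the old partitions presumably have to be dropped or re-pointed as well, which is exactly the bookkeeping I'm worried about.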
thanks,
Bill

On Tue, Feb 23, 2010 at 10:53 AM, Eric Yang <ey...@yahoo-inc.com> wrote:
> I have not studied Hive in depth. Jerome said he has done this, perhaps he
> could share his experience.
>
> Regards,
> Eric
>
>
> On 2/23/10 9:46 AM, "Oded Rosen" <o...@legolas-media.com> wrote:
>
> > Thanks Eric,
> > I have managed to write my own processor and to get the output as
> > ChukwaRecords with our own customized fields in them.
> > Now I get to the part where I try to load this output into Hive (or actually
> > use the output dir, /repos, as the data directory of a Hive table).
> > At this stage I need to let Hive recognize the ChukwaRecordKey + ChukwaRecord
> > SerDes, so I need your help with that.
> >
> > I've seen that integration with Pig is pretty straightforward for Chukwa (using
> > Chukwa-Pig.jar), but our idea is to automate the whole process straight into a
> > table, and with Hive you can just define a directory as a Hive table input. If
> > we could get the data in a form that Hive can recognize, we will not need
> > another stage after the Demux.
> >
> > Can you think of a way to do this?
> >
> > Thanks,
> >
> >
> > On Mon, Feb 22, 2010 at 7:31 PM, Eric Yang <ey...@yahoo-inc.com> wrote:
> >> Hi Oded,
> >>
> >> If you are using the code from TRUNK, the instructions are:
> >>
> >> - Package your mapper and reducer classes, and put them in a jar file.
> >> - Upload the parser jar file to hdfs://host:port/chukwa/demux
> >> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml, adding a new record type
> >>   that references your class names in the Demux aliases section.
> >>
> >> If you are using Chukwa 0.3.0, the instructions are:
> >>
> >> - Package your mapper and reducer classes into chukwa-core-0.3.0.jar
> >> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml, adding a new record type
> >>   that references your class names in the Demux aliases section.
> >>
> >> Hope this helps.
> >>
> >> Regards,
> >> Eric
> >>
> >> On 2/22/10 7:28 AM, "Oded Rosen" <o...@legolas-media.com> wrote:
> >>
> >>> I have just sent this mail to Ari, but it is probably wise to share it with
> >>> all of you:
> >>>
> >>> Hello Ari,
> >>> I'm Oded Rosen, with the Legolas Media R&D team.
> >>> We would like to use Chukwa to pass data from our real-time servers into our
> >>> Hadoop cluster. The dataflow already reaches several GB/day, and we are about
> >>> to extend this in the near future.
> >>> Our main aim is to process raw data (in the form of
> >>> fieldname1=value1<tab>fieldname2=value2....\n) into a format that fits
> >>> straight into Hive, for later processing.
> >>>
> >>> We are already running a DirTailingAdaptor on our input directory, and receive
> >>> the collected data in the chukwa/logs dir.
> >>> Now we would like to write our own Demux processor, in order to process the
> >>> sink data, get only the fields we need from it, format the data, and write it
> >>> to the output directory, which will be defined as the input directory of a
> >>> Hive table.
> >>>
> >>> We have already written mapper/reducer classes that know how to extract the
> >>> wanted fields from the raw data and apply the needed formats.
> >>> We want to set up a Demux processor with these classes as the map/reduce
> >>> classes, but we could not find any documentation about how to do it.
> >>> All we could do until now is run the default demux, which just copies the
> >>> data into the output directory.
> >>> We will appreciate any help you can offer us.
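For anyone else trying to follow Eric's steps quoted above, below is a rough, untested sketch of what such a map-side parser might look like. It assumes the AbstractProcessor hook used by the parsers that ship with the Chukwa demux; the class name and the "TabDelimited" record type are invented for the example, and the jar plus the chukwa-demux-conf.xml alias entry still have to be set up exactly as Eric describes.

package com.example.chukwa; // placeholder package

import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch of a custom demux map-side processor for lines shaped like
// fieldname1=value1<tab>fieldname2=value2...; verify the parse()/
// buildGenericRecord() signatures against the Chukwa version you run.
public class TabDelimitedProcessor extends AbstractProcessor {

  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
      Reporter reporter) throws Throwable {

    ChukwaRecord record = new ChukwaRecord();
    long time = System.currentTimeMillis(); // or parse a timestamp field instead

    // In the parsers shipped with Chukwa, buildGenericRecord() also fills in
    // the inherited "key" used below.
    buildGenericRecord(record, recordEntry, time, "TabDelimited");

    // Split the tab-delimited name=value pairs into ChukwaRecord fields.
    for (String pair : recordEntry.split("\t")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        record.add(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    output.collect(key, record);
  }
}

The reduce side and the exact signatures may differ between TRUNK and 0.3.0, so treat this only as a starting point.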