One challenge you'd face when trying to have Hive read Chukwa sequence files is that the locations where Chukwa generates files change over time.

Chukwa creates 5-minute roll-ups in directories named after the date/hour/5-minute of the time interval. At the end of an hour, Chukwa combines all of the 5-minute chunks into an hourly chunk under date/hour, and at the end of the day they'd be under date/. Hive has the ability to use the files in a given directory as an external table, but I don't think it can handle a directory with a tree of nodes beneath it that come and go over time, with each node being a different combination of day/hour/minute partitions. *I think* the Hive metastore needs to be updated whenever partitions are added or removed. This type of functionality would be very cool though, since you could query Hive without worrying about where your Chukwa data is in the lifecycle of its roll-up. I'd be curious to hear from someone with a deeper understanding of Hive, to see if my assessment of the limitations is correct.
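Just to make that concrete, I'd expect something like the sketch below to have to run every time a roll-up directory appears or moves. It is untested, and the chukwa_records table, the (day, hour) partition columns, the JDBC URL, and the driver class are all placeholders/assumptions on my part, not anything Chukwa or Hive set up for you:

// Untested sketch: point a Hive partition at the directory Chukwa just
// rolled data into. Assumes a partitioned external table already exists
// and that the Hive JDBC driver is on the classpath; all names are made up.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RegisterRollupPartition {
  public static void main(String[] args) throws Exception {
    String day = args[0];       // e.g. "20100223"
    String hour = args[1];      // e.g. "14"
    String location = args[2];  // HDFS dir the roll-up just landed in

    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // The old Hive JDBC examples run statements through executeQuery();
    // re-pointing (or dropping and re-adding) this partition is what every
    // hourly and daily roll-up would force you to do again.
    stmt.executeQuery("ALTER TABLE chukwa_records ADD PARTITION (day='" + day
        + "', hour='" + hour + "') LOCATION '" + location + "'");
    con.close();
  }
}

And when the 5-minute chunks get merged into the hourly file, the old partitions presumably have to be dropped or re-pointed as well, which is exactly the bookkeeping I'm worried about.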
thanks,
Bill

On Tue, Feb 23, 2010 at 10:53 AM, Eric Yang <ey...@yahoo-inc.com> wrote:
> I have not studied Hive in depth. Jerome said he has done this, perhaps he
> could share his experience.
>
> Regards,
> Eric
>
>
> On 2/23/10 9:46 AM, "Oded Rosen" <o...@legolas-media.com> wrote:
>
> > Thanks Eric,
> > I have managed to write my own processor and to get the output as
> > ChukwaRecords with our own customized fields in them.
> > Now I get to the part where I try to load this output into Hive (or actually
> > use the output dir, /repos, as the data directory of a Hive table).
> > At this stage I need to let Hive recognize the ChukwaRecordKey + ChukwaRecord
> > SerDes, so I need your help with that.
> >
> > I've seen that integration with Pig is pretty straightforward for Chukwa (using
> > Chukwa-Pig.jar), but our idea is to automate the whole process straight into a
> > table, and with Hive you can just define a directory as a Hive table input. If
> > we could get the data in a form that Hive can recognize, we will not need
> > another stage after the Demux.
> >
> > Can you think of a way to do this?
> >
> > Thanks,
> >
> >
> > On Mon, Feb 22, 2010 at 7:31 PM, Eric Yang <ey...@yahoo-inc.com> wrote:
> >> Hi Oded,
> >>
> >> If you are using the code from TRUNK, the instructions are:
> >>
> >> - Package your mapper and reducer classes, and put them in a jar file.
> >> - Upload the parser jar file to hdfs://host:port/chukwa/demux
> >> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml, adding a new record type
> >>   that references your class names in the Demux aliases section.
> >>
> >> If you are using Chukwa 0.3.0, the instructions are:
> >>
> >> - Package your mapper and reducer classes into chukwa-core-0.3.0.jar
> >> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml, adding a new record type
> >>   that references your class names in the Demux aliases section.
> >>
> >> Hope this helps.
> >>
> >> Regards,
> >> Eric
> >>
> >> On 2/22/10 7:28 AM, "Oded Rosen" <o...@legolas-media.com> wrote:
> >>
> >>> I have just sent this mail to Ari, but it is probably wise to share it with
> >>> all of you:
> >>>
> >>> Hello Ari,
> >>> I'm Oded Rosen, with the Legolas Media R&D team.
> >>> We would like to use Chukwa to pass data from our real-time servers into our
> >>> Hadoop cluster. The dataflow already reaches several GB/day, and we are about
> >>> to extend this in the near future.
> >>> Our main aim is to process raw data (in the form of
> >>> fieldname1=value1<tab>fieldname2=value2....\n) into a format that fits
> >>> straight into Hive, for later processing.
> >>>
> >>> We are already running a DirTailingAdaptor on our input directory, and receive
> >>> the collected data in the chukwa/logs dir.
> >>> Now we would like to write our own Demux processor, in order to process the
> >>> sink data, get only the fields we need from it, format the data, and write it
> >>> to the output directory, which will be defined as the input directory of a
> >>> Hive table.
> >>>
> >>> We have already written mapper/reducer classes that know how to extract the
> >>> wanted fields from the raw data and apply the needed formats.
> >>> We want to set up a Demux processor with these classes as the map/reduce
> >>> classes, but we could not find any documentation about how to do it.
> >>> All we could do until now is run the default demux, which just copies the
> >>> data into the output directory.
> >>> We will appreciate any help you can offer us.
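For anyone else trying to follow Eric's steps quoted above, below is a rough, untested sketch of what such a map-side parser might look like. It assumes the AbstractProcessor hook used by the parsers that ship with the Chukwa demux; the class name and the "TabDelimited" record type are invented for the example, and the jar plus the chukwa-demux-conf.xml alias entry still have to be set up exactly as Eric describes.

package com.example.chukwa; // placeholder package

import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Untested sketch of a custom demux map-side processor for lines shaped like
// fieldname1=value1<tab>fieldname2=value2...; verify the parse()/
// buildGenericRecord() signatures against the Chukwa version you run.
public class TabDelimitedProcessor extends AbstractProcessor {

  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
      Reporter reporter) throws Throwable {

    ChukwaRecord record = new ChukwaRecord();
    long time = System.currentTimeMillis(); // or parse a timestamp field instead

    // In the parsers shipped with Chukwa, buildGenericRecord() also fills in
    // the inherited "key" used below.
    buildGenericRecord(record, recordEntry, time, "TabDelimited");

    // Split the tab-delimited name=value pairs into ChukwaRecord fields.
    for (String pair : recordEntry.split("\t")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        record.add(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    output.collect(key, record);
  }
}

The reduce side and the exact signatures may differ between TRUNK and 0.3.0, so treat this only as a starting point.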