I have not studied Hive in depth. Jerome said he has done this; perhaps he can share his experience.
Regards,
Eric

On 2/23/10 9:46 AM, "Oded Rosen" <o...@legolas-media.com> wrote:

> Thanks Eric,
> I have managed to write my own processor and to get the output as
> ChukwaRecords with our own customized fields in them.
> Now I get to the part where I try to load this output into Hive (or
> actually use the output dir, /repos, as the data directory of a Hive
> table). At this stage I need to let Hive recognize the ChukwaRecordKey +
> ChukwaRecord SerDes, so I need your help with that.
>
> I've seen that integration with Pig is pretty straightforward for Chukwa
> (using Chukwa-Pig.jar), but our idea is to automate the whole process
> straight into a table, and with Hive you can just define a directory as a
> Hive table's input. If we could get the data in a form that Hive can
> recognize, we would not need another stage after the Demux.
>
> Can you think of a way to do this?
>
> Thanks,
>
> On Mon, Feb 22, 2010 at 7:31 PM, Eric Yang <ey...@yahoo-inc.com> wrote:
>> Hi Oded,
>>
>> If you are using the code from TRUNK, the instructions are:
>>
>> - Package your mapper and reducer classes, and put them in a jar file.
>> - Upload the parser jar file to hdfs://host:port/chukwa/demux.
>> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml: add a new record type
>>   referencing your class names in the Demux aliases section.
>>
>> If you are using Chukwa 0.3.0, the instructions are:
>>
>> - Package your mapper and reducer classes into chukwa-core-0.3.0.jar.
>> - Configure CHUKWA_CONF_DIR/chukwa-demux-conf.xml: add a new record type
>>   referencing your class names in the Demux aliases section.
>>
>> Hope this helps.
>>
>> Regards,
>> Eric
>>
>> On 2/22/10 7:28 AM, "Oded Rosen" <o...@legolas-media.com> wrote:
>>
>>> I have just sent this mail to Ari, but it is probably wise to share it
>>> with all of you:
>>>
>>> Hello Ari,
>>> I'm Oded Rosen, with the Legolas Media R&D team.
>>> We would like to use Chukwa to pass data from our real-time servers
>>> into our Hadoop cluster. The dataflow already reaches several GB/day,
>>> and we are about to extend this in the near future.
>>> Our main aim is to process raw data (in the form of
>>> fieldname1=value1<tab>fieldname2=value2...\n) into a format that fits
>>> straight into Hive for later processing.
>>>
>>> We are already running a DirTailingAdaptor on our input directory, and
>>> receive the collected data in the chukwa/logs dir.
>>> Now we would like to write our own Demux processor, in order to process
>>> the sink data, keep only the fields we need, format the data, and write
>>> it to the output directory, which will be defined as the input
>>> directory of a Hive table.
>>>
>>> We have already written mapper/reducer classes that know how to extract
>>> the wanted fields from the raw data and apply the needed formats.
>>> We want to set up a Demux processor with these classes as the
>>> map/reduce classes, but we could not find any documentation about how
>>> to do it.
>>> All we have managed so far is to run the default demux, which just
>>> copies the data into the output directory.
>>> We would appreciate any help you can offer.
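
For readers following the same path, the pieces discussed above can be made
concrete. Below is a minimal sketch of a custom demux processor, assuming
the TRUNK-era parser API (AbstractProcessor, its parse() hook, the inherited
buildGenericRecord() helper, and the inherited key field). The package, the
class name, and the "TsvRecord" data type are illustrative, not from the
thread.

    package com.example;  // hypothetical package

    import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
    import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TabSeparatedFieldsProcessor extends AbstractProcessor {

      @Override
      protected void parse(String recordEntry,
          OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
          Reporter reporter) throws Throwable {

        ChukwaRecord record = new ChukwaRecord();

        // buildGenericRecord() (inherited) fills in the standard record and
        // key metadata for the given body, timestamp, and reduce type. A
        // real processor would parse the event time out of the entry rather
        // than use the wall clock.
        buildGenericRecord(record, recordEntry, System.currentTimeMillis(),
            "TsvRecord");

        // Raw entries look like: fieldname1=value1<tab>fieldname2=value2...
        for (String pair : recordEntry.split("\t")) {
          int eq = pair.indexOf('=');
          if (eq > 0) {
            record.add(pair.substring(0, eq), pair.substring(eq + 1));
          }
        }

        // "key" is the ChukwaRecordKey populated by buildGenericRecord().
        output.collect(key, record);
      }
    }

The "Demux aliases section" Eric mentions is a property in
chukwa-demux-conf.xml that maps a record's data type to the processor class,
along these lines (data type and class name again being illustrative):

    <property>
      <name>TsvRecord</name>
      <value>com.example.TabSeparatedFieldsProcessor</value>
      <description>Demux parser for tab-separated field=value records</description>
    </property>

With the jar uploaded to hdfs://host:port/chukwa/demux (TRUNK) or the
classes packaged into chukwa-core-0.3.0.jar (0.3.0), demux should then route
chunks of that data type through the processor.

On the Hive side, once a SerDe for the ChukwaRecordKey/ChukwaRecord sequence
files exists, wiring the demux output up as an external table would look
roughly like this; note that the SerDe class named here is hypothetical, and
is exactly the piece that would still have to be written:

    CREATE EXTERNAL TABLE tsv_records (fieldname1 STRING, fieldname2 STRING)
    ROW FORMAT SERDE 'com.example.hive.ChukwaRecordSerDe'
    STORED AS SEQUENCEFILE
    LOCATION '/repos/TsvRecord';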