Re: ChukwaRecordOutputFormat only works with ChukwaRecordPartitioner

Corbin Hoenes Thu, 22 Jul 2010 12:34:54 -0700

-getmerge seems to work. Any other suggestions on formats? I like the idea 
of making the filename look more Hadoop-like, e.g. 
MyDataType_20100720_0_35_part-00001.R? Tacking the part onto the extension 
might require more code changes; I haven't looked at that bit of code yet.
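
A minimal sketch of that naming scheme, assuming the time bucket string is 
already formatted. PartFileName and build are hypothetical names used only 
for illustration; they are not existing Chukwa code.

```java
import java.util.Locale;

public class PartFileName {
    // Hypothetical helper (not part of Chukwa): joins the data type, the
    // already-formatted time bucket, a zero-padded partition number, and
    // the file extension into a single file name.
    static String build(String dataType, String timeBucket, int partition, String ext) {
        return String.format(Locale.ROOT, "%s_%s_part-%05d.%s",
                dataType, timeBucket, partition, ext);
    }

    public static void main(String[] args) {
        // Partition 1 of the 2010-07-20 00:35 bucket.
        System.out.println(build("MyDataType", "20100720_0_35", 1, "R"));
    }
}
```

Zero-padding the partition number to five digits mirrors the part-00001 
style of default MapReduce output names.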


On Jul 21, 2010, at 10:35 AM, Eric Yang wrote:

> I think this is in the right direction.  Does this filename convention allow 
> dfs -getmerge to work on the directory?  If it does, then I am fine with it.  
> If it doesn’t, it may be good to label the output file name as 
> MyDataType_20100720_0_35.R_part0 to align with the default output name of 
> MapReduce.
> 
> Regards,
> Eric
> 
> On 7/20/10 11:48 PM, "Corbin Hoenes" <cor...@tynt.com> wrote:
> 
>> I was looking at replacing the ChukwaRecordPartitioner with a 
>> HashbasedRecordParitioner. We discussed this earlier here, and there is an 
>> issue in JIRA: https://issues.apache.org/jira/browse/CHUKWA-481
>> 
>> I patched Chukwa to allow for a pluggable partitioner and configured it 
>> to use the hash-based partitioner.  But it started failing to rename the 
>> _temporary files during the commit phase after the reduce had finished, 
>> because multiple reducers were now trying to move files with the same 
>> filename into /chukwa/demuxProcessing/mrOutput.  So I added a bit more to 
>> the filename in ChukwaRecordOutputFormat:
>> 
>> private String getParition(ChukwaRecordKey key, ChukwaRecord record) {
>>   return "part" + paritioner.getPartition(key, record,
>>       conf.getInt("mapred.reduce.tasks", 0));
>> }
>> 
>> @Override
>> protected String generateFileNameForKeyValue(ChukwaRecordKey key,
>>     ChukwaRecord record, String name) {
>>   String output = RecordUtil.getClusterName(record) + "/"
>>       + key.getReduceType() + "/" + key.getReduceType() + "_"
>>       + getParition(key, record)
>>       + Util.generateTimeOutput(record.getTime());
>>   return output;
>> }
>> 
>> So my filenames are now 
>> /chukwa/demuxProcessing/mrOutput/MyCluster/MyDataType/MyDataType_part0_20100720_0_35.R.evt
>> 
>> I just added the part to the filename, and now when PostProcessorManager 
>> picks up that directory it can mv each file into the correct time bucket in 
>> /chukwa/repos (it increments a count for each file in that directory).
>> 
>> Is there a better solution? I am not sure how general-purpose my solution is.
>>
