[ 
https://issues.apache.org/jira/browse/CHUKWA-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867624#action_12867624
 ] 

Bill Graham commented on CHUKWA-481:
------------------------------------

I agree that being able to configure the default partitioner, as we currently 
do with the default mapper/reducer, would be best. That way, whatever is 
decided to be the hard-coded 'reasonable default' can be overridden in 
configs. Being able to configure a partitioner per dataType isn't a use case 
for us. If we choose not to support it now, we should at least leave the 
configuration model open to support it in the future.
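A minimal sketch of the config-driven selection being discussed: look up a partitioner class name under a config key and fall back to a hard-coded default. The key name `chukwa.demux.partitioner.class` is a hypothetical example modeled on the mapper/reducer configs mentioned above, not an actual Chukwa property, and a plain Map stands in for a Hadoop Configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionerConfig {
    // Hypothetical config key, named in the style of the existing
    // default-mapper/reducer properties; not a real Chukwa property.
    static final String KEY = "chukwa.demux.partitioner.class";

    // The class named in the issue as the current default; the fully
    // qualified package name is omitted here.
    static final String DEFAULT = "ChukwaRecordPartitioner";

    // Return the configured partitioner class name, or the hard-coded
    // 'reasonable default' when the key is unset.
    static String resolve(Map<String, String> conf) {
        return conf.getOrDefault(KEY, DEFAULT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolve(conf));            // falls back to default
        conf.put(KEY, "TimeTypePartitioner");         // override in configs
        System.out.println(resolve(conf));
    }
}
```

The same lookup pattern would leave room for a future per-dataType key (e.g. a key suffixed with the data type) without changing the fallback behavior.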

> Improve demux reducer partitioning algorithm
> --------------------------------------------
>
>                 Key: CHUKWA-481
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-481
>             Project: Hadoop Chukwa
>          Issue Type: Improvement
>          Components: MR Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>
> Reducer partitioning for demux could be redefined to optimize for two 
> different use cases:
> Case #1: demux is responsible for crunching large volumes of the same data 
> type (dozens of types).  It probably makes more sense to partition the 
> reducers by time grouping + data type (extending TotalOrderPartitioner), 
> i.e. a user can have an evenly distributed workload for each reducer based 
> on a time interval.  A distributed hash table like HBase/Voldemort could be 
> the downstream system to store/cache the data for serving.  This model is 
> great for collecting fixed-time-interval logs like Hadoop metrics and 
> ExecAdaptor, which generates repetitive time-series summaries.
>  
> Case #2: demux is responsible for crunching hundreds of different data 
> types, but a small volume of each data type.  The current demux 
> implementation uses this model, where a single data type is reduced by one 
> reducer slot (ChukwaRecordPartitioner).  One drawback of this model is that 
> the data from each data type must have similar volume.  Otherwise, the 
> largest data type becomes the long tail of the MapReduce job.  Materialized 
> reports are easy to generate using this model because the single reducer 
> per data type has a view of all data from the given demux run.  This model 
> works great for many different applications that all log through the Chukwa 
> Log4j appender, e.g. web crawling, or log file indexing / viewing.
>  
> I am thinking of changing the default Chukwa demux implementation to case 
> #1 and restructuring the current demux as an Archive Organizer.  Any 
> suggestions or objections?
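The time grouping + data type partitioning of case #1 can be sketched as hashing a (time bucket, data type) pair so records in the same interval and type land on one reducer while the load spreads evenly across reducers. The class name and the 5-minute bucket size below are illustrative assumptions, not Chukwa's actual implementation (which the issue suggests would extend TotalOrderPartitioner).

```java
// Sketch of case #1 partitioning: reducers keyed by (time bucket, data type).
// Bucket width and class name are illustrative assumptions.
public class TimeTypePartitionSketch {
    // Assumed fixed time interval: 5 minutes, in milliseconds.
    static final long BUCKET_MS = 5 * 60 * 1000L;

    // Map a record's timestamp and data type to a reducer index in
    // [0, numReducers). Records with the same time bucket and type always
    // go to the same reducer; different buckets spread across reducers.
    static int partition(long timestampMs, String dataType, int numReducers) {
        long bucket = timestampMs / BUCKET_MS;
        int hash = (int) (bucket * 31 + dataType.hashCode());
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        System.out.println(partition(System.currentTimeMillis(),
                                     "HadoopMetrics", 8));
    }
}
```

Compared with the one-reducer-per-type model of case #2, a large data type here is split across many reducers by time, which removes the long tail at the cost of no single reducer seeing all data for a given type.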

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
