Improve demux reducer partitioning algorithm
--------------------------------------------

                 Key: CHUKWA-481
                 URL: https://issues.apache.org/jira/browse/CHUKWA-481
             Project: Hadoop Chukwa
          Issue Type: Improvement
          Components: MR Data Processors
         Environment: Redhat EL 5.1, Java 6
            Reporter: Eric Yang
            Assignee: Eric Yang


Reducer partitioning for demux could be redefined to optimize for two different 
use cases:

Case #1, demux is responsible for crunching large volumes of the same data type 
(dozens of types).  It will probably make more sense to partition the reducers by 
time grouping + data type (extend TotalOrderPartitioner).  That is, a user can have 
an evenly distributed workload for each reducer based on time interval.  A 
distributed hash table like HBase/Voldemort could be the downstream system to 
store/cache the data for data serving.  This model is great for collecting 
fixed time interval logs like Hadoop metrics, and for ExecAdaptor, which generates 
repetitive time series summaries.
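
A minimal sketch of the case #1 idea, for illustration only.  It is not Chukwa 
code: the class name, the method, and the 5-minute interval are all assumptions, 
and it uses plain modulo arithmetic over (data type, time bucket) rather than 
the sampled split points that TotalOrderPartitioner relies on.

```java
// Hypothetical sketch: not part of Chukwa. Names and the interval are assumptions.
final class TimeTypePartitionSketch {
    // Assumed time grouping: 5-minute buckets.
    static final long INTERVAL_MS = 5 * 60 * 1000L;

    // Maps a record to a reducer from its data type and its time bucket,
    // so consecutive buckets of one data type go to consecutive reducers.
    // Assumes timestampMs >= 0 and numReducers > 0.
    static int partition(String dataType, long timestampMs, int numReducers) {
        long bucket = timestampMs / INTERVAL_MS;
        long combined = 31L * dataType.hashCode() + bucket;
        // Double modulo keeps the result non-negative when hashCode() is negative.
        return (int) (((combined % numReducers) + numReducers) % numReducers);
    }
}
```

With this scheme, a single high-volume data type is spread across reducers as 
time advances, instead of landing on one reducer slot.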
 
Case #2, demux is responsible for crunching hundreds of different data types, but 
a small volume for each data type.  The current demux implementation uses 
this model, where a single data type is reduced by one reducer slot 
(ChukwaRecordPartitioner).  One drawback of this model is that the data from each 
data type must have similar volume.  Otherwise, the data type with the largest 
volume becomes the long tail of the mapreduce job.  Materialized reports are easy 
to generate with this model because the single reducer per data type has a view 
of all data in the given demux run.  This model works great for many different 
applications and for all logging through the Chukwa Log4j appender, e.g. web 
crawl, or log file indexing / viewing.
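
To make the long-tail drawback of case #2 concrete, here is a rough sketch of 
partitioning by data type alone.  It is written in the spirit of 
ChukwaRecordPartitioner but is not its actual code; the class, the methods, and 
the per-type volumes passed in are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: not the actual ChukwaRecordPartitioner.
final class TypeOnlyPartitionSketch {
    // Every record of a given data type maps to the same reducer.
    static int partition(String dataType, int numReducers) {
        // Masking the sign bit keeps the index non-negative.
        return (dataType.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sums the (assumed) volume of each data type onto the reducer it maps to.
    static long[] loadPerReducer(Map<String, Long> volumeByType, int numReducers) {
        long[] load = new long[numReducers];
        for (Map.Entry<String, Long> e : volumeByType.entrySet()) {
            load[partition(e.getKey(), numReducers)] += e.getValue();
        }
        return load;
    }
}
```

Feeding this one very large data type and several small ones shows the skew: 
the reducer that owns the large type carries at least that type's whole volume, 
however many reducers are configured.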
 
I am thinking of changing the default Chukwa demux implementation to case #1, and 
restructuring the current demux as Archive Organizer.  Any suggestions or 
objections?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
