[
https://issues.apache.org/jira/browse/CHUKWA-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12867624#action_12867624
]
Bill Graham commented on CHUKWA-481:
------------------------------------
I agree that being able to configure the default partitioner like we currently
do with the default mapper/reducer would be best. That way, whatever is decided
to be the hard-coded 'reasonable default' can be overridden in configs. Being
able to configure a partitioner per dataType isn't a use case for us. If we
choose not to support it now, we should at least leave the configuration model
open to support it in the future.
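
To make that concrete, here is a minimal sketch of what the lookup could look
like, assuming the old mapred API that demux uses. The property name
chukwa.demux.partitioner.default is hypothetical, chosen to mirror the
existing default mapper/reducer properties:

{code}
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class DemuxPartitionerConfig {
  // Hypothetical property name, mirroring the default mapper/reducer
  // properties; the real name would be decided on this ticket.
  public static final String DEFAULT_PARTITIONER_PROPERTY =
      "chukwa.demux.partitioner.default";

  // Resolve the partitioner class from configuration, falling back to
  // whatever hard-coded 'reasonable default' is chosen.
  public static Class<? extends Partitioner> getPartitionerClass(
      JobConf conf, Class<? extends Partitioner> hardCodedDefault) {
    return conf.getClass(DEFAULT_PARTITIONER_PROPERTY,
        hardCodedDefault, Partitioner.class);
  }
}
{code}

Job setup would then just call conf.setPartitionerClass(...) with the resolved
class, which leaves the model open to a per-dataType property later without
changing this lookup.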
> Improve demux reducer partitioning algorithm
> --------------------------------------------
>
> Key: CHUKWA-481
> URL: https://issues.apache.org/jira/browse/CHUKWA-481
> Project: Hadoop Chukwa
> Issue Type: Improvement
> Components: MR Data Processors
> Environment: Redhat EL 5.1, Java 6
> Reporter: Eric Yang
> Assignee: Eric Yang
>
> Reducer partitioning for demux could be redefined to optimize for two
> different use cases:
> Case #1: demux is responsible for crunching large volumes of the same data
> types (dozens of types). It would probably make more sense to partition the
> reducers by time grouping + data type (extending TotalOrderPartitioner),
> i.e. a user can have an evenly distributed workload for each reducer based
> on time interval. A distributed hash table like HBase/Voldemort could be
> the downstream system to store/cache the data for serving. This model is
> great for collecting fixed-time-interval logs like Hadoop metrics, and the
> ExecAdaptor, which generates repetitive time series summaries.
>
> Case #2: demux is responsible for crunching hundreds of different data
> types, but a small volume of each data type. The current demux
> implementation uses this model, where a single data type is reduced by one
> reducer slot (ChukwaRecordPartitioner). One drawback of this model is that
> the data types must have similar volumes; otherwise, the type with the
> largest data volume becomes the long tail of the MapReduce job.
> Materialized reports are easy to generate with this model because the
> single reducer per data type has a view of all the data from the given
> demux run. This model works great for many different applications that all
> log through the Chukwa Log4j appender, e.g. web crawling, or log file
> indexing/viewing.
>
> I am thinking of changing the default Chukwa demux implementation to case
> #1, and restructuring the current demux as an Archive Organizer. Any
> suggestions or objections?
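
For case #1, a minimal sketch of the bucketing idea (the class name, property
name, and fixed-width buckets are all illustrative; a real version would
extend TotalOrderPartitioner with sampled split points as described above):

{code}
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partitions by (data type, time bucket) so each reducer gets an even,
// time-bounded slice of a type instead of the whole type.
public class TimeGroupingPartitioner
    implements Partitioner<ChukwaRecordKey, ChukwaRecord> {

  // Hypothetical tunable: bucket width, defaulting to 5 minutes.
  private long intervalMs = 5 * 60 * 1000L;

  public void configure(JobConf conf) {
    intervalMs = conf.getLong("chukwa.demux.partition.interval.ms",
        intervalMs);
  }

  public int getPartition(ChukwaRecordKey key, ChukwaRecord record,
      int numReduceTasks) {
    long bucket = record.getTime() / intervalMs;
    int hash = (key.getReduceType() + "/" + bucket).hashCode();
    return (hash & Integer.MAX_VALUE) % numReduceTasks;
  }
}
{code}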
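
For contrast, the per-type model of case #2 boils down to something like the
following (a sketch in the spirit of ChukwaRecordPartitioner, not a copy of
it), which is exactly why the largest type becomes the long tail: every record
of a type lands on the same reducer regardless of volume.

{code}
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of the case #2 model: one reducer sees all records of a data
// type, giving it a full view for materialized reports.
public class PerDataTypePartitioner
    implements Partitioner<ChukwaRecordKey, ChukwaRecord> {

  public void configure(JobConf conf) {
    // No tunables: the data type alone decides the partition.
  }

  public int getPartition(ChukwaRecordKey key, ChukwaRecord record,
      int numReduceTasks) {
    return (key.getReduceType().hashCode() & Integer.MAX_VALUE)
        % numReduceTasks;
  }
}
{code}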