Data partitioning for demux

Eric Yang
Sun, 25 Apr 2010 12:09:31 -0700

Hi all,

I am working on enhancing the reducer partitioning for demux.  It basically
boils down to two main use cases.


Case #1: demux is responsible for crunching large volumes of data for a small
number of data types (a dozen or so).  Here it probably makes more sense to
partition the reducers by time grouping + data type (extending
TotalOrderPartitioner), i.e. the user gets an evenly distributed workload
across reducers based on time interval.  A distributed hash table such as
HBase or Voldemort could be the downstream system that stores/caches the data
for serving.  This model is great for logs collected at fixed time intervals,
such as Hadoop metrics, and for ExecAdaptor, which generates repetitive time
series summaries.
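
To make the case #1 idea concrete, here is a minimal, language-neutral sketch
in Python of partitioning on a (time bucket, data type) key with sorted split
points, which is the general approach a total-order partitioner takes.  It is
not the Chukwa or Hadoop code: INTERVAL_MS, make_key, choose_split_points and
get_partition are hypothetical names, and the 5-minute interval is an
assumption chosen only for illustration.

```python
from bisect import bisect_right

# Assumption: group records into 5-minute buckets (illustrative value only).
INTERVAL_MS = 5 * 60 * 1000


def make_key(timestamp_ms, data_type):
    """Build a sortable key: time bucket first, then data type."""
    return (timestamp_ms // INTERVAL_MS, data_type)


def choose_split_points(sample_keys, num_reducers):
    """Pick num_reducers - 1 boundaries from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def get_partition(key, split_points):
    """Binary-search the boundaries; keys below the first one go to reducer 0."""
    return bisect_right(split_points, key)
```

With keys sampled from the input, each reducer receives a contiguous range of
(time bucket, data type) keys, so the load tracks the time interval rather
than the size of any single data type.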

Case #2: demux is responsible for crunching hundreds of different data types,
with a small volume for each data type.  The current demux implementation
uses this model, where a single data type is reduced by one reducer slot
(ChukwaRecordPartitioner).  One drawback of this model is that every data
type must have a similar volume.  Otherwise, the data type with the largest
volume becomes the long tail of the mapreduce job.  Materialized reports are
easy to generate with this model because the single reducer per data type
sees all the data for that type in the given demux run.  This model works
great for many different applications and for anything logged through the
Chukwa Log4j appender, e.g. web crawls or log file indexing / viewing.
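
The long-tail effect in case #2 can be illustrated with a small Python
sketch.  It is only a model of "one data type goes to one reducer", not the
actual ChukwaRecordPartitioner: partition_by_type and reducer_load are
hypothetical helpers, the CRC32 hash is a stand-in chosen because it is
deterministic, and the record counts in the usage lines are made up.

```python
import zlib
from collections import Counter


def partition_by_type(data_type, num_reducers):
    """Send every record of a data type to the same reducer (stand-in hash)."""
    return zlib.crc32(data_type.encode("utf-8")) % num_reducers


def reducer_load(record_counts, num_reducers):
    """Sum the records each reducer would receive, given counts per data type."""
    load = Counter()
    for data_type, count in record_counts.items():
        load[partition_by_type(data_type, num_reducers)] += count
    return load


# Made-up volumes: one large data type next to two small ones.
load = reducer_load({"SysLog": 1000000, "Df": 100, "Iostat": 200}, 8)
busiest = max(load.values())  # at least the volume of the largest data type
```

Whatever the hash does, the busiest reducer can never hold less than the
largest single data type, which is why one oversized type sets the run time
of the whole job.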

I am thinking of changing the default Chukwa demux implementation to case #1
and restructuring the current demux as the Archive Organizer.  Any suggestions
or objections?

Regards,
Eric
