Agree.

> Date: Tue, 3 Mar 2015 15:35:23 +0800
> Subject: Re: proposal of cube building optimization
> From: [email protected]
> To: [email protected]
> 
> This proposal is the same as https://issues.apache.org/jira/browse/KYLIN-607
> that I created earlier.
> 
> @宋轶, the difference from our very first POC is that here the mapper outputs
> the aggregated result of a small chunk of records (the KVs of a micro
> segment), not the raw records.
> 
> In the ideal case, the solution could achieve 1 * [Total Cube Size] of
> shuffling when there is a mandatory dimension and each mapper takes a
> different piece of that dimension. E.g. month is mandatory and each mapper
> is assigned a different month's data. Then no mapper's output duplicates
> another's, and the shuffle size is optimal.
> 
> Of course, in the worst case, the shuffle size might be several times the
> current one, so it really depends on the data set and the aggregation
> config. What we are seeing now is that, more often than not, date/time is
> a mandatory column, and if that's true, the new method will have an edge.
> 
> Cheers
> Yang
> 
> 
> 
> On Mon, Mar 2, 2015 at 2:13 PM, 蒋旭 <[email protected]> wrote:
> 
> > 1. One-step building is better suited to incremental builds, which have
> > small data sizes. Full builds on large data sets can still use
> > multi-stage building.
> >
> >
> > 2. Since the mapper manages memory by itself, it will cache the
> > intermediate result in memory as much as possible. Moreover, the mapper
> > will do pre-aggregation in memory, just like a combiner. In this way, it
> > should reduce the shuffle data size.
> >
> >
> > 3. Since it's one-step building, the data read size and job scheduling
> > latency should be much lower.
> >
> >
> > Thanks
> > Jiang Xu
> >
> >
> > ------------------ Original Message ------------------
> > From: Ted Dunning <[email protected]>
> > Sent: March 2, 2015 13:52
> > To: dev <[email protected]>
> > Subject: Re: proposal of cube building optimization
> >
> >
> >
> > On Mon, Mar 2, 2015 at 6:47 AM, 宋轶 <[email protected]> wrote:
> >
> > > The problem with it is that each mapper will generate too much
> > > intermediate data, and the network will become the bottleneck in the
> > > shuffle phase.
> >
> >
> > This would prevent multiple passes over the input data. Is there a
> > difference between the amount of shuffled data and the amount that would
> > be shuffled by multiple map-reduce steps?
> >
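The in-mapper pre-aggregation that Jiang Xu describes (caching partial aggregates in memory and spilling only the collapsed KVs, like a combiner inside the mapper) can be sketched roughly as follows. This is a minimal standalone sketch, not Kylin code; the class, the flush threshold, and the sum measure are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-mapper pre-aggregation: raw records sharing a dimension key
// are collapsed in an in-memory cache, and only the aggregated KVs are
// emitted, so the shuffle carries far fewer records than the raw input.
class InMapperAggregator {
    private final Map<String, Long> cache = new HashMap<>();
    private final Map<String, Long> emitted = new HashMap<>();
    private final int cacheLimit;   // memory budget, expressed as max cached keys

    InMapperAggregator(int cacheLimit) {
        this.cacheLimit = cacheLimit;
    }

    // One raw record: a dimension key plus a measure to sum.
    void map(String dimensionKey, long measure) {
        cache.merge(dimensionKey, measure, Long::sum);
        if (cache.size() >= cacheLimit) {
            flush();                // spill aggregated KVs, not raw rows
        }
    }

    // Emit the aggregated micro-segment; a real mapper would write these
    // KVs to the shuffle instead of a local map.
    void flush() {
        for (Map.Entry<String, Long> e : cache.entrySet()) {
            emitted.merge(e.getKey(), e.getValue(), Long::sum);
        }
        cache.clear();
    }

    Map<String, Long> result() {
        flush();
        return emitted;
    }
}
```

For example, feeding three raw records with two distinct month keys emits only two aggregated KVs, which is the shuffle-size reduction the proposal is after; the worst case (all keys distinct) emits as many KVs as raw records, matching Yang's caveat.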
                                          
