Agree.
> Date: Tue, 3 Mar 2015 15:35:23 +0800
> Subject: Re: proposal of cube building optimization
> From: [email protected]
> To: [email protected]
>
> This proposal is the same as https://issues.apache.org/jira/browse/KYLIN-607,
> which I created earlier.
>
> @宋轶, the difference from our very first POC is that here the mapper
> outputs the aggregated result of a small chunk of records (the KVs of a
> micro segment), not the raw records.
>
> In the ideal case, the solution could achieve 1 * [Total Cube Size] of
> shuffling when there is a mandatory dimension and each mapper takes a
> different slice of that dimension. E.g., if month is mandatory and each
> mapper is assigned a different month's data, then no mapper's output
> duplicates another's, and the shuffle size is optimal.
>
> Of course, in the worst case, the shuffle size might be several times the
> current one. So it really depends on the data set and the aggregation
> config. What we are seeing now is that, more often than not, date/time is
> a mandatory column, and if that holds, the new method has an edge.
>
> Cheers
> Yang
>
>
> On Mon, Mar 2, 2015 at 2:13 PM, 蒋旭 <[email protected]> wrote:
>
> > 1. One-step building is more suitable for incremental builds, which
> > have a small data size. Full builds on large data sets can still use
> > multi-stage building.
> >
> > 2. Since the mapper manages memory by itself, it will cache the
> > intermediate results in memory as much as possible. Moreover, the
> > mapper will do pre-aggregation in memory, just like a combiner. In
> > this way, it should reduce the shuffle data size.
> >
> > 3. Since it's one-step building, the data read size and job scheduling
> > latency should be much lower.
> >
> > Thanks
> > Jiang Xu
> >
> > ------------------ Original Message ------------------
> > From: Ted Dunning <[email protected]>
> > Sent: 2015-03-02 13:52
> > To: dev <[email protected]>
> > Subject: Re: proposal of cube building optimization
> >
> > On Mon, Mar 2, 2015 at 6:47 AM, 宋轶 <[email protected]> wrote:
> >
> > > The problem with it is that each mapper will generate too much
> > > intermediate data, and the network will become the bottleneck in the
> > > shuffle phase.
> > >
> >
> > This would prevent multiple passes over the input data. Is there a
> > difference between the amount of shuffled data and the amount that
> > would be shuffled by multiple map-reduce steps?
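
The shuffle argument in Yang's mail boils down to a simple bound. The Java
sketch below merely encodes that bound; the function name and parameters are
illustrative assumptions, not anything from the Kylin code base.

    // Back-of-envelope bound only; names are hypothetical, not from Kylin.
    // With M mappers and total cube size C:
    //  - if a mandatory dimension (e.g. month) partitions the input so that
    //    each aggregated key comes from exactly one mapper, shuffle ~= 1 * C;
    //  - if every mapper can emit every key, shuffle can approach M * C.
    static long shuffleSizeUpperBound(long cubeSizeBytes, int numMappers,
                                      boolean partitionedByMandatoryDim) {
        return partitionedByMandatoryDim ? cubeSizeBytes
                                         : (long) numMappers * cubeSizeBytes;
    }

Whether the new method wins therefore depends on how close the real workload
sits to the partitioned case, which is why the thread stresses that date/time
is usually a mandatory column.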

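Point 2 in 蒋旭's list is the classic in-mapper combining pattern: keep
partial aggregates in a bounded in-memory map and flush only when it fills
up, so most of the aggregation happens before the shuffle. Below is a
minimal Hadoop MapReduce sketch of that pattern, assuming comma-separated
input records of the form "dimensionKey,measure"; the class name, field
layout, and MAX_CACHED_KEYS budget are hypothetical, and this is not
Kylin's actual cube builder.

    // A sketch of in-mapper pre-aggregation (hypothetical names, not Kylin code).
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PreAggregatingMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        // Assumed memory budget: flush once this many distinct keys are cached.
        private static final int MAX_CACHED_KEYS = 100_000;
        private final Map<String, Long> cache = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Hypothetical record layout: "dimensionKey,measure".
            String[] fields = line.toString().split(",");
            String dimKey = fields[0];
            long measure = Long.parseLong(fields[1]);

            // Aggregate in memory, doing the combiner's work before the shuffle.
            cache.merge(dimKey, measure, Long::sum);

            if (cache.size() >= MAX_CACHED_KEYS) {
                flush(context); // spill partial aggregates when the budget is hit
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            flush(context); // emit whatever is still cached at the end of the split
        }

        private void flush(Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, Long> e : cache.entrySet()) {
                context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
            }
            cache.clear();
        }
    }

The cleanup() hook matters here: without that final flush, any aggregates
still sitting in the cache at the end of the split would be lost.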