I support solution 2.
We add a new shuffle type, so we have to expand our agg code to support the new cases. At 2022-06-26 21:49:19, "蔡聪辉" <caiconghui2...@163.com> wrote: > > > >Hi, all devs. Now Doris has supported random distribution for Duplicate Keys >table and Aggregate Keys table without replace(replace_if_not_null) type >column to solve data skew problem. > >But now, It brings some problem when we do some speical queris on aggregate >table with random distribution. > >For example, if we have a aggreate table like following: > > > > >Table agg_table > > > > >ColumnNameType Aggregate Type > >k1char(5) > >v1 int max > > > > >Origin Data Source, which include 5 rows : > > >a 1 > >a 2 > >b 3 > >b 4 > >b 5 > > > > >In hash distribution, the Doris can ensure that diffrent rows with same >aggregate keys always in the same tablet(replica), > > > > >so that the result for select * from agg_table is same with select * from >(select k1, max(v1) from agg_table); > > > > >but for the random distribution, aggregate keys table with diffrent rows with >same aggregate keys may be in the difffrent tablet(replica), > > > > >which may result in different reuslt from hash distribution. > > > > >But what is the real root cause? This is because Doris leave out the aggregate >function and group by statement if needed when do select column operation from >aggregate table, > >and do aggregate function with group by operation only in the Doris storage >layer. > > > > >In the current design, count(1) and select just value column without aggregate >funcion or group by stament for aggregate table now is diffrent from query >reuslt on rollup which base on duiplicate table. > > > > >For example, if we have duplcate table > >Table dup_table > >ColumnName Type > >k1 char(5) > >v1 int > > > > >the the rollup is > > Rollup rollup_table > >ColumnNameType Aggregate Type > >k1char(5) > >v1 int max > > > > >if we exectue query like "select * from dup_table " or "select count(1) from >dup_table", we won't hit the rollup even if query rollup will cost less, this >because select * or select count(1) > >not really match the v1 column with aggregate function like min. > > > > >So. to make query result on random distribution consistent with hash >distribution for aggregate table, here are two main solutions. > >1. forbid directly select * from aggregate table, forbid select count(1) >directly from aggregate table, and give the right sql to remind user that they >should use aggregate function on value column or write group by statement >explicitly. > >2. keep the origin design, and we rewrite sql when user execute some special >queries on aggregate table. > > >These two solutions have their own advantages and disadvantages. You are >welcome to give your opinions, feel free to discuss. >