I understand that the dictionary preserves order as well as saves storage space. And for those "IN" queries, dropping the values that cannot be found in the dictionary is indeed a good way to improve performance.
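Just to make sure I understand that optimisation correctly, here is a rough Java-style sketch of what I imagine KYLIN-747 would do. The Dict interface and method names below are made up for illustration only and are not Kylin's real API:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a dictionary lookup, not Kylin's Dictionary class.
    interface Dict {
        /** Returns the dictionary id for a value, or -1 if the value has no entry. */
        int getIdOrNegative(String value);
    }

    class InClauseTranslator {
        static List<Integer> translateInValues(Dict dict, List<String> inValues) {
            List<Integer> ids = new ArrayList<>();
            for (String v : inValues) {
                int id = dict.getIdOrNegative(v);
                if (id >= 0) {
                    ids.add(id);   // keep only values that exist in the cube
                }
                // a value with no dictionary entry can never match any row,
                // so dropping it keeps the result identical and the scan small
            }
            return ids;
        }
    }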
However, when it comes to a range SQL query, I am not sure whether the dictionary still helps. Let's say we have a query like "select id, count(*) from the_table where id >= a and id <= b group by id". If "a" is not found in the dictionary of the "id" column, I suppose Kylin would substitute some default value for "a" when computing the HBase scan range, which could yield an unwantedly large scan. But if we disable the dictionary option for the "id" column when building the cube, I guess Kylin would use the raw value "a" directly to come up with the scan range, which I think is clearly better than the dictionary approach in this case. Please correct me if I misunderstood or missed something. (I have put a rough sketch of what I have in mind in a P.S. at the bottom of this mail, below the quoted thread.)

Thanks,
Hua

> -----Original Message-----
> From: dev-return-1394-[email protected]
> [mailto:dev-return-[email protected]] On Behalf Of 周千昊
> Sent: May 2, 2015 0:32
> To: [email protected]
> Subject: Re: How to improve kylin query range scan?
>
> Hi, Huang
>     Actually, using the dictionary will not affect the scan range, because the
> dictionary we are using preserves the order. And one of the most
> important reasons to use the dictionary is storage saving.
>     However, for this case some optimisation can be made: for example,
> when a value in the IN clause cannot be translated into a dictionary id, we
> can simply ignore that value.
>     A JIRA ticket has been created:
> https://issues.apache.org/jira/browse/KYLIN-747
>
> Huang Hua <[email protected]> wrote on Thursday, April 30, 2015 at 18:15:
>
> > Hi,
> >
> > Recently we were trying queries like this: select id, count(distinct
> > no) from the_table where id in (x1, x2, x3, ...) group by id, where
> > x1, x2, x3, ... are the actual id values.
> >
> > And we found out that the performance of a Kylin query drops
> > significantly if the values in the where clause can't be translated
> > into dictionary ids.
> >
> > What I mean is that if, let's say, the cube doesn't contain the id
> > value x2, then when running the above query the total scan count will
> > be much larger than the scan count when the cube contains all the x values.
> >
> > For example, we had a query with 39 x values in the where clause and
> > one x value not in the cube, which yielded the following result:
> >
> > Duration: 60.947
> > Cube Names: [olap]
> > Total scan count: 2524898
> > Result row count: 39
> >
> > (The log shows "Can't translate value xxx to dictionary ID, roundingFlag 0.
> > Using default value \xFF")
> >
> > Then we excluded the x value that is not in the cube, re-ran the
> > query, and got the following result:
> >
> > Duration: 2.477
> > Cube Names: [olap]
> > Total scan count: 96543
> > Result row count: 38
> >
> > The second query runs much faster just because it removes an id value
> > that is not in the cube. Can anyone share some ideas about this? Would
> > not building a dictionary and just using raw values for the id column
> > when building the cube be a solution to improve the performance?
> >
> > Best Regards
> >
> > Hua
> >
> >
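P.S. Here is a rough Java-style sketch of how I imagine the range bounds could be translated without widening the scan: round a missing lower bound up to the next value that does exist in the dictionary, and a missing upper bound down to the previous one, instead of falling back to a default id such as \xFF. The SortedDict and RangeTranslator classes are made up for illustration and are not Kylin's actual Dictionary API:

    import java.util.Map;
    import java.util.TreeMap;

    // Toy order-preserving dictionary; assumes the dictionary's sort order
    // matches the column's comparison order used in the SQL predicate.
    class SortedDict {
        // value -> dictionary id; ids are assigned in value order, so order is preserved
        private final TreeMap<String, Integer> map = new TreeMap<>();

        void put(String value, int id) { map.put(value, id); }

        /** Id of the smallest value >= v, or null if there is none. */
        Integer lowerBoundId(String v) {
            Map.Entry<String, Integer> e = map.ceilingEntry(v);
            return e == null ? null : e.getValue();
        }

        /** Id of the largest value <= v, or null if there is none. */
        Integer upperBoundId(String v) {
            Map.Entry<String, Integer> e = map.floorEntry(v);
            return e == null ? null : e.getValue();
        }
    }

    class RangeTranslator {
        /** Returns {startId, endId} for "id >= a and id <= b", or null if no row can match. */
        static int[] translate(SortedDict dict, String a, String b) {
            Integer start = dict.lowerBoundId(a);  // round up when "a" is not in the dictionary
            Integer end = dict.upperBoundId(b);    // round down when "b" is not in the dictionary
            if (start == null || end == null || start > end) {
                return null;                       // empty range: the scan can be skipped entirely
            }
            return new int[] { start, end };
        }
    }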
