Got your concern. Nice catch ;)

However in such case you describe, whether kylin yield an unwanted big
query scan depends on how the dictionary search is implemented rather than
using dictionary or not.

Take "select id, count(*) from the_table where id >= a and id <= b group
by id” as example, even when a is not found in the dictionary, if the
dictionary can return the value closest to the code of value a, the scan
range will be equal to using raw value which not applying dictionary.

Currently we have done such optimizations for query using ‘>’, ‘>=’ etc,
and IN clause is still on the way.

Best Regard
Zhou QianHao





On 5/2/15, 11:18 PM, "Huang Hua" <[email protected]> wrote:

>I understand that dictionary will preserve the order as well as save
>storage space.  And for those "IN" queries, dropping the unfounded values
>is indeed a solution to improve performance.
>
>However, when it comes to sql query, I am not sure if the dictionary
>still helps. Let's say we have a query like "select id, count(*) from
>the_table where id >= a and id <= b group by id", and if "a" is not found
>in the dictionary of "id", I suppose kylin would use some sort of default
>value to replace "a" to get hbase scan range which could yield an
>unwanted big query scan.
>
>But If we disable the dictionary option for "id" column when building the
>cube, I guess that kylin would directly use the raw value "a" instead of
>the default value to come up with the scan range for the sql query, which
>I think it's definitely better compared to the dictionary approach.
>
>Please correct me if I understood it wrong or missed something.. Thanks.
>
>Hua
>
>> -----邮件原件-----
>> 发件人: dev-return-1394-
>> [email protected] [mailto:dev-return-
>> [email protected]] 代表 周千昊
>> 发送时间: 2015年5月2日 0:32
>> 收件人: [email protected]
>> 主题: Re: How to improve kylin query range scan?
>> 
>> Hi, Huang
>>       Actually using dictionary will not affect the scan range, because
>>the
>> dictionary we are using will preserve the order. And one of the most
>> important reason to use dictionary is storage saving.
>>       However for this case, some optimisation can be made, for example,
>> when the value in IN clause can not be translated into dictionary id,
>>we can
>> simply ignore this value.
>>       A jira ticket has been created:
>> https://issues.apache.org/jira/browse/KYLIN-747
>> 
>> Huang Hua <[email protected]>于2015年4月30日周四 下午
>> 6:15写道:
>> 
>> > Hi,
>> >
>> >
>> >
>> > Recently we were trying queries like this: select id, count(distinct
>> > no) from the_table where id in (x1, x2, x3, .) group by id, where x1,
>>x2, x3, .
>> > are the actual id values.
>> >
>> > And we found out that the performance of kylin query would drop
>> > significantly if the values in where clause can't be translated into
>> > dictionary id.
>> >
>> >
>> >
>> > What I mean is that if let's say the cube doesn't contain id values of
>> > x2, when running the above query, the total scan count will be much
>> > larger than the scan count if the cube contains all the x values.
>> >
>> >
>> >
>> > For example, we had a query of 39 x values in where clause and there
>> > is one x value not in cube,  which yielded the following result:
>> >
>> > Duration: 60.947
>> >
>> >
>> > Cube Names: [olap]
>> >
>> > Total scan count: 2524898
>> >
>> > Result row count: 39
>> >
>> > (The log shows "Can't translate value xxx to dictionary ID,
>>roundingFlag 0.
>> > Using default value \xFF")
>> >
>> >
>> >
>> > And we excluded the x value that is not in cube and re-run the query
>> > and got the another result:
>> >
>> > Duration: 2.477
>> >
>> > Cube Names: [olap]
>> >
>> > Total scan count: 96543
>> >
>> > Result row count: 38
>> >
>> >
>> >
>> > The second query runs much faster just because it removes an id value
>> > that is not in cube. Can anyone share some ideas about this? Would not
>> > building dictionary and just using raw values for id column to build
>> > cube be a solution to improve the performance?
>> >
>> >
>> >
>> > Best Regards
>> >
>> > Hua
>> >
>> >
>
>

Reply via email to