Re: doubt about measure of processedRowCount

2018-11-06 Thread ShaoFeng Shi
Good job, JiaTao! I appreciate your support for the community!


-- 
Best regards,

Shaofeng Shi 史少锋


Re: doubt about measure of processedRowCount

2018-11-06 Thread JiaTao Tao
Very glad that my reply was helpful. I have already opened a JIRA to add
logs for "*GTStreamAggregateScanner*", so next time this should be much
easier to track down :).
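
For anyone reading this thread later, here is a rough Python sketch of the effect discussed in the quoted messages below. It is not Kylin's code: `stream_aggregate` and the sample rows are made up for illustration. It assumes each partition result arrives sorted by key, so rows that share a key can be merged in a single streaming pass. That is one way the row count seen after the merge can end up smaller than the number of rows the storage layer returned.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter


def stream_aggregate(partitions):
    """Merge key-sorted partitions and sum the measures of rows sharing a key.

    Hypothetical helper for illustration only; it is not part of Kylin.
    """
    # heapq.merge keeps the combined stream sorted without loading it all at once.
    merged = merge(*partitions, key=itemgetter(0))
    # Rows with equal keys are now adjacent, so groupby can fold them together.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, sum(measure for _, measure in group)


# Made-up (dimension key, measure) rows from three storage partitions.
partitions = [
    [("2018-11-01", 5), ("2018-11-02", 3)],
    [("2018-11-01", 2), ("2018-11-03", 7)],
    [("2018-11-02", 1), ("2018-11-03", 4)],
]

returned_rows = sum(len(p) for p in partitions)  # rows handed back by storage
result = list(stream_aggregate(partitions))
processed_rows = len(result)                     # rows left after the merge

print(returned_rows, processed_rows)
```

In this toy input, six rows come back from storage but only three remain after merging, which mirrors the gap between *valueB* and *valueA* reported in the thread.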

cheney <531014...@qq.com> wrote on Tue, Nov 6, 2018 at 11:57 PM:

> Hi, JiaTao, thank you very much! The statistics are correct when I set
> "kylin.query.stream-aggregate-enabled=false".
> You are right: records are pre-aggregated by GTStreamAggregateScanner.
>
>
> -- Original Message --
> *From:* "JiaTao Tao";
> *Date:* Tue, Nov 6, 2018 10:50 PM
> *To:* "user";
> *Subject:* Re: doubt about measure of processedRowCount
>
> One possible place I can find in the code is the use of
> *GTStreamAggregateScanner* (in "*SegmentCubeTupleIterator.java#111*").
> You can see that it does aggregate in
> "*GTStreamAggregateScanner.AbstractStreamMergeIterator#next*", so it will
> reduce the number of input records. But there is no logging in this class,
> as you can see, so it is pretty hard to confirm. Try setting
> "kylin.query.stream-aggregate-enabled=false" and run the scenario again to
> see whether anything changes.
>
> cheney <531014...@qq.com> wrote on Mon, Nov 5, 2018 at 6:55 PM:
>
>> Yes. The log is as follows.
>>
>> 2018-11-02 22:25:34,980 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.StorageResponseGTScatter:88 : Using SortMergedPartitionResultIterator to merge 103 partition results
>> 2018-11-02 22:25:34,982 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] gtrecord.SequentialCubeTupleIterator:73 : Using Iterators.concat *to merge segment results*
>> 2018-11-02 22:25:34,982 DEBUG [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] enumerator.OLAPEnumerator:122 : return TupleIterator...
>> 2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:897 : *Processed rows for each storageContext*: 366
>> 2018-11-02 22:25:34,991 INFO  [Query 03ea4f21-29ed-4b74-8faa-c57ecd44f412-198914] service.QueryService:422 : Stats of SQL response: isException: false, duration: 20, *total scan count 1552*
>>
>> According to the log, *valueA* = 366, while *valueB* = 1552 (total scan
>> count) - 270 (total aggregated/filtered in HBase) = 1282.
>> *valueB* is much larger than *valueA*.
>>
>>
>>
>> -- Original Message --
>> *From:* "JiaTao Tao";
>> *Date:* Mon, Nov 5, 2018 2:41 PM
>> *To:* "user";
>> *Subject:* Re: doubt about measure of processedRowCount
>>
>> Can you grep the logs for lines like "to merge segment results" in that
>> scenario?
>>
>> cheney <531014...@qq.com> wrote on Sat, Nov 3, 2018 at 4:15 PM:
>>
>>> Thank you for replying, but I am sure there is only one OlapContext in
>>> the query in my scenario.
>>> ---Original---
>>> *From:* "JiaTao Tao"
>>> *Date:* Sat, Nov 3, 2018 10:42 AM
>>> *To:* "user";
>>> *Subject:* Re: doubt about measure of processedRowCount
>>>
>>> Maybe summing all the *valueA* values would be more appropriate, because
>>> there may be more than one OlapContext in the query (one OlapContext
>>> corresponds to one storageContext).
>>>
>>> There are two good blog posts about Kylin's query engine that you may
>>> want to look at :).
>>>
>>> https://blog.csdn.net/yu616568/article/details/50838504
>>>
>>> https://zhuanlan.zhihu.com/p/30613434
>>>
>>> cheney <531014...@qq.com> wrote on Fri, Nov 2, 2018 at 11:10 PM:
>>>
>>>> Hi, guys
>>>>
>>>> When I execute a SQL query in Kylin, the Kylin server logs some query
>>>> statistics. For example, the log looks like this:
>>>>
>>>> "Processed rows for each storageContext: *valueA*". *valueA* is
>>>> processedRowCount.
>>>>
>>>> My understanding is that processedRowCount is the number of records
>>>> returned by HBase.
>>>>
>>>> The HBase coprocessor logs region stats, including "*Total scanned
>>>> row*" and "Total filtered/aggred row".
>>>>
>>>> For one region, final records returned by HBase = *Total scanned
>>>> row* - Total filtered/aggred row.
>>>> Suppose this query needs to scan 10 regions in HBase. We can get the
>>>> stats for every region, so we can get the total number of records
>>>> *valueB* returned by HBase by summing the final records of the 10
>>>> regions.
>>>>
>>>> In general, *valueA* is equal to *valueB*, but sometimes *valueB* is
>>>> much larger than *valueA*. Why?


-- 


Regards!

Aron Tao