[ 
https://issues.apache.org/jira/browse/HBASE-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell closed HBASE-14324.
---------------------------------------

> MetricSampleQuantiles maintains quantiles summary in a wrong way
> ----------------------------------------------------------------
>
>                 Key: HBASE-14324
>                 URL: https://issues.apache.org/jira/browse/HBASE-14324
>             Project: HBase
>          Issue Type: Bug
>          Components: metrics
>            Reporter: Rafi Shachar
>            Priority: Major
>
> The MetricSampleQuantiles computes quantiles estimations for data stream 
> according to a paper published by CKMS in 2005. However the implementation is 
> incorrect compared to the paper. In particular, the insert and compact 
> methods use the rank of an item in the summary whereas the rank should be the 
> estimated rank in the stream. Due to this bug the resulting summary is much 
> larger than required: according to my experiments it is more than 10 times 
> larger. Furthermore, the summary size continues to grow with stream size 
> while the distribution doesn't change. When the number of items is in the 
> tens of millions the summary size is about 200K while it should be in the 
> range of few thousands. This has significant effect on performance. I didn't 
> see significant effect on accuracy.
> The insert batch and compress methods call allowableError() passing the rank 
> of the item in the summary. It actually should pass the estimated rank in the 
> stream. In the CKMS paper this rank is denoted by r_i, which more precisely 
> is the estimated rank of item i-1. allowableError now considers the size of 
> the summary where it should consider the number of items that has been 
> observed.
> In addition, the class currently uses a static sized buffer of size 500. This 
> yields poor runtime performance. Increasing this buffer to size 10000 or more 
> yields much better performance (I'm using it with buffer size of 100K). The 
> buffer size can be dynamic and grow with number of items in the stream.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to