Rafi Shachar created HBASE-14324:
------------------------------------

             Summary: MetricSampleQuantiles maintains quantiles summary in a 
wrong way
                 Key: HBASE-14324
                 URL: https://issues.apache.org/jira/browse/HBASE-14324
             Project: HBase
          Issue Type: Bug
          Components: metrics
            Reporter: Rafi Shachar


The MetricSampleQuantiles computes quantiles estimations for data stream 
according to a paper published by CKMS in 2005. However the implementation is 
incorrect compared to the paper. In particular, the insert and compact methods 
use the rank of an item in the summary whereas the rank should be the estimated 
rank in the stream. Due to this bug the resulting summary is much larger than 
required: according to my experiments it is more than 10 times larger. 
Furthermore, the summary size is not stable while the distribution doesn't 
change. When the number of items continue to grow to the tens of millions the 
summary size is about 200K while it should be in the range of few thousands. 
This has significant on performance. I didn't see significant effect on 
accuracy.

In insert batch and compress methods, call allowableError() passing the rank of 
the item in the summary. It actually should pass the estimated rank in the 
stream. In the CKMS paper this rank is denoted by r_i, which more precisely is 
the estimated rank of item i-1. allowableError now considers the size of the 
summary where it should consider the number of items that has been observed.

In addition, the class currently uses a static sized buffer of size 500. This 
yields poor runtime performance. Increasing this buffer to size 10000 or more 
yields much better performance (I'm using it with buffer size of 100K). The 
buffer size can be dynamic and grow with number of items in the stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to