[
https://issues.apache.org/jira/browse/HBASE-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell resolved HBASE-14324.
-----------------------------------------
Resolution: Incomplete
> MetricSampleQuantiles maintains quantiles summary in a wrong way
> ----------------------------------------------------------------
>
> Key: HBASE-14324
> URL: https://issues.apache.org/jira/browse/HBASE-14324
> Project: HBase
> Issue Type: Bug
> Components: metrics
> Reporter: Rafi Shachar
> Priority: Major
>
> The MetricSampleQuantiles computes quantiles estimations for data stream
> according to a paper published by CKMS in 2005. However the implementation is
> incorrect compared to the paper. In particular, the insert and compact
> methods use the rank of an item in the summary whereas the rank should be the
> estimated rank in the stream. Due to this bug the resulting summary is much
> larger than required: according to my experiments it is more than 10 times
> larger. Furthermore, the summary size continues to grow with stream size
> while the distribution doesn't change. When the number of items is in the
> tens of millions the summary size is about 200K while it should be in the
> range of few thousands. This has significant effect on performance. I didn't
> see significant effect on accuracy.
> The insert batch and compress methods call allowableError() passing the rank
> of the item in the summary. It actually should pass the estimated rank in the
> stream. In the CKMS paper this rank is denoted by r_i, which more precisely
> is the estimated rank of item i-1. allowableError now considers the size of
> the summary where it should consider the number of items that has been
> observed.
> In addition, the class currently uses a static sized buffer of size 500. This
> yields poor runtime performance. Increasing this buffer to size 10000 or more
> yields much better performance (I'm using it with buffer size of 100K). The
> buffer size can be dynamic and grow with number of items in the stream.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)