[ https://issues.apache.org/jira/browse/HBASE-14324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell closed HBASE-14324.
---------------------------------------

> MetricSampleQuantiles maintains quantiles summary in a wrong way
> ----------------------------------------------------------------
>
>                 Key: HBASE-14324
>                 URL: https://issues.apache.org/jira/browse/HBASE-14324
>             Project: HBase
>          Issue Type: Bug
>          Components: metrics
>            Reporter: Rafi Shachar
>            Priority: Major
>
> MetricSampleQuantiles computes quantile estimates for a data stream
> according to a paper published by CKMS in 2005. However, the implementation
> is incorrect compared to the paper. In particular, the insert and compress
> methods use the rank of an item in the summary, whereas the rank should be
> the estimated rank in the stream. Due to this bug the resulting summary is
> much larger than required: according to my experiments it is more than 10
> times larger. Furthermore, the summary size continues to grow with stream
> size even though the distribution doesn't change. When the number of items
> is in the tens of millions, the summary size is about 200K, while it should
> be in the range of a few thousand. This has a significant effect on
> performance. I didn't see a significant effect on accuracy.
>
> The insert batch and compress methods call allowableError(), passing the
> rank of the item in the summary. They should actually pass the estimated
> rank in the stream. In the CKMS paper this rank is denoted by r_i, which
> more precisely is the estimated rank of item i-1. allowableError currently
> considers the size of the summary, where it should consider the number of
> items that have been observed.
>
> In addition, the class currently uses a statically sized buffer of size 500.
> This yields poor runtime performance. Increasing this buffer to size 10000
> or more yields much better performance (I'm using it with a buffer size of
> 100K). The buffer size can be dynamic and grow with the number of items in
> the stream.
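
For readers following the report, here is a minimal Java sketch of the change it describes: an allowable-error function that takes the estimated stream rank r_i and the number of observed items n, rather than the item's position in the summary and the summary size. It is not the HBase code. The class name, the Target helper, and the method signatures are hypothetical, and the error formula is the targeted-quantile invariant as I understand it from the CKMS paper.

```java
// Hypothetical illustration only; names and signatures are not from HBase.
class CkmsAllowableErrorSketch {

    /** A targeted quantile and its permitted error, e.g. (0.99, 0.001). */
    static final class Target {
        final double quantile;
        final double error;

        Target(double quantile, double error) {
            this.quantile = quantile;
            this.error = error;
        }
    }

    /**
     * Allowable error f(r, n), where r is the estimated rank in the stream
     * and n is the number of items observed so far (assumption: NOT the
     * summary index and NOT the summary size).
     */
    static double allowableError(double estimatedRank, long observedCount, Target[] targets) {
        double minError = observedCount + 1;
        for (Target t : targets) {
            double error;
            if (estimatedRank <= t.quantile * observedCount) {
                error = (2.0 * t.error * (observedCount - estimatedRank)) / (1.0 - t.quantile);
            } else {
                error = (2.0 * t.error * estimatedRank) / t.quantile;
            }
            minError = Math.min(minError, error);
        }
        return minError;
    }

    /**
     * Estimated ranks from the per-item g values of a summary:
     * ranks[i] is the sum of g[j] for all j before i.
     */
    static long[] estimatedRanks(int[] g) {
        long[] ranks = new long[g.length];
        long running = 0;
        for (int i = 0; i < g.length; i++) {
            ranks[i] = running;
            running += g[i];
        }
        return ranks;
    }

    public static void main(String[] args) {
        Target[] targets = { new Target(0.5, 0.05) };
        int[] g = { 1, 3, 2 };
        long observed = 6; // total of all g values
        long[] ranks = estimatedRanks(g);
        for (int i = 0; i < g.length; i++) {
            System.out.println("item " + i + ": r=" + ranks[i]
                + " allowableError=" + allowableError(ranks[i], observed, targets));
        }
    }
}
```

Under this reading, an insert or compress pass would keep a running sum of g while walking the summary and pass that sum, together with the observed count, to the error function.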