leerho commented on code in PR #706: URL: https://github.com/apache/datasketches-java/pull/706#discussion_r2656120838
##########
src/main/javadoc/resources/dictionary.html:
##########
@@ -143,95 +143,95 @@ <h3><a name="numStdDev">Number of Standard Deviations</a></h3>
 getLowerBound(3) returns the estimated quantile(0.0013) of the distribution.<br>
 </p>
-<p>However, for sketches with small configured values of <i>Nominal Entries < 4096</i> for Theta or <i>lgConfigK < 12</i> for HLL,
-the error distribution of the sketch becomes quite asymmetric and cannot be approximated with a Gaussian. In these cases the interpretation of
-<i>numStdDev</i> is that of an index that returns the quantile of the sketch error distribution that corresponds to fractional normalized rank
+<p>However, for sketches with small configured values of <i>Nominal Entries < 4096</i> for Theta or <i>lgConfigK < 12</i> for HLL,
+the error distribution of the sketch becomes quite asymmetric and cannot be approximated with a Gaussian. In these cases the interpretation of
+<i>numStdDev</i> is that of an index that returns the quantile of the sketch error distribution that corresponds to fractional normalized rank
 of the standard normal distribution at the specified <i>numStdDev</i>.
-<p>Thus, getUpperBound(1) and getLowerBound(2) represent the 68.3% confidence bounds,
+<p>Thus, getUpperBound(1) and getLowerBound(2) represent the 68.3% confidence bounds,
 getUpperBound(2) and getLowerBound(2) represent the 95.4% confidence bounds, and getUpperBound(3) and getLowerBound(3) represent the 99.7% confidence bounds. <br>
-<p>For some sketches where the error distribution is not Gaussian, special mathematical approximation methods are used.
+<p>For some sketches where the error distribution is not Gaussian, special mathematical approximation methods are used.
 See <a href="#accuracy">Sketch Accuracy</a>.</p>
 <h3><a name="quickSelectTCF">Quick Select TCF</a></h3>
 The fundamental Theta Sketch QuickSelect algorithm is described in classic algorithm texts by Sedgewick and is the Theta Choosing Function (<a href="#tcf">TCF</a>) for the QuickSelect Sketches.
-When the internal hash table of the sketch reaches its internal
-<i>refresh threshold</i>,
-the quick select algorithm is used to select the <code>(k+1)th order statistic</code>
-from the hash table with a complexity of <i>O(n)</i>.
-The value of the selected hash becomes the new
-<a href="#thetaLong">Theta Long</a>
-and immediately makes some number of entries in the table
+When the internal hash table of the sketch reaches its internal
+<i>refresh threshold</i>,
+the quick select algorithm is used to select the <code>(k+1)th order statistic</code>
+from the hash table with a complexity of <i>O(n)</i>.
+The value of the selected hash becomes the new
+<a href="#thetaLong">Theta Long</a>
+and immediately makes some number of entries in the table
 <a href="#dirtyHash">dirty</a>.
-The <i>rebuild()</i> method is called that rebuilds the hash table removing the
+The <i>rebuild()</i> method is called that rebuilds the hash table removing the
 <a href="#dirtyHash">dirty</a> values. Since the value of <a href="#thetaLong">Theta Long</a>
-is only changed when the hash table needs to be rebuilt,
-the values in the hash table are only ever <a href="#dirtyHash">dirty</a>
-briefly during the rebuild process.
+is only changed when the hash table needs to be rebuilt,
+the values in the hash table are only ever <a href="#dirtyHash">dirty</a>
+briefly during the rebuild process.
 Thus, all the values in the hash table are always <a href="#validHash">valid</a> during normal updating of the sketch.
 <p>One of the benefits of using the QuickSelect algorithm for the cache management of the sketch is
-that the number of <a href="#validHash">valid</a> hashes ranges from
-<a href="#nomEntries">nominal entries</a>
-to the current <i>REBUILD_THRESHOLD</i></a>, which is nominally 15/16 * <i>cacheSize</i>.
-This means that without the user forcing
-a <i>rebuild()</i>, the sketch, on average, may be about 50% larger than
+that the number of <a href="#validHash">valid</a> hashes ranges from
+<a href="#nomEntries">nominal entries</a>
+to the current <i>REBUILD_THRESHOLD</i></a>, which is nominally 15/16 * <i>cacheSize</i>.
+This means that without the user forcing
+a <i>rebuild()</i>, the sketch, on average, may be about 50% larger than
 <a href="#nomEntries">nominal entries</a>, about 19% more accurate, and faster.</p>
 <h3><a name="resizeFactor">Resize Factor</a></h3>
 For Theta Sketches, the Resize Factor is a dynamic, speed performance vs. memory size tradeoff. The sketches created on-heap and configured with a Resize Factor of > X1 start out with
-an internal hash table size that is the smallest submultiple of the the target
-<a href="#nomEntries">Nominal Entries</a>
-and larger than the minimum required hash table size for that sketch.
+an internal hash table size that is the smallest submultiple of the the target
+<a href="#nomEntries">Nominal Entries</a>
+and larger than the minimum required hash table size for that sketch.
 When the sketch needs to be resized larger, then the Resize Factor is used as a multiplier of
-the current sketch cache array size. <br>
-"X1" means no resizing is allowed and the sketch will be intialized at full size.<br>
+the current sketch cache array size. <br>
+"X1" means no resizing is allowed and the sketch will be initialized at full size.<br>
 "X2" means the internal cache will start very small and double in size until the target size is reached.<br>
-Similarly, "X4" is a factor of 4 and "X8 is a factor of 8.
+Similarly, "X4" is a factor of 4 and "X8" is a factor of 8.

Review Comment:
   Thank you!
