This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 2f353933 Automatic Site Publish by Buildbot
2f353933 is described below
commit 2f35393373d27e1c8bf0d236f2d929210a761dc1
Author: buildbot <[email protected]>
AuthorDate: Tue Nov 5 20:34:38 2024 +0000
Automatic Site Publish by Buildbot
---
.../{ThetaSketches.html => ThetaResizeFactor.html} | 157 +++++----------------
output/docs/Theta/ThetaSketches.html | 1 +
2 files changed, 34 insertions(+), 124 deletions(-)
diff --git a/output/docs/Theta/ThetaSketches.html
b/output/docs/Theta/ThetaResizeFactor.html
similarity index 64%
copy from output/docs/Theta/ThetaSketches.html
copy to output/docs/Theta/ThetaResizeFactor.html
index e2fbcc7e..0144522c 100644
--- a/output/docs/Theta/ThetaSketches.html
+++ b/output/docs/Theta/ThetaResizeFactor.html
@@ -322,138 +322,47 @@
specific language governing permissions and limitations
under the License.
-->
-<h2 id="contents">Contents</h2>
-<!-- TOC -->
+<h2 id="theta-sketch-resize-factor">Theta Sketch Resize Factor</h2>
+<p>For Theta Sketches, the <em>Resize Factor</em> is a dynamic, speed
performance vs. memory size tradeoff.</p>
+
+<p>Sketches that are created on-heap and configured with a Resize Factor of
> X1 start out with an internal hash table size that is the smallest
submultiple of the the target Nominal Entries, and larger than the minimum
required hash table size, which is 16.</p>
+
+<p>When the sketch needs to be resized larger the <em>Resize Factor</em> is
used as a multiplier of the current sketch cache array size.</p>
+
+<p><strong>X1</strong> means no resizing is allowed and the sketch will be
intialized at full target size.</p>
+
+<p><strong>X2</strong> means the internal cache will start very small and
double in size until the target size is reached.</p>
+
+<p>Similarly, <strong>X4</strong> is a factor of 4 and <strong>X8</strong> is
a factor of 8.</p>
+
+<p>For example, suppose you configure a sketch with a <em>lgK</em> = 10
(<em>k</em> =1024) and a resize factor as follows:</p>
+
<ul>
- <li><a href="#theta-sketch-framework">Theta Sketch Framework</a></li>
- <li>Theta Examples
- <ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ConcurrentThetaSketch.html">Concurrent
Theta Sketch</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaJavaExample.html">Theta
Sketch Java Example</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSparkExample.html">Theta
Sketch Spark Example</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaPigUDFs.html">Theta
Sketch Pig UDFs</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaHiveUDFs.html">Theta
Sketch Hive UDFs</a></li>
- </ul>
- </li>
- <li>KMV Tutorial
- <ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/InverseEstimate.html">The
Inverse Estimate</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/KMVempty.html">Empty
Sketch</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/KMVfirstEst.html">First
Estimator</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/KMVbetterEst.html">Better
Estimator</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/KMVrejection.html">Rejection
Rules</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/KMVupdateVkth.html">Update
V(kth) Rule</a></li>
- </ul>
- </li>
- <li>Set Operations and P-sampling
+ <li><strong>X8</strong> (the default):<br />
+Internally the sketch will start out with an internal hash array of size 16
(which corresponds to the smallest allowed value of k). Its memory size is
128+24 bytes.
<ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSketchSetOps.html">Set
Operations</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSetOpsCornerCases.html">Model
& Test Set Operations</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaPSampling.html">p-Sampling</a></li>
+ <li>When the hash array has filled it will grow the hash array by a
factor of 8, to 1024+24 bytes.</li>
+ <li>When it needs to grow again it will grow by a factor of 8 again, to
819+24 bytes, which is the max size.</li>
+ <li>So the growth sequence of the effective internal k is 16, 128, 1024
for a total of 2 resize operations.</li>
</ul>
</li>
- <li>Accuracy
- <ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaAccuracy.html">Basic
Accuracy</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaAccuracyPlots.html">Accuracy
Plots</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html">Relative
Error Table</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSketchSetOpsAccuracy.html">SetOp
Accuracy</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/AccuracyOfDifferentKUnions.html">Unions
With Different k</a></li>
- </ul>
+ <li>
+ <p><strong>Similarly with X4:</strong><br />
+The growth sequence is 16, 64, 256, 1024 for a total of 3 resize
operations.</p>
</li>
- <li>Size
- <ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSize.html">Theta Sketch
Size</a></li>
- </ul>
- </li>
- <li>Speed
- <ul>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaUpdateSpeed.html">Update
Speed</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaMergeSpeed.html">Merge
Speed</a></li>
- </ul>
- </li>
- <li>Theta Sketch Theory
- <ul>
- <li><a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchFramework.pdf">Theta
Sketch Framework (PDF)</a></li>
- <li><a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchEquations.pdf">Theta
Sketch Equations (PDF)</a></li>
- <li><a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/DataSketches.pdf">Data
Sketches (PDF)</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaConfidenceIntervals.html">Confidence
Intervals; Notes</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaMergingAlgorithm.html">Merging
Algorithm; Notes</a></li>
- <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaReferences.html">Theta
References</a>
-<!-- TOC --></li>
- </ul>
+ <li>
+ <p><strong>Similarly with X2:</strong><br />
+The growth sequence is 16, 32, 64, 128, 256, 512, 1024 for a total of 6 resize
operations.</p>
</li>
+ <li><strong>With X1:</strong><br />
+The growth multiplier is one, so it must start the sketch at full size of 1024
with no resizing operations.</li>
</ul>
-<p><a id="theta-sketch-framework"></a></p>
-<h2 id="theta-sketch-framework">Theta Sketch Framework</h2>
-<p>Theta Sketches are a generalization of the well known <i>K<sup>th</sup>
Minimum Value</i> (KMV) <sup id="fnref:1"><a href="#fn:1"
class="footnote">1</a></sup><sup>,</sup><sup id="fnref:2"><a href="#fn:2"
class="footnote">2</a></sup>
-sketches in that KMV sketches are a form of Theta Sketch, but not all Theta
Sketches are KMV.</p>
-
-<p>The <a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchFramework.pdf">Theta
Sketch Framework</a> (TSF)
-is a mathematical framework
-defined in a multi-stream setting that enables set expressions over these
streams and encompasses many
-different sketching algorithms. A rudimentary introduction to the mathematics
of the simpler sketch algorithms is developed in
-the <a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchEquations.pdf">Theta
Sketch Equations</a> document.</p>
-
-<p>The TSF consists of the following components:</p>
-
-<ol>
- <li>A data type <i>(θ,S)</i>, known as a <i>Theta Sketch</i>, where 0 <
<i>θ</i> < 1 is a
-threshold, and <i>S</i>, the number of entries, is the set of all unique
hashed stream items 0 < <i>x</i> < 1
-that are less than <i>θ</i>.</li>
- <li>A universal “combining function” <i>ThetaUnion</i> that takes as input a
collection of <i>Theta Sketches</i>
-and returns a single <i>Theta Sketch</i> that is a <i>Union</i> of the input
sketches.
-This combining function is extended to set <i>Intersections</i> and
<i>Differences</i> as well.</li>
- <li>A estimator function that takes as input a <i>Theta Sketch</i> and
returns an estimate of the unique
-hashed stream items presented to all the input sketches.</li>
-</ol>
-
-<p>The TSF enables this sketch library to encompass multiple sketching
algorithms including the
-KMV sketch with a common API and greatly simplifies implementation of set
-expressions.</p>
-
-<p>Note that in the KMV sketch the value <i>k</i> is overloaded with multiple
roles:</p>
-
-<ol>
- <li><i>k</i> determines the RSE (accuracy) of the sketch</li>
- <li><i>k</i> determines the upper-bound size of the sketch</li>
- <li><i>k</i> is used as a constant in the estimator and RSE equations</li>
- <li><i>k</i> determines the <i>V(k<sup>th</sup>)</i> threshold, used to
reject/accept hash values into the cache.</li>
-</ol>
-
-<p>By unloading some of these roles, we will gain degrees of freedom to do
some innovative things.</p>
-
-<p>Instead of having to track <i>V(k<sup>th</sup>)</i>, which is a member of
the list of hash values,
-we are going to create a separate threshold variable and call it <i>theta
(θ)</i>.
-This effectively decouples #3 and #4 above from <i>k</i>. When the sketch is
empty <i>θ</i> = 1.0.
-After the sketch has filled with <i>k</i> minimum values <i>θ</i> is still
1.0.
-When the next incoming unique value must be inserted into the sketch the
<i>(k+1)<sup>th</sup></i>
-minimum value, is assigned to <i>θ</i> and removed from the cache.<sup
id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
-
-<p>Ultimately, it will be the size of <i>S</i>, <i>|S|</i>, that will
determine the stored size of a
-sketch, which decouples #2 above from the value <i>k</i>.
-The <i>Nominal Entries</i> or <i>k</i> is a <i>user specified, configuration
parameter</i>,
-which is used by the software to determine the target accuracy of the sketch
and the maximum size of the sketch.</p>
-
-<p>The unbiased estimate simplifies to |S|/<i>θ</i>, which is just the size of
<i>S</i> divided by <i>θ</i>.
-We will discuss the RSE in a later section.</p>
-
-<p><img class="doc-img-full"
src="https://datasketches.apache.org/docs/img/theta/ThetaSketch1.png"
alt="ThetaSketch1" /></p>
-
-<div class="footnotes">
- <ol>
- <li id="fn:1">
- <p>Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan.
Counting distinct elements in a data stream. In <i>Randomization and
Approximation Techniques in Computer Science</i>, pages 1–10. Springer, 2002.
<a href="#fnref:1" class="reversefootnote">↩</a></p>
- </li>
- <li id="fn:2">
- <p>See <a href="/docs/Theta/InverseEstimate.html">KMV Tutorial</a> for a
brief tutorial on KMV Sketches. <a href="#fnref:2"
class="reversefootnote">↩</a></p>
- </li>
- <li id="fn:3">
- <p>This is a limited “KMV perspective” on how <i>θ</i> gets assigned.
The attached paper <a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchFramework.pdf">Theta
Sketch Framework</a> presents multiple ways that <i>θ</i> can be assigned
using the <i>Theta Choosing Function (TCF)</i>. Different sketch algorithms
have different TCFs. <a href="#fnref:3" class="reversefootnote">↩</a></p>
- </li>
- </ol>
-</div>
+<p>Resizing the internal hash array is an expensive operation, relatively
speaking, so if real-time performance is critical you might set the resize
factor to <strong>X1</strong>, but you pay in memory usage up front with a
sketch that is always full size.</p>
+
+<p>If you want to minimize memory usage then you want the sketch to stay as
small as possible and let it grow only as required, so you would use
<strong>X2</strong>, but the real-time performance will be a little slower
because of the resizing operations.</p>
+
+<p><strong>X4</strong> and <strong>X8</strong> are in between
<strong>X1</strong> and <strong>X2</strong>.</p>
</div> <!-- End content -->
</div> <!-- End row -->
diff --git a/output/docs/Theta/ThetaSketches.html
b/output/docs/Theta/ThetaSketches.html
index e2fbcc7e..9acd30b3 100644
--- a/output/docs/Theta/ThetaSketches.html
+++ b/output/docs/Theta/ThetaSketches.html
@@ -364,6 +364,7 @@
<li>Size
<ul>
<li><a
href="https://datasketches.apache.org/docs/Theta/ThetaSize.html">Theta Sketch
Size</a></li>
+ <li><a
href="https://datasketches.apache.org/docs/Theta/ThetaResizeFactor.html">Theta
Resize Factor</a></li>
</ul>
</li>
<li>Speed
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]