This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 4e19883  Automatic Site Publish by Buildbot
4e19883 is described below

commit 4e1988342639f9eb52e85b4ec0cabc4031aa76d8
Author: buildbot <[email protected]>
AuthorDate: Thu Dec 16 01:36:32 2021 +0000

    Automatic Site Publish by Buildbot
---
 output/docs/Theta/ThetaSetOpsCornerCases.html | 126 ++++++++++++--------------
 1 file changed, 59 insertions(+), 67 deletions(-)

diff --git a/output/docs/Theta/ThetaSetOpsCornerCases.html 
b/output/docs/Theta/ThetaSetOpsCornerCases.html
index 4d5b981..590639d 100644
--- a/output/docs/Theta/ThetaSetOpsCornerCases.html
+++ b/output/docs/Theta/ThetaSetOpsCornerCases.html
@@ -511,7 +511,9 @@
 
 <h1 id="theta-sketch-and-tuple-sketch-set-operation-corner-cases">Theta Sketch 
and Tuple Sketch Set Operation Corner Cases</h1>
 
-<p>The <em>TupleSketch</em> is an extension of the <em>ThetaSketch</em> and 
both are part of the <em>Theta Sketch Framework</em><sup>1</sup>. In this 
document, the term <em>Theta</em> (upper case) when referencing sketches will 
refer to both the <em>ThetaSketch</em> and the <em>TupleSketch</em>.  This is 
not to be confused with the term <em>theta</em> (lower case), which refers to 
the sketch variable that tracks the sampling probability of the sketch.</p>
+<p>The <em>TupleSketch</em> is an extension of the <em>ThetaSketch</em> and 
both are part of the <em>Theta Sketch Framework</em><sup>1</sup>. 
+In this document, the term <em>Theta</em> (upper case) when referencing 
sketches will refer to both the <em>ThetaSketch</em> and the 
<em>TupleSketch</em>.<br />
+This is not to be confused with the term <em>theta</em> (lower case), which 
refers to the sketch variable that tracks the sampling probability of the 
sketch.</p>
 
 <p>Because Theta sketches provide the set operations of <em>intersection</em> 
and <em>difference</em> (<em>A and not B</em> or just <em>A not B</em>), a 
number of corner cases arise that require some analysis to determine how the 
code should handle them.</p>
 
@@ -533,7 +535,8 @@
   </li>
 </ul>
 
-<p>We have developed a shorthand notation for these three variables to record 
their state as <em>{theta, retained entries, empty}</em>. When analyzing the 
corner cases of the set operations, we only need to know whether <em>theta</em> 
is 1.0 or less than 1.0, <em>retained entries</em> is zero or greater than 
zero, and <em>empty</em> is true or false. These are further abbreviated as</p>
+<p>We have developed a shorthand notation for these three variables to record 
their state as <em>{theta, retained entries, empty}</em>. 
+When analyzing the corner cases of the set operations, we only need to know 
whether <em>theta</em> is 1.0 or less than 1.0, <em>retained entries</em> is 
zero or greater than zero, and <em>empty</em> is true or false. These are 
further abbreviated as</p>
 
 <ul>
   <li><em>theta</em> can be <em>1.0</em> or <em>&lt;1.0</em></li>
@@ -541,42 +544,54 @@
   <li><em>empty</em> can be either <em>T</em> or <em>F</em></li>
 </ul>
 
-<p>Each of the above three states can be represented as a boolean variable. 
Thus, there are 8 possible combinations of the three variables.</p>
+<p>Each of the above three states can be represented as a boolean variable. 
+Thus, there are 8 possible combinations of the three variables.</p>
 
 <hr />
 
-<p><sup>1</sup> Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin 
Thaler. A framework for estimating stream expression cardinalities. In 
*EDBT/ICDT Proceedings ‘16 *, pages 6:1–6:17, 2016.</p>
+<p><sup>1</sup> Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin 
Thaler. A framework for estimating stream expression cardinalities. In 
<em>EDBT/ICDT Proceedings 2016</em>, pages 6:1–6:17, 2016.</p>
 
 <h2 id="valid-states-of-a-sketch">Valid States of a Sketch</h2>
 
 <p>Of the eight possible combinations of the three boolean variables and using 
the above notation, there are four valid states of a <em>Theta</em> sketch.</p>
 
 <h3 id="empty10-0-t">Empty{1.0, 0, T}</h3>
-<p>When a new sketch is created, <em>theta</em> is set to 1.0, <em>retained 
entries</em> is set to zero, and <em>empty</em> is true. This state can also 
occur as the result of a set operation, where the operation creates a new 
sketch to potentially load result data into the sketch but there is no data to 
load into the sketch. So it effectively returns a new empty sketch that has 
been untouched and unaffected by the input arguments to the set operation.</p>
+<p>When a new sketch is created, <em>theta</em> is set to 1.0, <em>retained 
entries</em> is set to zero, and <em>empty</em> is true. 
+This state can also occur as the result of a set operation, where the 
operation creates a new sketch to potentially load result data into the sketch 
but there is no data to load into the sketch. 
+So it effectively returns a new empty sketch that has been untouched and 
unaffected by the input arguments to the set operation.</p>
 
 <h3 id="exact10-0-f">Exact{1.0, &gt;0, F}</h3>
-<p>All of the <em>Theta</em> sketches have an internal buffer that is 
effectively a list of hash values of the items received by the sketch. If the 
number of distinct input items does not exceed the size of that buffer, the 
sketch is in <em>exact</em> mode. There is no probabilistic estimation involved 
so <em>theta = 1.0</em>, which indicates that all distinct values are in the 
buffer. <em>retained entries</em> is the count of those values in the buffer, 
and the sketch is not <em>empty</ [...]
+<p>All of the <em>Theta</em> sketches have an internal buffer that is 
effectively a list of hash values of the items received by the sketch. 
+If the number of distinct input items does not exceed the size of that buffer, 
the sketch is in <em>exact</em> mode. 
+There is no probabilistic estimation involved so <em>theta = 1.0</em>, which 
indicates that all distinct values are in the buffer. 
+<em>retained entries</em> is the count of those values in the buffer, and the 
sketch is not <em>empty</em>.</p>
 
 <h3 id="estimation10-0-f">Estimation{&lt;1.0, &gt;0, F}</h3>
 <p>Here, the number of distinct inputs to the sketch has exceeded the size of 
the buffer, so the sketch must start choosing which values to retain and 
start reducing the value of <em>theta</em> accordingly. <em>theta 
&lt; 1.0</em>, <em>retained entries &gt; 0</em>, and <em>empty = F</em>.</p>
 
-<h3 id="degenerate10-0-f">Degenerate{&lt;1.0, 0, F}</h3>
+<h3 id="degenerate10-0-f2">Degenerate{&lt;1.0, 0, F}<sup>2</sup></h3>
 <p>This requires some explanation.</p>
 
-<p>Imagine we have two large data sets, A and B, with only a few items in 
common. The exact intersection of these two sets, <em>A∩B</em> would result
-in those few common items.</p>
+<p>Imagine we have two large data sets, A and B, with only a few items in 
common. 
+The exact intersection of these two sets, <em>A∩B</em>, would consist of those 
few common items.</p>
 
-<p>Now suppose we compute Sketch(A) and Sketch(B). Because sketches are 
approximate and the items from each set are chosen at random, 
-there is some probability that one of the sketches may not contain any of the 
common items. 
+<p>Now suppose we compute Sketch(A) and Sketch(B). 
+Because sketches are approximate and the items from each set are chosen at 
random, there is some probability that one of the sketches may not contain any 
of the common items. 
 As a result, the sketch intersection of these two sets, 
<em>Sketch(A)∩Sketch(B)</em>, which is also approximate, might contain zero 
retained entries. 
-Even though the retained entries is zero, the upper bound of the estimated 
number of distinct values from the input domain is clearly greater than zero, 
but missed by the sketch intersection. This upper bound can be computed 
statistically. It is too complex to discuss further here, but the sketch code 
actually performs this estimation.</p>
-
-<p>Where both input sketches are non-empty, there is a non-zero probability 
that the intersection will have zero entries, yet the statistics tells us that 
the result may
-not be really empty, we may have been just unlucky.  We indicate this by 
setting the result <em>empty = F</em>, and <em>retained entries = 0</em>. The 
resulting <em>theta = min(thetaA, thetaB)</em>. 
+Even though there are zero retained entries, the upper bound of the estimated 
number of distinct values from the input domain is clearly greater than zero; 
the sketch intersection simply missed them. 
+This upper bound can be computed statistically. 
+The computation is too complex to discuss further here, but the sketch code 
actually performs this estimation.</p>
+
+<p>Where both input sketches are non-empty, there is a non-zero probability 
that the intersection will have zero entries, yet the statistics tell us that 
the result may
+not really be empty; we may have just been unlucky.<br />
+We indicate this by setting the result to <em>empty = F</em> and <em>retained 
entries = 0</em>. 
+The resulting <em>theta = min(thetaA, thetaB)</em>. 
 Calling <em>getUpperBound(…)</em> on the resulting intersection will reveal 
the best estimate of how many values might exist in the intersection of the raw 
data. 
 The <em>getLowerBound(…)</em> will be zero because it is also possible that 
the two sets, A and B, were exactly disjoint.</p>
 
-<p>Note that this degenerate state can also result from an AnotB operation or 
the Union operation, which will be demonstrated below.</p>
+<hr />
+
+<p><sup>2</sup>Note that this degenerate state can also result from an AnotB 
operation or the Union operation, which will be discussed below.</p>
 
 <h3 id="summary-table-of-the-valid-states-of-a-sketch">Summary Table of the 
Valid States of a Sketch</h3>
 <p>The <em>Has Seen Data</em> column is not an independent variable, but helps 
with the interpretation of the state.</p>
@@ -620,7 +635,7 @@ The <em>getLowerBound(…)</em> will be zero because it is 
also possible that th
       <td style="text-align: center">F</td>
       <td style="text-align: center">T</td>
       <td style="text-align: center">6</td>
-      <td style="text-align: left">Exact Mode</td>
+      <td style="text-align: left">Exact Mode Sketch</td>
     </tr>
     <tr>
       <td style="text-align: center">Estimation<br />{&lt;1.0,&gt;0,F}</td>
@@ -629,7 +644,7 @@ The <em>getLowerBound(…)</em> will be zero because it is 
also possible that th
       <td style="text-align: center">F</td>
       <td style="text-align: center">T</td>
       <td style="text-align: center">2</td>
-      <td style="text-align: left">Estimation Mode</td>
+      <td style="text-align: left">Estimation Mode Sketch</td>
     </tr>
     <tr>
       <td style="text-align: center">Degenerate<br 
/>{&lt;1.0,0,F}<sup>3</sup></td>
@@ -638,61 +653,33 @@ The <em>getLowerBound(…)</em> will be zero because it is 
also possible that th
       <td style="text-align: center">F</td>
       <td style="text-align: center">T</td>
       <td style="text-align: center">0</td>
-      <td style="text-align: left">Valid Intersect<br />or AnotB result</td>
+      <td style="text-align: left">Degenerate and valid<br />Intersect or 
AnotB result</td>
     </tr>
   </tbody>
 </table>
 
 <hr />
 
-<p><sup>3</sup> <em>Degenerate</em>: Can appear as a result of a an 
Intersection or AnotB of certain combination of sketches.</p>
+<p><sup>3</sup> <em>Degenerate</em>: This can occur as an estimating result of 
an Intersection of two disjoint sets, 
+an AnotB of two identical sets, or the Union of two <em>Degenerate</em> 
sets.</p>
 
 <h2 id="invalid-states-of-a-sketch">Invalid States of a Sketch</h2>
 <p>The remaining four combinations of the variables are invalid and should not 
occur.</p>
 
 <p>The <em>Has Seen Data</em> column is not an independent variable, but helps 
with the interpretation of the state.</p>
 
-<table>
-  <thead>
-    <tr>
-      <th style="text-align: center">Theta</th>
-      <th style="text-align: center">Retained<br />Entries</th>
-      <th style="text-align: center">Empty<br />Flag</th>
-      <th style="text-align: center">Has Seen<br />Data</th>
-      <th style="text-align: left">Comments</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td style="text-align: center">1.0</td>
-      <td style="text-align: center">0</td>
-      <td style="text-align: center">F</td>
-      <td style="text-align: center">T</td>
-      <td style="text-align: left">If it has seen data Empty = F. <br />∴ 
Theta cannot be = 1.0 AND Entries = 0</td>
-    </tr>
-    <tr>
-      <td style="text-align: center">1.0</td>
-      <td style="text-align: center">&gt;0</td>
-      <td style="text-align: center">T</td>
-      <td style="text-align: center">F</td>
-      <td style="text-align: left">If it has not seen data Empty = T. <br />∴ 
Entries cannot be &gt; 0</td>
-    </tr>
-    <tr>
-      <td style="text-align: center">&lt;1.0</td>
-      <td style="text-align: center">&gt;0</td>
-      <td style="text-align: center">T</td>
-      <td style="text-align: center">F</td>
-      <td style="text-align: left">If it has not seen data, Empty = T. <br />∴ 
Theta cannot be &lt; 1.0 OR Entries &gt; 0</td>
-    </tr>
-    <tr>
-      <td style="text-align: center">&lt;1.0</td>
-      <td style="text-align: center">0</td>
-      <td style="text-align: center">T</td>
-      <td style="text-align: center">F</td>
-      <td style="text-align: left">If it has not seen data, Empty = T. <br />∴ 
Theta cannot be &lt; 1.0</td>
-    </tr>
-  </tbody>
-</table>
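+<table>
+  <thead>
+    <tr>
+      <th style="text-align: center">Theta</th>
+      <th style="text-align: center">Retained<br />Entries</th>
+      <th style="text-align: center">Empty<br />Flag</th>
+      <th style="text-align: center">Has Seen<br />Data</th>
+      <th style="text-align: left">Comments</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td style="text-align: center">1.0</td>
+      <td style="text-align: center">0</td>
+      <td style="text-align: center">F</td>
+      <td style="text-align: center">T</td>
+      <td style="text-align: left">If it has seen data, Empty = 
F.<sup>4</sup> <br />∴ Theta cannot be = 1.0 AND Entries = 0</td>
+    </tr>
+    <tr>
+      <td style="text-align: center">1.0</td>
+      <td style="text-align: center">&gt;0</td>
+      <td style="text-align: center">T</td>
+      <td style="text-align: center">F</td>
+      <td style="text-align: left">If it has not seen data, Empty = T. <br />∴ 
Entries cannot be &gt; 0</td>
+    </tr>
+    <tr>
+      <td style="text-align: center">&lt;1.0</td>
+      <td style="text-align: center">&gt;0</td>
+      <td style="text-align: center">T</td>
+      <td style="text-align: center">F</td>
+      <td style="text-align: left">If it has not seen data, Empty = T. <br />∴ 
Theta cannot be &lt; 1.0 OR Entries &gt; 0</td>
+    </tr>
+    <tr>
+      <td style="text-align: center">&lt;1.0</td>
+      <td style="text-align: center">0</td>
+      <td style="text-align: center">T</td>
+      <td style="text-align: center">F</td>
+      <td style="text-align: left">If it has not seen data, Empty = 
T.<sup>5</sup> <br />∴ Theta cannot be &lt; 1.0</td>
+    </tr>
+  </tbody>
+</table>
+
+<hr />
+
+<p><sup>4</sup>This can occur internally as the result of an intersection of 
two exact, disjoint sets, or an AnotB of two exact, identical sets.
+There is no probability distribution, so this is converted internally to EMPTY 
{1.0, 0, T}. A Union cannot produce this result.</p>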
+
+<p><sup>5</sup>This can occur internally as the initial state of an 
UpdateSketch if p was set to less than 1.0 by the user and the sketch has not 
seen any data.
+There is no probability distribution because the sketch has not been offered 
any data, so this is converted internally to EMPTY {1.0, 0, T}.</p>
 
 <h2 id="state-combinations-of-two-sketches-and-set-operation-results">State 
Combinations of Two Sketches and Set Operation Results</h2>
 <p>Each sketch can have four valid states, which means we can have 16 
combinations of states of two sketches as expanded in the following table.</p>
@@ -971,14 +958,19 @@ The <em>getLowerBound(…)</em> will be zero because it is 
also possible that th
 </ul>
 
 <h2 id="testing">Testing</h2>
-<p>The above information is encoded as a model into the special class <em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.SetOperationCornerCases.java">org.apache.datasketches.SetOperationsCornerCases</a></em>.
 This class is made up of enums and static methods to quickly determine for a 
sketch what actions to take based on the state of the input arguments. This 
model is independent of the implementation of the Theta Sketch, whether t [...]
+<p>The above information is encoded as a model into the special class 
+<em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.SetOperationCornerCases.java">org.apache.datasketches.SetOperationCornerCases</a></em>.
 
+This class is made up of enums and static methods that quickly determine, for 
a sketch, what actions to take based on the state of the input arguments. 
+This model is independent of the implementation of the Theta Sketch and of 
whether the set operation is performed on a Theta Sketch or a Tuple Sketch; 
when translated, it can be used in other languages as well.</p>
 
-<p>Before this model was put to use an extensive set of tests was designed to 
test any potential implementation against this model. These tests are slightly 
different for the Tuple Sketch than the Theta Sketch because the Tuple Sketch 
has more combinations to test, but the model is the same.</p>
+<p>Before this model was put to use, an extensive set of tests was designed to 
test any potential implementation against this model. 
+These tests are slightly different for the Tuple Sketch than for the Theta 
Sketch because the Tuple Sketch has more combinations to test, but the model is 
the same.</p>
 
-<ul>
-  <li>The tests for the Theta Sketch can be found in the class <em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.theta.CornerCaseThetaSetOperationsTest.java">org.apache.datasketches.theta.CornerCaseThetaSetOperationsTest</a></em></li>
-  <li>The tests for the Tuple Sketch can be found in the class <em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.tuple.aninteger.CornerCaseTupleSetOperationsTest.java">org.apache.datasketches.tuple.aninteger.CornerCaseTupleSetOperationsTest</a></em></li>
-</ul>
+<p>The tests for the Theta Sketch can be found in the class 
+<em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.theta.CornerCaseThetaSetOperationsTest.java">org.apache.datasketches.theta.CornerCaseThetaSetOperationsTest</a></em></p>
+
+<p>The tests for the Tuple Sketch can be found in the class 
+<em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.tuple.aninteger.CornerCaseTupleSetOperationsTest.java">org.apache.datasketches.tuple.aninteger.CornerCaseTupleSetOperationsTest</a></em></p>
 
 <p>The details of how this model is used in run-time code can be found in the 
class <em><a 
href="https://github.com/apache/datasketches-java/blob/master/src/main/java/org.apache.datasketches.tuple.AnotB.java">org.apache.datasketches.tuple.AnotB.java</a></em>.</p>
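
The {theta, retained entries, empty} state model described in the page text above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the library's SetOperationCornerCases class: the names SketchStateModel, State, classify, and zeroEntryIntersection are hypothetical and were introduced here only to show how the eight combinations split into four valid and four invalid states, and how a zero-entry intersection of two non-empty sketches is either Degenerate or converted to Empty.

```java
import java.util.Optional;

/** Hypothetical model of the {theta, retained entries, empty} shorthand; not the library API. */
final class SketchStateModel {

  /** The four valid states named in the page text. */
  enum State { EMPTY, EXACT, ESTIMATION, DEGENERATE }

  /**
   * Maps the three variables to one of the four valid states.
   * Returns Optional.empty() for the four invalid combinations.
   * Assumes 0 < theta <= 1.0 and retainedEntries >= 0.
   */
  static Optional<State> classify(double theta, int retainedEntries, boolean empty) {
    boolean thetaIsOne = theta == 1.0;
    boolean hasEntries = retainedEntries > 0;
    if (thetaIsOne && !hasEntries && empty) { return Optional.of(State.EMPTY); }
    if (thetaIsOne && hasEntries && !empty) { return Optional.of(State.EXACT); }
    if (!thetaIsOne && hasEntries && !empty) { return Optional.of(State.ESTIMATION); }
    if (!thetaIsOne && !hasEntries && !empty) { return Optional.of(State.DEGENERATE); }
    return Optional.empty(); // {1.0,0,F}, {1.0,>0,T}, {<1.0,>0,T}, {<1.0,0,T}
  }

  /**
   * State of an intersection of two non-empty sketches that retained zero entries.
   * The result theta is min(thetaA, thetaB). If that is 1.0 (both inputs exact),
   * there is no probability distribution and the result is treated as EMPTY;
   * otherwise it is DEGENERATE.
   */
  static State zeroEntryIntersection(double thetaA, double thetaB) {
    double theta = Math.min(thetaA, thetaB);
    return theta == 1.0 ? State.EMPTY : State.DEGENERATE;
  }

  public static void main(String[] args) {
    System.out.println(classify(1.0, 0, true));   // a new, untouched sketch
    System.out.println(classify(0.5, 0, false));  // zero entries but theta < 1.0
    System.out.println(classify(1.0, 0, false));  // one of the invalid combinations
    System.out.println(zeroEntryIntersection(0.5, 1.0));
  }
}
```

Returning an Optional for the invalid combinations is just one design choice for this sketch; an implementation could equally throw an exception or, as the footnotes describe for two of the cases, convert the state internally to Empty.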
 
