This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new b496e0c  Automatic Site Publish by Buildbot
b496e0c is described below

commit b496e0c3bebb865707e9e6e23bdbdb16a1ddadaa
Author: buildbot <[email protected]>
AuthorDate: Sat Oct 30 05:53:14 2021 +0000

    Automatic Site Publish by Buildbot
---
 output/docs/Theta/ThetaSetOpsCornerCases.md | 143 ++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/output/docs/Theta/ThetaSetOpsCornerCases.md 
b/output/docs/Theta/ThetaSetOpsCornerCases.md
new file mode 100644
index 0000000..fdb1e89
--- /dev/null
+++ b/output/docs/Theta/ThetaSetOpsCornerCases.md
@@ -0,0 +1,143 @@
+# Theta Sketch and Tuple Sketch Set Operation Corner Cases
+
+The *TupleSketch* is an extension of the *ThetaSketch* and both are part of 
the *Theta Sketch Framework*<sup>1</sup>. In this document, the term *Theta* 
(upper case) when referencing sketches will refer to both the *ThetaSketch* and 
the *TupleSketch*.  This is not to be confused with the term *theta* (lower 
case), which refers to the sketch variable that tracks the sampling probability 
of the sketch.
+
+Because Theta sketches provide the set operations of *intersection* and 
*difference* (*A and not B* or just *A not B*), a number of interesting corner 
cases arise that require some analysis to determine how the code should handle 
them.
+
+Theta sketches track three key variables in addition to retained data:
+
+* *theta*: This is the current sampling probability of the sketch and 
mathematically expressed as a 64-bit, double floating value between 0.0 and 
1.0. However, internally in the sketch, this value is expressed as a 64-bit, 
signed, long integer (usually identified as *thetaLong* in the code), where the 
maximum positive value (*Long.MAX_VALUE*) is interpreted as the double 1.0. In 
this document we will only refer to the mathematical quantity *theta*.
+
+* *retained entries* or *count*: This is the number of hash values currently 
retained in the sketch. It can never be less than zero.
+
+* *empty*: 
+    * By definition, if *empty = true*, the number of *retained entries* must 
be zero.  However, the value of *theta* can be 1.0 or less-than 1.0.
+    * If *empty* = false, the *retained entries* can be zero or greater than 
zero, and *theta* can be 1.0 or less than 1.0.
+
+We have developed a short hand notation for these three variables to record 
their state as *{theta, retained entries, empty}*. When analyzing the corner 
cases of the set operations, we only need to know whether *theta* is 1.0 or 
less than 1.0, *retained entries* is zero or greater than zero, and *empty* is 
true or false.  These are further abbreviated as
+
+* *theta* can be *1.0* or *<1.0*
+* *retained entries* can be either *0* or *>0*
+* *empty* can be either *T* or *F*
+
+Each of the above three variables can be represented as boolean variable. 
Thus, there are 8 possible combinations of the three variables.
+
+<sup>1</sup> Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A 
framework for estimating stream expression cardinalities. In *EDBT/ICDT 
Proceedings ‘16 *, pages 6:1–6:17, 2016.
+
+## Valid States of a Sketch
+
+Of the eight possible combinations of the three variables and using the above 
notation, there are five valid states of a *Theta* sketch.
+
+### New{1.0, 0, T}
+When a new sketch is created, *theta* is set to 1.0, *retained entries* is set 
to zero, and *empty* is true.  This state can also occur as the result of a set 
operation, where the operation creates a new sketch to potentially load result 
data into the sketch but there is no data to load into the sketch.  So it 
effectively returns a new sketch that has been untouched and unaffected by the 
input arguments to the set operation.
+
+### Exact{1.0, >0, F}
+All of the *Theta* sketches have an input buffer that is effectively a list of 
items received by the sketch. If the number of unique input values does not 
exceed the size of that buffer, the sketch is in *exact* mode.  There is no 
probabilistic estimation involved so *theta = 1.0*, which indicates that all 
unique values presented to the sketch are in the buffer. *retained entries* is 
the count of those values in the buffer, and the sketch is clearly not *empty*.
+
+### Estimation{<1.0, >0, F}
+Here, the number of inputs to the sketch have exceeded the size of the input 
buffer, so the sketch must start choosing what values to retain in the sketch 
and starts reducing the value of *theta* accordingly.  *theta < 1.0*, *retained 
entries > 0*, and *empty = F*.
+
+### NewDegen{<1.0, 0, T}
+This is a new sketch where the user has set the sampling probability, *p < 
1.0* and the sketch has not been presented any data.  Internally at 
initialization, *theta* is set to *p*, so if *p = 0.5*, *theta* will be set to 
*0.5*. Since the sketch has not seen any data, *retained entries = 0* and 
*empty = T*.  This is degenerative form of a new sketch, thus its name.
+
+### ResultDegen{<1.0, 0, F}
+This requires some explanation.  Imagine the intersection of two estimating 
sketches where the values retained in the two sketches are disjoint (i.e, no 
overlap).  Since the two sketches chose their internal values at random, there 
remains some probability that there could be common values in an exactly 
computed intersection, but it just so happens that one of the two sketches did 
not select any of them in the random sampling process.  Therefore, the 
*retained entries = 0*. The value *1. [...]
+
+### Summary Table of the Valid States of a Sketch
+The *Has Seen Data* column is not an independent variable, but helps with the 
interpretation of the state.
+
+We can assign a single octal digit ID to each state where
+
+* *theta = 1.0 := 4, else 0*
+* *retained entries >0 := 2, else 0*
+* *empty = true := 1, else 0* 
+
+| Shorthand Notation                | theta | retained entries |    empty   | 
Has Seen Data | ID | Comments                       |
+|:---------------------------------:|:-----:|:----------------:|:----------:|:-------------:|:--:|:------------------------------:|
+| New {1.0,0,T}                     |  1.0  |         0        |     T      |  
     F       |  5 | New Sketch, p=1.0 (default)    |
+| Exact {1.0,>0,F}                  |  1.0  |        >0        |     F      |  
     T       |  6 | Exact Mode                     |
+| Estimation {<1.0,>0,F}            | <1.0  |        >0        |     F      |  
     T       |  2 | Estimation Mode                |
+| NewDegen {<1.0,0,T}<sup>2</sup>   | <1.0  |         0        |     T      |  
     F       |  1 | New Sketch, user sets p<1.0    |
+| ResultDegen {<1.0,0,F}<sup>3</sup>| <1.0  |         0        |     F      |  
     T       |  0 | Valid Intersect or AnotB result   |
+
+<sup>2</sup> *New Degenerate*: New Empty Sketch where the user sets *p < 1.0*. 
This can be safely reinterpreted as {1.0,0,T} because it has not seen any 
data.<br>
+<sup>3</sup> *Result Degenerate*: Can appear as a result of a an Intersection 
or AnotB of certain combination of sketches.
+
+## Invalid States of a Sketch
+The remaining three combinations of the variables represent internal errors 
and should not occur. 
+The *Has Seen Data* column is not an independent variable, but helps with the 
interpretation of the state.
+
+| Theta | Retained Entries | Empty Flag | Has Seen Data | Comments             
                              |
+|:-----:|:----------------:|:----------:|:-------------:|:--------------------------------------------------:|
+|  1.0  |        0         |      T     |       T       | If it has seen data, 
Theta != 1.0 AND Entries = 0. |
+|  1.0  |       >0         |      F     |       F       | If it has not seen 
data, Entries !> 0.             |
+| <1.0  |       >0         |      F     |       F       | If it has not seen 
data, Entries !> 0.             |
+
+
+## Combinations of States of Two Sketches
+Each sketch can have 5 valid states, which means we can have 25 combinations 
of states of two sketches as expanded in the following table.
+ 
+| ID | Sketch A               | Sketch B               | Intersection Result  
| AnotB Result        | Result Actions |
+|:--:|:----------------------:|:----------------------:|:--------------------:|:-------------------:|:--------------:|
+| 00 | ResultDegen {<1.0,0,F} | ResultDegen {<1.0,0,F} | New {minTheta,0,F}   
| New {minTheta,0,F}  | 2,2            |
+| 01 | ResultDegen {<1.0,0,F} | NewDegen {<1.0,0,T}    | New {1.0,0,T}        
| New {ThetaA,0,F}    | 1,3            |
+| 02 | ResultDegen {<1.0,0,F} | Estimation {<1.0,>0,F} | New {minTheta,0,F}   
| New {minTheta,0,F}  | 2,2            |
+| 05 | ResultDegen {<1.0,0,F} | New {1.0,0,T}          | New {1.0,0,T}        
| New {ThetaA,0,F}    | 1,3            |
+| 06 | ResultDegen {<1.0,0,F} | Exact {1.0,>0,F}       | New {minTheta,0,F}   
| New {ThetaA,0,F}    | 2,3            |
+| 10 | NewDegen {<1.0,0,T}    | ResultDegen {<1.0,0,F} | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 11 | NewDegen {<1.0,0,T}    | NewDegen {<1.0,0,T}    | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 12 | NewDegen {<1.0,0,T}    | Estimation {<1.0,>0,F} | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 15 | NewDegen {<1.0,0,T}    | New {1.0,0,T}          | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 16 | NewDegen {<1.0,0,T}    | Exact {1.0,>0,F}       | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 20 | Estimation {<1.0,>0,F} | ResultDegen {<1.0,0,F} | New {minTheta,0,F}   
| Trim A by minTheta  | 2,4            |
+| 21 | Estimation {<1.0,>0,F} | NewDegen {<1.0,0,T}    | New {1.0,0,T}        
| Sketch A            | 1,5            |
+| 22 | Estimation {<1.0,>0,F} | Estimation {<1.0,>0,F} | Full Intersect       
| Full AnotB          | 6,7            |
+| 25 | Estimation {<1.0,>0,F} | New {1.0,0,T}          | New {1.0,0,T}        
| Sketch A            | 1,5            |
+| 26 | Estimation {<1.0,>0,F} | Exact {1.0,>0,F}       | Full Intersect       
| Full AnotB          | 6,7            |
+| 50 | New {1.0,0,T}          | ResultDegen {<1.0,0,F} | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 51 | New {1.0,0,T}          | NewDegen {<1.0,0,T}    | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 52 | New {1.0,0,T}          | Estimation {<1.0,>0,F} | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 55 | New {1.0,0,T}          | New {1.0,0,T}          | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 56 | New {1.0,0,T}          | Exact {1.0,>0,F}       | New {1.0,0,T}        
| New {1.0,0,T}       | 1,1            |
+| 60 | Exact {1.0,>0,F}       | ResultDegen {<1.0,0,F} | New {minTheta,0,F}   
| Trim A by minTheta  | 2,4            |
+| 61 | Exact {1.0,>0,F}       | NewDegen {<1.0,0,T}    | New {1.0,0,T}        
| Sketch A            | 1,5            |
+| 62 | Exact {1.0,>0,F}       | Estimation {<1.0,>0,F} | Full Intersect       
| Full AnotB          | 6,7            |
+| 65 | Exact {1.0,>0,F}       | New {1.0,0,T}          | New {1.0,0,T}        
| Sketch A            | 1,5            |
+| 66 | Exact {1.0,>0,F}       | Exact {1.0,>0,F}       | Full Intersect       
| Full AnotB          | 6,7            |
+
+The description of each column:
+
+* ID: two octal digits, the first digit represents the state of Sketch A, the 
second digit represents the state of Sketch B.
+* Sketch A State
+* Sketch B State
+* Intersection Result
+* AnotB Result
+* The octal representation of the Intersection Result followed by the octal 
representation of the AnotB result. The result codes are given by the following 
table:
+
+| Result Action | Result Code | Description            | 
+|:-------------:|:-----------:|:-----------------------|         
+| New{1.0,0,T}  |     1       | New empty sketch       |
+| New{min,0,F}  |     2       | Min=min(thetaA,thetaB) |
+| New{thA,0,F}  |     3       | thA=theta of A         |
+| SkA Min       |     4       | Trim A by minTheta     |
+| Sketch A      |     5       | Sketch A exactly       |
+| Full Inter    |     6       | Full intersect         |
+| Full AnotB    |     7       | Full AnotB             |
+
+
+Note that the results of a *Full Intersect* or a *Full AnotB* will require 
further interpretation of the resulting state.
+For example, if the resulting sketch is *{1.0,0,?}*, then a *New{1.0,0,T}* is 
returned. 
+If the resulting sketch is *{<1.0,0,?}* then a *ResultDegen{<1.0,0,F}* is 
returned.  
+Otherwise, the sketch returned will be an estimating or exact *{theta, >0, F}*.
+
+## Testing
+The above information is encoded as a model into the special class 
*org.apache.datasketches.SetOperationCornerCases.java*. This class is made up 
of enums and static methods to quickly determine for a sketch what actions to 
take based on the state of the input arguments.  This model is independent of 
the implementation of the Theta Sketch, whether the set operation is performed 
as a Theta Sketch, or a Tuple Sketch and when translated can be used in other 
languages as well.  
+
+Before this model was put to use an extensive set of tests was designed to 
test any potential implementation against this model.  These tests are slightly 
different for the Tuple Sketch than the Theta Sketch because the Tuple Sketch 
has more combinations to test, but the model is the same.
+
+* The tests for the Theta Sketch can be found in the class 
*org.apache.datasketches.theta.CornerCaseThetaSetOperationsTest.java*
+* The tests for the Tuple Sketch can be found in the class 
*org.apache.datasketches.tuple.aninteger.CornerCaseTupleSetOperationsTest.java*
+
+The details of how this mode is used in run-time code can be found in the 
class *org.apache.datasketches.tuple.AnotB.java*.
+
+

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to