(datasketches-website) branch asf-site updated: Automatic Site Publish by Buildbot

git-site-role Mon, 12 Jan 2026 22:37:05 -0800

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 7044f841 Automatic Site Publish by Buildbot
7044f841 is described below

commit 7044f841035a04ad1d08c96dcc79315a6763218d
Author: buildbot <[email protected]>
AuthorDate: Tue Jan 13 06:36:34 2026 +0000

    Automatic Site Publish by Buildbot
---
 .../docs/Frequency/FrequentDistinctTuplesSketch.html  |  14 ++++++++------
 output/docs/pdf/KevinsLastSketch_FDT_2019.pdf         | Bin 0 -> 2591137 bytes
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/output/docs/Frequency/FrequentDistinctTuplesSketch.html 
b/output/docs/Frequency/FrequentDistinctTuplesSketch.html
index 4e866fc8..c42a9372 100644
--- a/output/docs/Frequency/FrequentDistinctTuplesSketch.html
+++ b/output/docs/Frequency/FrequentDistinctTuplesSketch.html
@@ -335,6 +335,8 @@
 -->
 <h2 id="frequent-distinct-tuples-sketch">Frequent Distinct Tuples Sketch</h2>
 
+<p>See also: <a 
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/KevinsLastSketch_FDT_2019.pdf">FDT:
 Kevin’s Last Sketch 2019</a></p>
+
 <h3 id="the-task">The Task</h3>
 <p>Suppose our data is a stream of pairs {IP address, User ID} and we want to 
identify the IP addresses that
 have the most distinct User IDs.  Or conversely, we would like to identify the 
User IDs that have the 
@@ -354,7 +356,7 @@ of the <i>N - M</i> non-primary dimensions.</p>
 </ul>
 
 <p>Suppose we have a stream of 160 items where the stream consists of four 
item types: A, B, C, and D.
-If the distribution of occurances was shared equally across the four items 
each would
+If the distribution of occurrences was shared equally across the four items 
each would
 occur exactly 40 times or 25% of the total distribution of 160 items. Thus the 
equally distributed
 (or fair share) <i>threshold</i> would be 25% or as a fraction 0.25.</p>
 
@@ -363,7 +365,7 @@ occur exactly 40 times or 25% of the total distribution of 
160 items. Thus the e
 </ul>
 
 <p>We define <i>Most Frequent</i> items as those that consume more than the 
fair share threshold of the
-total occurances (also called the <i>weight</i>) of the entire stream.</p>
+total occurrences (also called the <i>weight</i>) of the entire stream.</p>
 
 <p>Suppose we have a stream of 160 items where the stream consists of four 
item types: A, B, C, and D,
 which have the following frequency distribution:</p>
@@ -379,7 +381,7 @@ which have the following frequency distribution:</p>
 declare C and D in a list of most frequent items since their respective 
frequencies are below 
 the threshold of 40 or 25%.</p>
 
-<p>If all items occured with a frequency of 40, we could not declare 
+<p>If all items occurred with a frequency of 40, we could not declare 
 any item as most frequent. Requesting a list of the “Top 4” items could be a 
list of the 4 items in any random
 order, or a list of zero items, depending on policy.</p>
 
@@ -428,7 +430,7 @@ most frequent primary key combination, which means the 
possible existance of fal
 
 <h3 id="using-the-fdtsketch">Using the FdtSketch</h3>
 
-<p>Let’s leverate the challenge at the beginning to crete a concrete example. 
+<p>Let’s leverage the challenge at the beginning to create a concrete example. 
 Let’s assume <i>N = 2</i> and let <i>d1 := IP address</i>, and <i>d2 := User 
ID</i>.</p>
 
 <p>If we choose <i>{d1}</i> as the Primary Keys, then the sketch will allow us 
to identify the
@@ -459,7 +461,7 @@ while (inputStream.hasRemainingItems()) {
 <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>int[] priKeyIndices = new int[] {0}; //identifies the 
IP address as the primary key
 int numStdDev = 2; //for 95% confidence intervals
 int limit = 20; //list only the top 20 groups
-char sep = '|'; //the separator charactor for the group dimensions as strings
+char sep = '|'; //the separator character for the group dimensions as strings
 List&lt;Group&gt; list = sketch.getResult(priKeyIndices, limit, numStdDev, 
sep);
 System.out.println(Group.getHeader())
 Iterator&lt;Group&gt; itr = list.iterator()
@@ -521,7 +523,7 @@ input stream of over 100K groups this graph is a view of 
the top 500, which is m
 
 <p>The blue dots represent the error of a single group from the top 500 
groups. Not all of the top 500 groups are shown on the graph as number of them 
had true cardinalities of less than 256. Also many of the dots represent 
multiple groups since groups with the same Count and the same true cardinality 
will result in the same exact computed error, thus plotted at the same exact 
point.</p>
 
-<p>The red line is the contour of the quantile(0.84) points of the error 
distribution at each point along the X-axis. This quantile contour would be 
equivalent to the +1 standard deviation from the mean of a Gaussian 
distribution. But since these are quantile measurements of the actual error 
distribution there is no assuption whatsoever that the error distribution is 
Gaussian.  It is just a convenient reference contour. Similarly the black line 
is the contour of the quantile(0.159), whic [...]
+<p>The red line is the contour of the quantile(0.84) points of the error 
distribution at each point along the X-axis. This quantile contour would be 
equivalent to the +1 standard deviation from the mean of a Gaussian 
distribution. But since these are quantile measurements of the actual error 
distribution there is no assumption whatsoever that the error distribution is 
Gaussian.  It is just a convenient reference contour. Similarly the black line 
is the contour of the quantile(0.159), whi [...]
 
 <p>The following table is the list of the top 10 results from just one of the 
trials. The Group class was extended to include more columns at the end which 
were useful for this study. (This was easy to do and does not require any 
special access.)</p>
 
diff --git a/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf 
b/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf
new file mode 100644
index 00000000..86a83888
Binary files /dev/null and b/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf 
differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
