This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7044f841 Automatic Site Publish by Buildbot
7044f841 is described below
commit 7044f841035a04ad1d08c96dcc79315a6763218d
Author: buildbot <[email protected]>
AuthorDate: Tue Jan 13 06:36:34 2026 +0000
Automatic Site Publish by Buildbot
---
.../docs/Frequency/FrequentDistinctTuplesSketch.html | 14 ++++++++------
output/docs/pdf/KevinsLastSketch_FDT_2019.pdf | Bin 0 -> 2591137 bytes
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/output/docs/Frequency/FrequentDistinctTuplesSketch.html
b/output/docs/Frequency/FrequentDistinctTuplesSketch.html
index 4e866fc8..c42a9372 100644
--- a/output/docs/Frequency/FrequentDistinctTuplesSketch.html
+++ b/output/docs/Frequency/FrequentDistinctTuplesSketch.html
@@ -335,6 +335,8 @@
-->
<h2 id="frequent-distinct-tuples-sketch">Frequent Distinct Tuples Sketch</h2>
+<p>See also: <a
href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/KevinsLastSketch_FDT_2019.pdf">FDT:
Kevin’s Last Sketch 2019</a></p>
+
<h3 id="the-task">The Task</h3>
<p>Suppose our data is a stream of pairs {IP address, User ID} and we want to
identify the IP addresses that
have the most distinct User IDs. Or conversely, we would like to identify the
User IDs that have the
@@ -354,7 +356,7 @@ of the <i>N - M</i> non-primary dimensions.</p>
</ul>
<p>Suppose we have a stream of 160 items where the stream consists of four
item types: A, B, C, and D.
-If the distribution of occurances was shared equally across the four items
each would
+If the distribution of occurrences was shared equally across the four items
each would
occur exactly 40 times or 25% of the total distribution of 160 items. Thus the
equally distributed
(or fair share) <i>threshold</i> would be 25% or as a fraction 0.25.</p>
@@ -363,7 +365,7 @@ occur exactly 40 times or 25% of the total distribution of
160 items. Thus the e
</ul>
<p>We define <i>Most Frequent</i> items as those that consume more than the
fair share threshold of the
-total occurances (also called the <i>weight</i>) of the entire stream.</p>
+total occurrences (also called the <i>weight</i>) of the entire stream.</p>
<p>Suppose we have a stream of 160 items where the stream consists of four
item types: A, B, C, and D,
which have the following frequency distribution:</p>
@@ -379,7 +381,7 @@ which have the following frequency distribution:</p>
declare C and D in a list of most frequent items since their respective
frequencies are below
the threshold of 40 or 25%.</p>
-<p>If all items occured with a frequency of 40, we could not declare
+<p>If all items occurred with a frequency of 40, we could not declare
any item as most frequent. Requesting a list of the “Top 4” items could be a
list of the 4 items in any random
order, or a list of zero items, depending on policy.</p>
@@ -428,7 +430,7 @@ most frequent primary key combination, which means the
possible existance of fal
<h3 id="using-the-fdtsketch">Using the FdtSketch</h3>
-<p>Let’s leverate the challenge at the beginning to crete a concrete example.
+<p>Let’s leverage the challenge at the beginning to create a concrete example.
Let’s assume <i>N = 2</i> and let <i>d1 := IP address</i>, and <i>d2 := User
ID</i>.</p>
<p>If we choose <i>{d1}</i> as the Primary Keys, then the sketch will allow us
to identify the
@@ -459,7 +461,7 @@ while (inputStream.hasRemainingItems()) {
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>int[] priKeyIndices = new int[] {0}; //identifies the
IP address as the primary key
int numStdDev = 2; //for 95% confidence intervals
int limit = 20; //list only the top 20 groups
-char sep = '|'; //the separator charactor for the group dimensions as strings
+char sep = '|'; //the separator character for the group dimensions as strings
List<Group> list = sketch.getResult(priKeyIndices, limit, numStdDev,
sep);
System.out.println(Group.getHeader())
Iterator<Group> itr = list.iterator()
@@ -521,7 +523,7 @@ input stream of over 100K groups this graph is a view of
the top 500, which is m
<p>The blue dots represent the error of a single group from the top 500
groups. Not all of the top 500 groups are shown on the graph as number of them
had true cardinalities of less than 256. Also many of the dots represent
multiple groups since groups with the same Count and the same true cardinality
will result in the same exact computed error, thus plotted at the same exact
point.</p>
-<p>The red line is the contour of the quantile(0.84) points of the error
distribution at each point along the X-axis. This quantile contour would be
equivalent to the +1 standard deviation from the mean of a Gaussian
distribution. But since these are quantile measurements of the actual error
distribution there is no assuption whatsoever that the error distribution is
Gaussian. It is just a convenient reference contour. Similarly the black line
is the contour of the quantile(0.159), whic [...]
+<p>The red line is the contour of the quantile(0.84) points of the error
distribution at each point along the X-axis. This quantile contour would be
equivalent to the +1 standard deviation from the mean of a Gaussian
distribution. But since these are quantile measurements of the actual error
distribution there is no assumption whatsoever that the error distribution is
Gaussian. It is just a convenient reference contour. Similarly the black line
is the contour of the quantile(0.159), whi [...]
<p>The following table is the list of the top 10 results from just one of the
trials. The Group class was extended to include more columns at the end which
were useful for this study. (This was easy to do and does not require any
special access.)</p>
diff --git a/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf
b/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf
new file mode 100644
index 00000000..86a83888
Binary files /dev/null and b/output/docs/pdf/KevinsLastSketch_FDT_2019.pdf
differ
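
Reader's note (not part of the commit): the fair-share threshold arithmetic in the "Most Frequent" passage touched by this diff can be sketched as below. This is a minimal illustration under stated assumptions, not the FdtSketch implementation. The class name FairShareExample and the per-item counts (A=70, B=50, C=25, D=15) are hypothetical. The counts are chosen only so that they sum to 160 and so that C and D fall below the threshold of 40, as the page describes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FairShareExample {

  // Fair-share threshold as a fraction: 1 / (number of distinct item types).
  public static double fairShareThreshold(int numItemTypes) {
    return 1.0 / numItemTypes;
  }

  // Returns the items whose count strictly exceeds the fair share of the
  // total stream weight (the sum of all counts).
  public static List<String> mostFrequent(Map<String, Integer> counts) {
    long total = 0;
    for (int c : counts.values()) {
      total += c;
    }
    double cutoff = total * fairShareThreshold(counts.size());
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > cutoff) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical distribution over a stream of 160 items.
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("A", 70);
    counts.put("B", 50);
    counts.put("C", 25);
    counts.put("D", 15);
    System.out.println(fairShareThreshold(counts.size())); // 0.25
    System.out.println(mostFrequent(counts));              // [A, B]
  }
}
```

With all four counts equal to 40, mostFrequent returns an empty list because no item strictly exceeds the cutoff. That corresponds to the "list of zero items" policy mentioned in the page text.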