This is an automated email from the ASF dual-hosted git repository.
leerho pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/datasketches-website.git
The following commit(s) were added to refs/heads/master by this push:
new 8057c258 Update Java version to 6.0.0
8057c258 is described below
commit 8057c258efbf9a4270633c563153808ab9b26e36
Author: Lee Rhodes <[email protected]>
AuthorDate: Sun Apr 28 18:48:19 2024 -0700
Update Java version to 6.0.0
Update website docs wrt quantile sketches.
---
docs/Community/Downloads.md | 2 +-
...uantilesSketch.md => ClassicQuantilesSketch.md} | 2 +-
docs/Quantiles/QuantilesOverview.md | 74 +++---
.../SketchingQuantilesAndRanksTutorial.md | 251 +++++++++++----------
src/main/java/org/apache/datasketches/Files.java | 1 -
src/main/resources/docgen/toc.json | 4 +-
6 files changed, 181 insertions(+), 153 deletions(-)
diff --git a/docs/Community/Downloads.md b/docs/Community/Downloads.md
index 6909c6de..235038de 100644
--- a/docs/Community/Downloads.md
+++ b/docs/Community/Downloads.md
@@ -41,7 +41,7 @@ If you are developing using Maven and want to use, for
example, datasketches-jav
<dependency>
<groupId>org.apache.datasketches</groupId>
<artifactId>datasketches-java</artifactId>
- <version>3.1.0</version>
+ <version>6.0.0</version>
</dependency>
```
diff --git a/docs/Quantiles/OrigQuantilesSketch.md
b/docs/Quantiles/ClassicQuantilesSketch.md
similarity index 99%
rename from docs/Quantiles/OrigQuantilesSketch.md
rename to docs/Quantiles/ClassicQuantilesSketch.md
index b0caf42c..26d969fd 100644
--- a/docs/Quantiles/OrigQuantilesSketch.md
+++ b/docs/Quantiles/ClassicQuantilesSketch.md
@@ -19,7 +19,7 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
-# Original Quantiles Sketch
+# Classic Quantiles Sketch
## Quantiles Sketches Accuracy and Size
Please review the Quantiles
[Definitions]({{site.docs_dir}}/Quantiles/Definitions.html).
diff --git a/docs/Quantiles/QuantilesOverview.md
b/docs/Quantiles/QuantilesOverview.md
index 9cd46dd5..6fa83fb5 100644
--- a/docs/Quantiles/QuantilesOverview.md
+++ b/docs/Quantiles/QuantilesOverview.md
@@ -19,37 +19,49 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
-# Introduction to the 3 Quantiles Sketches
+# Introduction to the Quantile Sketches
-## Quantile-type sketches in the library
+This is a quick overview of the quantiles sketches in the library. Each of
these quantile types may have one or more specific implementaions and some
variation in API depending on the language. Three of the quantile sketches have
mathematically provable error bounds while the fourth is an empirical algorithm.
+The three sketches with mathematically provable error bounds are:
-This is an overview of the three types of quantiles sketches in the library.
Each of these quantile types may have one or more specific implementtaions.
+* The Classic quantile sketch family
+* The KLL quantile sketch family
+* The REQ quantile sketch
-The mathematical error bounds of all the quantile sketches is specified with
respect to rank and not with respect to quantiles. In other words, the
difference between the rank upper bound and the rank lower bound is the
confidence interval and can be expressed as a percent of the overall rank
distribution (which is 1.0) and is the mathematically derived error for a
specific configuration of the sketch.
+The one empirical quantile sketch is the T-Digest sketch.
-Although the quantile upper bound and quantile lower bounds can be
approximately computed from the rank upper bound and rank lower bound, and the
difference between the quantile bounds is also an approximate confidence
interval, the size of the quantile confidence interval may not be meaningful
and is not constrained by the defined error of the sketch.
+The mathematical error bounds of the Classic, KLL and REQ sketches are
specified with respect to rank and not with respect to quantiles.
-These sketches have many parallel methods. Please refer to the individual
Javadocs for more information.
+The T-Digest is empirical and has no mathematical basis for estimating its
error and its results are dependent on the input data. However, for many common
data distributions, it can produce excellent results. Please refer to the
spcific documentation about the T-Digest sketch.
-### The REQ Sketch
+For the Classic and KLL sketches, the difference between the rank upper bound
and the rank lower bound is a 99% confidence interval and is an additive
constant for all normalized ranks between 0.0 and 1.0. The specific error is a
function of the parameter <i>K</i> of the sketch and can be derived from the
sketch. For example, if the rank error for a given K is 1%, then the error of
a result rank of .01 is +/- .01 with a 99% confidence; the error of a result
rank of .99 is +/- .01 with a [...]
-* Java
- * Release 2.0.0, 12 Feb 2021
+The REQ sketch is special in that it's error is also relative to the actual
result rank (thus its name: Relative Error Quantiles). It was designed to
proved very high rank accuacy for either the high end of the range of ranks
(close to 1.0) or, based on the user's choice, the low end of ranks (close to
0.0). Please refer to the spcific documentation about the REQ sketch.
+
+Although upper and lower quantile bounds can be approximately computed from
the upper and lower rank bounds, and the difference between the quantile bounds
is also an approximate confidence interval, the size of the quantile confidence
interval may not be meaningful and is not constrained by the defined rank error
of the sketch.
+
+For example, a 1% rank error at a result rank of .50 corresponds to a rank
range of .49 to .51. However, in the quantile domain, there could be step
function in the values such that the difference in the quantiles corresponding
to ranks .49 and .51 are as large as the the entire dynamic range of all
quantiles. It is possible to retrieve the upper and lower quantile bounds at a
specific rank, but the difference between these quantile bounds may not be
meaningful.
+
+These sketches have many parallel methods. Please refer to the sketch API
documentation by language for more information.
+
+### The Classic Quantiles Sketch
+
+* Java
+ * Release 0.3.0, 25 Jan 2016
* Repo: <https://github.com/apache/datasketches-java>
- * Package: org.apache.datasketches.req
+ * Package: org.apache.datasketches.quantiles
* C++, Python
- * Release 2.2.0, Soon
+ * Release 1.0.0, 17 Sep 2019
* Repo: <https://github.com/apache/datasketches-cpp>
- * Directory: req
* Key Features (both Java & C++)
- * Accuracy %: a function of *K* and **relative** with respect to rank.
The user can select either High Rank Accuracy (HRA) or Low Rank Accuracy (LRA).
This enables extremely high accuracy for the ends of the rank domain. E.g.,
99.999%ile quantiles.
- * User selectable comparison QuantileSearchCriteria:
+ * User selectable search criteria QuantileSearchCriteria:
* Exclusive, which is compatible with the KLL and older
Quantiles Sketch
- * Inclusive, a common definition in some of the theoretical
literature.
- * Java: Dedicated *float* implementation.
- * C++: Template implementation for arbitrary comparible types.
- * Python: Dedicated *float* and *integer* implementations.
+ * Inclusive, a common definition in some of the theoretical
literature.
+ * Accuracy %: a function of *K* and independent of rank.
+ * Dedicated *double* implentation, which can be configured for off-heap
operation.
+ * Generic implementation for arbitrary comparible types.
+
### The KLL Sketch
@@ -71,25 +83,25 @@ These sketches have many parallel methods. Please refer to
the individual Javado
* C++: Template implementation for arbitrary comparible types.
* Python: Dedicated *float* and *integer* implementations
+### The REQ Sketch
-### The Quantiles Sketch
-
-* Java
- * Release 0.3.0, 25 Jan 2016
+* Java
+ * Release 2.0.0, 12 Feb 2021
* Repo: <https://github.com/apache/datasketches-java>
- * Package: org.apache.datasketches.quantiles
+ * Package: org.apache.datasketches.req
* C++, Python
- * Release 1.0.0, 17 Sep 2019
+ * Release 2.2.0, Soon
* Repo: <https://github.com/apache/datasketches-cpp>
- * Directory:
+ * Directory: req
* Key Features (both Java & C++)
- * User selectable comparison QuantileSearchCriteria:
+ * Accuracy %: a function of *K* and **relative** with respect to rank.
The user can select either High Rank Accuracy (HRA) or Low Rank Accuracy (LRA).
This enables extremely high accuracy for the ends of the rank domain. E.g.,
99.999%ile quantiles.
+ * User selectable comparison QuantileSearchCriteria:
* Exclusive, which is compatible with the KLL and older
Quantiles Sketch
- * Inclusive, a common definition in some of the theoretical
literature.
- * Accuracy %: a function of *K* and independent of rank.
- * Dedicated *double* implentation.
- * java generic implementation for arbitrary comparible types.
- * The dedicated *double* implementation can be configured for off-heap
operation.
+ * Inclusive, a common definition in some of the theoretical
literature.
+ * Java: Dedicated *float* implementation.
+ * C++: Template implementation for arbitrary comparible types.
+ * Python: Dedicated *float* and *integer* implementations.
+
diff --git a/docs/Quantiles/SketchingQuantilesAndRanksTutorial.md
b/docs/Quantiles/SketchingQuantilesAndRanksTutorial.md
index 3627d8c9..ccad2826 100644
--- a/docs/Quantiles/SketchingQuantilesAndRanksTutorial.md
+++ b/docs/Quantiles/SketchingQuantilesAndRanksTutorial.md
@@ -36,19 +36,22 @@ The actual enumeration can be done in several ways, but for
our use here we will
* The **natural rank** is a **natural number** from the set of one-based,
natural numbers, ℕ<sub>1</sub>, and is derived by enumerating an ordered
set of values, starting with the value 1, up to *n*, the number of values in
the original set.
-* The ***normalized rank*** is a number between 0.0 and 1.0 computed by
dividing the *natural rank* by the total number of values in the set, *n*.
Thus, for finite sets, any *normalized rank* is in the range (0, 1]. Normalized
ranks are often written as a percent. But don't confuse percent with
percentile! This will be explained below.
+* The ***normalized rank*** is a number between 0.0 and 1.0 computed by
dividing the *natural rank* by the total number of values in the set, *n*.
Thus, for finite sets, any *normalized rank* is in the range (0, 1]. Normalized
ranks are often written as a percent.
+For example, the value associated with the normalized rank of 0.5 is the
median of the distribution and is also the value at the 50th percentile of the
distribution.
* A rank of 0, whether natural or normalized, represents the empty set.
-In our sketch library and documentation, when we refer to *rank*, we imply
*normalized rank*. However, in this tutorial, we will sometimes use *natural
ranks* to simplify the examples.
+In our sketch library APIs and documentation, when we refer to *rank*, we
imply *normalized rank*, because *normalized ranks* can be specified
independent of *n*. However, in this tutorial, we will sometimes use *natural
ranks* to simplify the examples.
### Rank and Mass
-*Normalized rank* is closely associated with the concept of *mass*. The value
associated with the rank 0.5 represents the median value, or the center of
*mass* of the entire set, where half of the values are below the median and
half are above. The concept of mass is important to understanding the
Probability Mass Function (PMF) offered by all the quantile sketches in the
library.
+*Normalized rank* is closely associated with the concept of *mass*. The value
associated with the *normalized rank* 0.5 represents the median value, or the
center of *mass* of the entire set, where half of the values are below the
median and half are above. The concept of mass is important to understanding
the Probability Mass Function (PMF) offered by most of the quantile sketches in
the library.
## What is a quantile?
### A ***quantile*** is a ***value*** that is associated with a particular
***rank***.
+When an ordered set of values is associated with their ranks it becomes
enumerated and we refer to the set of values as *quantiles*. In this context
we can think of a quantile and its rank as a pair *{quantile, rank}* and an
ordered set of these pairs as a *cumulative distribution*, or CD. When we add
the capability to search to this CD, it becomes a *cumulative distribution
function*, or CDF. We can query this CDF by rank to find its associated
quantile, *q = Q(r)*, or we can query th [...]
+
*Quantile* is the general term that includes other terms that are also
quantiles.
To wit:
@@ -70,28 +73,31 @@ The term "value" can be ambiguous because items that we
stream into a sketch are
Let's define the simple functions
-### ***quantile(rank)*** or ***q(r)*** := return the quantile ***q***
associated with a given ***rank, r***.
+### ***quantile(rank)*** or ***Q(r)*** := return the quantile ***q***
associated with a given ***rank, r***.
-### ***rank(quantile)*** or ***r(q)*** := return the rank ***r*** associated
with a given ***quantile, q***.
+### ***rank(quantile)*** or ***R(q)*** := return the rank ***r*** associated
with a given ***quantile, q***.
Using an example from the table:
* Using natural ranks:
- * *q(3) = 30*
- * *r(30) = 3*
+ * *Q(3) = 30*
+ * *R(30) = 3*
* Using normalized ranks:
- * *q(.6) = 30*
- * *r(30) = .6*
+ * *Q(.6) = 30*
+ * *R(30) = .6*
Because of the close, two-way relationship of quantiles and ranks,
-*r(q)* and *q(r)* form a *1:1 functional pair* if, and only if
+*R(q)* and *Q(r)* form a *1:1 functional pair* if, and only if
-* *q = q(r(q))*
-* *r = r(q(r))*
+* *q = Q(R(q))*
+* *r = R(Q(r))*
And this is certainly true of the table above.
+
+## NOTE The following part of this tutorial only applies to the Classic, KLL,
and REQ quantile sketches. The T-Digest is an entirely different type of sketch.
+
## The challenge of duplicates
With real data we often encounter duplicate quantiles in the stream. Let's
examine this next table.
@@ -99,7 +105,7 @@ With real data we often encounter duplicate quantiles in the
stream. Let's exami
| ------------ | --- | --- | --- | --- | --- |
| Natural Rank | 1 | 2 | 3 | 4 | 5 |
-As you can see *q(r)* is straightforward. But how about *r(q)*? Which of the
ranks 2, 3, or 4 should the function return, given the quantile 20? Given this
data, and our definitions so far, the function *r(q)* is ambiguous. We will see
how to resolve this shortly.
+As you can see *Q(r)* is straightforward. But how about *R(q)*? Which of the
ranks 2, 3, or 4 should the function return, given the quantile 20? Given this
data, and our definitions so far, the function *R(q)* is ambiguous. We will see
how to resolve this shortly.
## The challenge of approximation
@@ -117,128 +123,171 @@ The sketch might discard the even numbered quantiles
producing something like th
| ------------ | --- | --- | --- | --- | --- |
| Natural Rank | 2 | 4 | 6 | 8 | 10 |
-When the sketch deletes quantiles it adjusts the associated ranks by
effectively increasing the "weight" of adjacent quantiles so that they are
approximately positionally correct and the top natural rank corresponds to *n*.
+When the sketch deletes quantiles it adjusts the associated ranks by
effectively increasing the "weight" of retained quantiles so that they are
approximately positionally correct and the top natural rank corresponds to *n*.
+
+### NOTE: With the DataSketches Java library 6.0.0 and after, the min and max
quantiles of the stream are reinserted in to the Sorted View representation of
the stream if the sketch has deleted them. Therefore, the second table above
would look something like this:
-How do we resolve *q(3)* or *r(20)*?
+| Quantile: | 10 | 30 | 50 | 70 | 90 | 100 |
+| ------------ | --- | --- | --- | --- | --- | --- |
+| Natural Rank | 2 | 4 | 6 | 8 | 9 | 10 |
+
+How do we resolve *Q(3)* or *R(20)*?
## The need for inequality search
-The quantile sketch algorithms discussed in the literature primarily differ by
how they choose which quantiles in the stream should be discarded. After the
elimination process, all of the quantiles sketch implementations are left with
the challenge of how to reconstruct the actual distribution, approximately and
with good accuracy.
+The quantile sketch algorithms discussed in the literature primarily differ by
how they choose which quantiles in the stream should be discarded. After the
elimination process, all of the quantiles sketch implementations are left with
the challenge of how to reconstruct the actual distribution of the stream,
approximately and with good accuracy.
-Given the presence of duplicates and absence of values from the stream we must
redefine the above quantile and rank functions as inequalities **while
retaining the properties of 1:1 functions.**
+Given the presence of duplicates and absence of values from the stream we must
redefine the above quantile and rank functions as inequalities while retaining
the properties of 1:1 functions as closely as possible.**
One can find examples of the following definitions in the research literature.
All of our library quantile sketches allow the user to choose the searching
criteria.
-These next examples use a small data set that mimics what could be the result
of both duplication and sketch data deletion.
-
-## The rules for returned quantiles or ranks
+These next examples use a small data set that mimics what could be the result
of both input duplication and sketch data deletion.
-* **Rule 1:** Every quantile that exists in the input stream or retained by
the sketch has an associated rank.
+## The rules for returned quantiles or ranks for Classic, KLL, and REQ sketches
-* **Rule 2:** All of our quantile sketches only retain quantiles that exist in
the actual input stream of quantiles.
+* **Rule 1:** Every quantile that exists in the input stream or retained by
the sketch has an associated rank when sorted.
-* **Rule 3:** For the *getQuantile(rank)* queries, all of our quantile
sketches only return quantiles that were retained by the sketch. (i.e, we do
not interpolate between quantiles.)
+* **Rule 2:** These quantile sketches retain and return only quantiles that
exist in the actual input stream of quantiles. In other words, these quantile
sketches do not retain or return interpolations beteen quantiles from the input
stream. (The T-Digest sketch does do interpolations.)
-* **Rule 4:** For the *getRank(quantile)* queries, all of our quantile
sketches only return ranks that are associated with quantiles retained by the
sketch. (i.e, we do not interpolate between ranks.)
+* **Rule 3:** For the *getRank(quantile)* queries, all of our quantile
sketches only return ranks that are associated with quantiles retained by the
sketch. In other words, these sketches do not interpolate between ranks.
-* **Rule 5:** All of our quantile algorithms compensate for quantiles removed
during the sketch quantile selection and compression process by increasing the
weights of some of the quantiles not selected for removal, such that:
+* **Rule 4:** The quantile algorithms of these sketches compensate for
quantiles removed during the sketch compacton process by increasing the weights
of the quantiles not selected for removal, such that:
- * The sum of the natural weights of all quantiles retained by the sketch
equals **n**, the total count of all quantiles given to the sketch.
+ * The sum of the individual item weights of all quantiles retained by the
sketch equals **n**, the total count of all quantiles given to the sketch.
* And by corollary, the largest quantile, when sorted by cumulative rank,
has a cumulative natural rank of **n**, or equivalently, a cumulative
normalized rank of **1.0**.
## The rank functions with inequalities
-### ***rank(quantile, INCLUSIVE)*** or ***r(q, LE)*** :=<br>Given *q*, return
the rank, *r*, of the largest quantile that is less than or equal to *q*.
+### ***rank(quantile, INCLUSIVE)*** or ***R(q, LE)*** :=<br>Given *q*, return
the rank, *r*, of the largest quantile that is Less Than or Equal To *q*.
-<b>Implementation:</b>
+*Implementation: Normal Rule*
* Given *q*, search the quantile array until we find the adjacent pair *{q1,
q2}* where *q1 <= q < q2*.
* Return the rank, *r*, associated with *q1*, the first of the pair.
-<b>Boundary Exceptions:</b>
+*Boundary Exceptions:*
-* **Boundary Rule 1:** If the given *q* is *>=* the quantile associated with
the largest cumulative rank retained by the sketch, the function will return
the largest cumulative rank, *1.0*.
-* **Boundary Rule 2:** If the given *q* is *<* the quantile associated with
the smallest cumulative rank retained by the sketch, the function will return a
rank of *0.0*.
+* **Boundary Rule RI-1:** If the given *q* is *>=* the quantile associated
with the largest cumulative rank retained by the sketch, the function will
return the largest cumulative rank, *1.0*.
+* **Boundary Rule RI-2:** If the given *q* is *<* the quantile associated with
the smallest cumulative rank retained by the sketch, the function will return a
rank of *0.0*. There is no quantile that is <= *q*. Returning *0.0* is
essentially the same as *Empty* or *null*.
-<b>Examples using normalized ranks:</b>
+*Examples using normalized ranks. The given value is underlined.*
-* *r(30) = .786* Normal rule applies: *30 <= 30 < 40*, return *r(q1) = .786*.
+* *r(_10_) = .071* Normal rule applies: *10 <= _10_ < 20*, return *r(q1) =
.071*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
-| Quantile input | | | | | | 30 | | |
+| Quantile input | _10_ | | | | | | | |
+| Qualifying pair | q1 | q2 | | | | | | |
+| Rank result | | | | | | | | |
+
+* *r(_30_) = .786* Normal rule applies: *30 <= _30_ < 40*, return *r(q1) =
.786*.
+
+| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
+| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
+| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
+| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
+| Quantile input | | | | | | _30_ | | |
| Qualifying pair | | | | | | q1 | q2 | |
| Rank result | | | | | | .786 | | |
+* *r(_50_) = 1.0* Use Boundary Rule RI-1: *50 <= _50_*, return *1.0*.
+
+| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50
| ? |
+| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
+| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14
| |
+| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0
| |
+| Quantile input | | | | | | | | _50_
| |
+| Qualifying pair | | | | | | | | q1
| (q2) |
+| Rank result | | | | | | | | 1.0
| |
-* *r(55) = 1.0* Use Boundary Rule 1: *50 <= 55*, return *1.0*.
+* *r(_55_) = 1.0* Use Boundary Rule RI-1: *50 <= _55_*, return *1.0*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50
| ? |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14
| |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0
| |
-| Quantile input | | | | | | | | 55
| |
+| Quantile input | | | | | | | | _55_
| |
| Qualifying pair | | | | | | | | q1
| (q2) |
| Rank result | | | | | | | | 1.0
| |
-* *r(5) = 0.0* Use Boundary Rule 2: *5 < 10*, return *0.0*.
+* *r(_5_) = 0.0* Use Boundary Rule RI-2: *_5_ < 10*, return *0.0*.
| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
-| Quantile input | 5 | | | | | | |
| |
+| Quantile input | _5_ | | | | | | |
| |
| Qualifying pair | (q1) | q2 | | | | | |
| |
| Rank result | 0 | | | | | | |
| |
--------
-### ***rank(quantile, EXCLUSIVE)*** or ***r(q, LT)*** :=<br>Given *q*, return
the rank, *r*, of the largest quantile that is strictly *Less Than* *q*.
+### ***rank(quantile, EXCLUSIVE)*** or ***R(q, LT)*** :=<br>Given *q*, return
the rank, *r*, of the largest quantile that is strictly Less Than *q*.
-<b>Implementation:</b>
+*Implementation: Normal Rule*
* Given *q*, search the quantile array until we find the adjacent pair *{q1,
q2}* where *q1 < q <= q2*.
* Return the rank, *r*, associated with *q1*, the first of the pair.
-<b>Boundary Exceptions:</b>
+*Boundary Exceptions:*
-* **Boundary Rule 1:** If the given *q* is *>* the quantile associated with
the largest cumulative rank retained by the sketch, the sketch will return the
the largest cumulative rank, *1.0*.
-* **Boundary Rule 2:** If the given *q* is *<=* the quantile associated with
the smallest cumulative rank retained by the sketch, the sketch will return a
rank of *0.0*.
+* **Boundary Rule RE-1:** If the given *q* is *>* the quantile associated with
the largest cumulative rank retained by the sketch, the sketch will return the
the largest cumulative rank, *1.0*.
+* **Boundary Rule RE-2:** If the given *q* is *<=* the quantile associated
with the smallest cumulative rank retained by the sketch, the sketch will
return a rank of *0.0*.
-<b>Examples using normalized ranks:</b>
+*Examples using normalized ranks. The given value is underlined.*
-* *r(30) = .357* Normal rule applies: *20 < 30 <= 30*, return *r(q1) = .357*.
+* *r(_30_) = .357* Normal rule applies: *20 < _30_ <= 30*, return *r(q1) =
.357*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
-| Quantile input | | | | 30 | | | | |
+| Quantile input | | | _30_ | | | | | |
| Qualifying pair | | | q1 | q2 | | | | |
| Rank result | | | .357 | | | | | |
-* *r(55) = 1.0* Use Boundary Rule 1: *50 < 55*, return *1.0*.
+* *r(_50_) = .929* Normal rule applies: *40 < _50_ <= 50*, return *r(q1) =
.929*.
+
+| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
+| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
+| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
+| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
+| Quantile input | | | | | | | _55_ | |
+| Qualifying pair | | | | | | | q1 | q2 |
+| Rank result | | | | | | | .929 | |
+
+* *r(_55_) = 1.0* Use Boundary Rule RE-1: *50 < _55_*, return *1.0*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50
| ? |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14
| |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0
| |
-| Quantile input | | | | | | | |
| 55 |
+| Quantile input | | | | | | | | _55_
| |
| Qualifying pair | | | | | | | | q1
| (q2) |
| Rank result | | | | | | | | 1.000
| |
-* *r(5) = 0.0* Use Boundary Rule 2: *5 <= 10*, return *0*.
+* *r(_10_) = 0.0* Use Boundary Rule RE-2: *_10_ <= 10*, return *0*.
+
+| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
+| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
+| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
+| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
+| Quantile input | _0.0_| | | | | | |
| |
+| Qualifying pair | (q1) | q2 | | | | | |
| |
+| Rank result | 0 | | | | | | |
| |
+
+* *r(_5_) = 0.0* Use Boundary Rule RE-2: *_5_ <= 10*, return *0*.
| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
-| Quantile input | 5 | | | | | | |
| |
+| Quantile input | _5_ | | | | | | |
| |
| Qualifying pair | (q1) | q2 | | | | | |
| |
| Rank result | 0 | | | | | | |
| |
@@ -246,104 +295,104 @@ These next examples use a small data set that mimics
what could be the result of
## The quantile functions with inequalities
-### ***quantile(rank, INCLUSIVE)*** or ***q(r, GE)*** :=<br>Given *r*, return
the quantile, *q*, of the smallest rank that is strictly Greater than or Equal
to *r*.
+### ***quantile(rank, INCLUSIVE)*** or ***Q(r, GE)*** :=<br>Given *r*, return
the quantile, *q*, of the smallest rank that is strictly Greater Than or Equal
To *r*.
-<b>Implementation:</b>
+*Implementation: Normal Rule*
* Given *r*, search the rank array until we find the adjacent pair *{r1, r2}*
where *r1 < r <= r2*.
* Return the quantile, *q*, associated with *r2*, the second of the pair.
-<b>Boundary Exceptions:</b>
+*Boundary Exceptions:*
-* **Boundary Rule 2:** If the given normalized rank, *r*, is *<=* the smallest
rank, the function will return the **quantile** associated with the smallest
cumulative rank.
+* **Boundary Rule QI-1:** If the given normalized rank, *r*, is *<=* the
smallest rank, the function will return the **quantile** associated with the
smallest cumulative rank.
-<b>Examples using normalized ranks:</b>
+*Examples using normalized ranks. The given value is underlined.*
-* *q(.786) = 30* Normal rule applies: *.643 < .786 <= .786*, return *q(r2) =
30*.
+* *q(_.786_) = 30* Normal rule applies: *.643 < _.786_ <= .786*, return *q(r2)
= 30*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
-| Rank input | | | | | | .786 | | |
+| Rank input | | | | | |_.786_| | |
| Qualifying pair | | | | | r1 | r2 | | |
| Quantile result | | | | | | 30 | | |
-* *q(1.0) = 50* Normal rule applies: *.929 < 1.0 <= 1.0*, return *q(r2) = 50*.
+* *q(_1.0_) = 50* Normal rule applies: *.929 < _1.0_ <= 1.0*, return *q(r2) =
50*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
-| Rank input | | | | | | | | 1.0 |
+| Rank input | | | | | | | | _1.0_ |
| Qualifying pair | | | | | | | r1 | r2 |
| Quantile result | | | | | | | | 50 |
-* *q(0.0 <= .071) = 10* Use Boundary Rule 2: *0.0 <= .071*, return *10*.
+* *q(_0.0_ <= .071) = 10* Use Boundary Rule QI-1: *_0.0_ <= .071*, return *10*.
| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
-| Rank input | 0.0 | | | | | | |
| |
+| Rank input | _0.0_| | | | | | |
| |
| Qualifying pair | (r1) | r2 | | | | | |
| |
| Rank result | | 10 | | | | | |
| |
--------
-### ***quantile(rank, EXCLUSIVE)*** or ***q(r, GT)*** :=<br>Given *r*, return
the quantile, *q*, of the smallest rank that is strictly Greater Than *r*.
+### ***quantile(rank, EXCLUSIVE)*** or ***Q(r, GT)*** :=<br>Given *r*, return
the quantile, *q*, of the smallest rank that is strictly Greater Than *r*.
-<b>Implementation:</b>
+*Implementation: Normal Rule*
* Given *r*, search the rank array until we find the adjacent pair *{r1, r2}*
where *r1 <= r < r2*.
* Return the quantile, *q*, associated with *r2*, the second of the pair.
-<b>Boundary Exceptions:</b>
+*Boundary Exceptions:*
+
+* **Boundary Rule QE-1:** If the given normalized rank, *r*, is equal to 1.0,
there is no rank greater than 1.0, nor a corresponding quantile that satisfies
this criterion. However, for convenience, the function will return the maximum
item of the input stream.
-* **Boundary Rule 1:** If the given normalized rank, *r*, is equal to 1.0,
there is no quantile that satisfies this criterion. However, for convenience,
the function will return quantile associated with the largest cumulative rank
retained by the sketch.
-* **Boundary Rule 2:** If the given normalized rank, *r*, is less than the
smallest rank, the function will return the quantile associated with the
smallest cumulative rank retained by the sketch.
-<b>Examples using normalized ranks:</b>
+*Examples using normalized ranks. The given value is underlined.*
-* *q(.357) = 30* Normal rule applies: *.357 <= .357 < .500*, return *q(r2) =
30*.
+* *q(_.357_) = 30* Normal rule applies: *.357 <= _.357_ < .500*, return *Q(r2)
= 30*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
-| Rank input | | | .357 | | | | | |
+| Rank input | | |_.357_| | | | | |
| Qualifying pair | | | r1 | r2 | | | | |
| Quantile result | | | | 30 | | | | |
-* *q(1.0) = 50* Use Boundary Rule 1 *1.0 <= 1.0 < ?*, return *50*.
+* *q(_1.0_) = 50* Use Boundary Rule QE-1: *_1.0_ = 1.0*, return *50*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50
| ? |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14
| |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0
| |
-| Rank input | | | | | | | | 1.0
| |
+| Rank input | | | | | | | | _1.0_
| |
| Qualifying pair | | | | | | | | r1
| (r2) |
| Quantile result | | | | | | | |
| 50 |
-* *q(0.99) = 50* Normal rule applies *.929 <= .99 < 1.0*, return *50*.
+* *q(_0.99_) = 50* Normal rule applies: *.929 <= _.99_ < 1.0*, return *Q(r2) =
50*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
-| Rank input | | | | | | | .99 | |
+| Rank input | | | | | | | _.99_| |
| Qualifying pair | | | | | | | r1 | r2 |
| Quantile result | | | | | | | | 50 |
-* *q(0.0 <= .071) = 10* Use Boundary Rule 2: *0.0 < .071*, return *10*.
+* *q(_0.0_) = 10* Normal rule applies: *0.0 <= _0.0_ < .071*, return *10*.
| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
-| Rank input | 0.0 | | | | | | |
| |
+| Rank input | _0.0_| | | | | | |
| |
| Qualifying pair | (r1) | r2 | | | | | |
| |
| Rank result | | 10 | | | | | |
| |
@@ -354,72 +403,40 @@ These next examples use a small data set that mimics what
could be the result of
### Note: This rule is marginal in its usefulness so it is not currently
implemented.
-<b>Implementation:</b>
+*Implementation: Normal Rule*
* Given *r*, search the rank array until we find the adjacent pair *{r1, r2}*
where *r1 <= r < r2*.
* Return the quantile, *q*, associated with *r2*, the second of the pair.
-<b>Boundary Exceptions:</b>
+*Boundary Exceptions:*
-* **Boundary Rule 1:** If the given normalized rank, *r*, is equal to *1.0*,
there is no quantile that satisfies this criterion. Return *NaN* or *null*.
-* **Boundary Rule 2:** If the given normalized rank, *r*, is less than the
smallest rank, the function will return the quantile associated with the
smallest cumulative rank retained by the sketch..
+* **Boundary Rule QES-1:** If the given normalized rank, *r*, is equal to
*1.0*, there is no quantile that satisfies this criterion. Return *NaN* or
*null*.
-<b>Examples using normalized ranks:</b>
+*Examples using normalized ranks. The given value is underlined.*
-* *q(.357) = 30* Normal rule applies: *.357 <= .357 < .500*, return *q(r2) =
30*.
+*There is only one case where this strict case would apply:*
-| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
-| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
-| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
-| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.000 |
-| Rank input | | | .357 | | | | | |
-| Qualifying pair | | | r1 | r2 | | | | |
-| Quantile result | | | | 30 | | | | |
-
-* *q(1.0) = 50* Use Boundary Rule 1 *1.0 <= 1.0 < ?*, return *NaN or null*.
+* *q(_1.0_) = Nan or null* Use Boundary Rule QES-1: *_1.0_ = 1.0*, return *NaN
or null*.
| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50
| ? |
| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14
| |
| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0
| |
-| Rank input | | | | | | | | 1.0
| |
+| Rank input | | | | | | | | _1.0_
| |
| Qualifying pair | | | | | | | | r1
| (r2) |
| Quantile result | | | | | | | |
| NaN or null |
-* *q(0.99) = 50* Normal rule applies *.929 <= .99 < 1.0*, return *50*.
-
-| Quantile[]: | 10 | 20 | 20 | 30 | 30 | 30 | 40 | 50 |
-| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- |
-| Natural Rank[]: | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 14 |
-| Normalized Rank[]: | .071 | .214 | .357 | .500 | .643 | .786 | .929 | 1.0 |
-| Rank input | | | | | | | .99 | |
-| Qualifying pair | | | | | | | r1 | r2 |
-| Quantile result | | | | | | | | 50 |
-
-* *q(0.0 <= .071) = 10* Use Boundary Rule 2: *0.0 < .071*, return *10*.
-
-| Quantile[]: | ? | 10 | 20 | 20 | 30 | 30 | 30 | 40
| 50 |
-| ------------------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | -----
| ----- |
-| Natural Rank[]: | | 1 | 3 | 5 | 7 | 9 | 11 | 13
| 14 |
-| Normalized Rank[]: | | .071 | .214 | .357 | .500 | .643 | .786 | .929
| 1.0 |
-| Rank input | 0.0 | | | | | | |
| |
-| Qualifying pair | (r1) | r2 | | | | | |
| |
-| Rank result | | 10 | | | | | |
| |
-
-
## These inequality functions maintain the 1:1 functional relationship,
approximately.
-### The *exclusive* search for q(r) is the inverse of the *exclusive* search
for r(q).
+### The *exclusive* search for Q(r) is the inverse of the *exclusive* search
for R(q).
##### Therefore, *q = q(r(q))* and *r = r(q(r))*.
-### The *inclusive* search for q(r) is the inverse of the *inclusive* search
for r(q).
+### The *inclusive* search for Q(r) is the inverse of the *inclusive* search
for R(q).
##### Therefore, *q = q(r(q))* and *r = r(q(r))*.
## Summary
-The power of these inequality search algorithms is that they produce
repeatable and accurate results, are insensitive to duplicates and sketch
deletions, and maintain the property of 1:1 functions.
-
-
+The power of these inequality search algorithms is that they produce
repeatable and accurate results, are insensitive to duplicates and sketch
deletions, and maintain the property of 1:1 functions approximately.
diff --git a/src/main/java/org/apache/datasketches/Files.java
b/src/main/java/org/apache/datasketches/Files.java
index 0a6916bd..91db7834 100644
--- a/src/main/java/org/apache/datasketches/Files.java
+++ b/src/main/java/org/apache/datasketches/Files.java
@@ -34,7 +34,6 @@ import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.RandomAccessFile;
import java.io.Reader;
-import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
diff --git a/src/main/resources/docgen/toc.json
b/src/main/resources/docgen/toc.json
index 6df97ec0..35c39195 100644
--- a/src/main/resources/docgen/toc.json
+++ b/src/main/resources/docgen/toc.json
@@ -193,10 +193,10 @@
[
{"class":"Doc", "desc" : "Quantiles and Ranks Tutorial",
"dir" : "Quantiles", "file": "SketchingQuantilesAndRanksTutorial"},
{"class":"Doc", "desc" : "Quantiles Overview",
"dir" : "Quantiles", "file": "QuantilesOverview" },
- {"class":"Doc", "desc" : "KLL Floats sketch",
"dir" : "KLL", "file": "KLLSketch" },
+ {"class":"Doc", "desc" : "KLL Sketch Family",
"dir" : "KLL", "file": "KLLSketch" },
{"class":"Doc", "desc" : "KLL Sketch Accuracy and Size",
"dir" : "KLL", "file": "KLLAccuracyAndSize" },
{"class":"Doc", "desc" : "REQ Floats sketch",
"dir" : "REQ", "file": "ReqSketch" },
- {"class":"Doc", "desc" : "Original QuantilesSketch",
"dir" : "Quantiles", "file": "OrigQuantilesSketch" },
+ {"class":"Doc", "desc" : "Classic Quantiles Sketches",
"dir" : "Quantiles", "file": "ClassicQuantilesSketch" },
{ "class":"Dropdown", "desc" : "Quantiles Examples", "array":
[
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]