[incubator-datasketches-website] branch master updated: Update Downloads page and Major Sketch Families page.

leerho Sat, 25 Jan 2020 17:46:25 -0800

This is an automated email from the ASF dual-hosted git repository.

leerho pushed a commit to branch master
in repository 
https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git



The following commit(s) were added to refs/heads/master by this push:
     new 1be1c8a  Update Downloads page and Major Sketch Families page.
1be1c8a is described below

commit 1be1c8af77b4e0821ebf4bec1d9320695cdda682
Author: Lee Rhodes <[email protected]>
AuthorDate: Sat Jan 25 17:45:14 2020 -0800

    Update Downloads page and Major Sketch Families page.
---
 docs/MajorSketchFamilies.md | 71 +++++++++++++++++++++++++++++++++++----------
 docs/downloads.md           | 19 +++++++-----
 2 files changed, 68 insertions(+), 22 deletions(-)

diff --git a/docs/MajorSketchFamilies.md b/docs/MajorSketchFamilies.md
index c4c0a58..a4c3fc8 100644
--- a/docs/MajorSketchFamilies.md
+++ b/docs/MajorSketchFamilies.md
@@ -19,12 +19,52 @@ layout: doc_page
     specific language governing permissions and limitations
     under the License.
 -->
-# Cardinality Sketches
+# Sketch Capability Matrix
 
-## CPC Sketch: Estimating Stream Cardinalities more efficiently than the 
famous HLL sketch!
+<table>
+<tr style="font-weight:bold"><td colspan=2></td><td 
colspan=3>Languages</td><td colspan=4>Set Operations</td><td colspan=4>System 
Integrations</td><td colspan=5>Misc.</td></tr>
+
+<tr 
style="font-weight:bold"><td>Type</td><td>Sketch</td><td>Java</td><td>C++</td><td>Python</td><td>Union</td><td>Intersection</td><td>Difference</td><td>Jaccard</td><td>Hive</td><td>Pig</td><td>Druid<sup>1</sup></td><td>Spark<sup>2</sup></td><td>Concurrent</td><td>Compact</td><td>Java
 Generics</td><td>Off-Heap</td><td>Error Bounds</td></tr>
+
+<tr style="font-weight:bold"><td colspan=18>Major Sketches</td></tr>
+<tr><td>Cardinality/FM85</td><td>CpcSketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td></td><td></td><td></td><td>Y</td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/FM85</td><td>HllSketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td>Y</td></tr>
+<tr><td>Cardinality/Theta</td><td>Sketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td>Y</td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>Sketch</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td>Y</td></tr>
+<tr><td>Quantiles/Cormode</td><td>DoublesSketch</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td>Y</td><td>Y</td></tr>
+<tr><td>Quantiles/Cormode</td><td>ItemsSketch</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td>Y</td></tr>
+<tr><td>Quantiles/KLL</td><td>FloatsSketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Frequencies</td><td>LongsSketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Frequencies</td><td>ItemsSketch</td><td>Y</td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td>Y</td></tr>
+<tr><td>Sampling</td><td>ReservoirLongsSketch</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Sampling</td><td>ReseerviorItemsSketch</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td>Y</td></tr>
+<tr><td>Sampling</td><td>VarOptItemsSketch</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td></td><td></td><td></td><td>Y</td><td></td><td>Y</td></tr>
+
+<tr style="font-weight:bold"><td colspan=18>Specialty Sketches</td></tr>
+
+<tr><td>Cardinality/FM85</td><td>UniqueCountMap</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>FdtSketch</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>ArrayOfDoubles</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td>Y</td><td></td><td>Y</td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>A 
double</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>An 
integer</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>ArrayOfStrings</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+<tr><td>Cardinality/Tuple</td><td>User 
Engagement</td><td>Y</td><td></td><td></td><td>Y</td><td>Y</td><td>Y</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td>Y</td></tr>
+</table>
+
+<sup>1</sup> Integrated into Druid<br>
+<sup>2</sup> Example Code
+
+----
+
+
+# High-Level Descriptions
+
+## Cardinality Sketches
+
+### CPC Sketch: Estimating Stream Cardinalities more efficiently than the 
famous HLL sketch!
 This sketch was developed by the late Keven Lang, our chief scientist at the 
time, is an amazing *tour de force* of scientific design and engineering and 
has substantially better accuracy / per stored size than the famous HLL sketch. 
The theory and demonstration of its performance is detailed in Lang's paper 
[Back to the Future: an Even More Nearly Optimal Cardinality Estimation 
Algorithm](https://arxiv.org/abs/1708.06839).  This sketch is available in 
Java, C++ and Python. 
 
-## [Theta Sketches]({{site.docs_dir}}/Theta/ThetaSketchFramework.html): 
Estimating Stream Expression Cardinalities
+### [Theta Sketches]({{site.docs_dir}}/Theta/ThetaSketchFramework.html): 
Estimating Stream Expression Cardinalities
 Internet content, search and media companies like Yahoo, Google, Facebook, 
etc., collect many tens of billions of event records from the many millions of 
users to their web sites each day.  These events can be classified by many 
different dimensions, such as the page visited and user location and profile 
information.  Each event also contains some unique identifiers associated with 
the user, specific device (cell phone, tablet, or computer) and the web browser 
used.  
 
 <img class="doc-img-full" src="{{site.docs_img_dir}}/PeopleCloud.png" 
alt="PeopleCloud" />
@@ -36,7 +76,7 @@ However, if an approximate answer to these problems is 
acceptable, [Theta Sketch
 
 The 
[theta/Sketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/theta/Sketch.java)
 can operate both on-heap and off-heap, has powerful Union, Intersection, AnotB 
and Jaccard operators, has a high-performance concurrent form for 
multi-threaded environments, has both immutable compact, and updatable 
representations, and is quite fast. It is available in Java, C++ and Python. 
Because of its flexibility, it is one of the most popular [...]
 
-## [Tuple Sketches]({{site.docs_dir}}/Tuple/TupleOverview.html): Extending 
Theta Sketches to Perform Associative Analysis 
+### [Tuple Sketches]({{site.docs_dir}}/Tuple/TupleOverview.html): Extending 
Theta Sketches to Perform Associative Analysis 
 It is often not enough to perform stream expressions on sets of unique 
identifiers, it is very valuable to be able to associate additive data with 
these identifiers, such as impression counts, clicks or timestamps.  Tuple 
Sketches are a natural extension of the Theta sketch and have Java Genric 
forms, that enable the user do define the sketch with arbitrary "summary" data. 
 The 
[tuple/Sketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datas
 [...]
 
 The Tuple sketch is effectively infinitely extendable and there are several 
common variants of the Tuple Sketch, which also serve as examples on how to 
extend the base classes, that are also in the library, including:
@@ -44,19 +84,20 @@ The Tuple sketch is effectively infinitely extendable and 
there are several comm
 - 
[tuple/adouble/DoubleSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/adouble/DoubleSketch.java)
 with a single column of *double* values as the *summary*.
 - 
[tuple/aninteger/IntegerSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/aninteger/IntegerSketch.java)
 with a single column of *int* values as the *summary*.
 - 
[tuple/strings/ArrayOfStringsSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/strings/ArrayOfStringsSketch.java),
 which is effectively a variable number of columns of strings as the *summary*.
-- 
[tuple/ArrayOfDoublesSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/ArrayOfDoublesSketch.java),
 which is effectively a variable number of columns of double values as the 
*summary*. This variant also provides both on-heap and off-heap operation.
+- 
[tuple/ArrayOfDoublesSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/ArrayOfDoublesSketch.java),
 which enables the user to specify the number of columns of double values as 
the *summary*. This variant also provides both on-heap and off-heap operation.
+
 
-## [HyperLogLog Sketches]({{site.docs_dir}}/HLL/HLL.html): Estimating Stream 
Cardinalities
+### [HyperLogLog Sketches]({{site.docs_dir}}/HLL/HLL.html): Estimating Stream 
Cardinalities
 The HyperLogLog (HLL) is a cardinality sketch similar to the above Theta 
sketches except they are anywhere from 2 to 16 times smaller in size.  The HLL 
sketches can be Unioned, but set intersection and difference operations are not 
provided intrinsically, because the resulting error would be quite poor.  If 
your application only requires cardinality estimation and Unioning and space is 
at a premium, the HLL sketch provided could be your best choice. 
 
 The 
[hll/HllSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/HllSketch.java)
 can operate both on-heap and off-heap, provides the Union operators, and has 
both immutable compact, and updatable representations. It is available in Java, 
C++ and Python.
 
-## [HyperLogLog Map Sketch]({{site.docs_dir}}/HLL/HllMap.html): Estimating 
Stream Cardinalities of Key-Value Pairs
+### [HyperLogLog Map Sketch]({{site.docs_dir}}/HLL/HllMap.html): Estimating 
Stream Cardinalities of Key-Value Pairs
 This is a specially designed sketch that addresses the problem of individually 
tracking value cardinalities of Key-Value (K,V) pairs in real-time, where the 
number of keys can be very large, such as IP addresses, or Geo identifiers, 
etc. Assigning individual sketches to each key would create unnecessary 
overhead. This sketch streamlines the process with much better space 
management.  This 
[hllmap/UniqueCountMap](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/j
 [...]
 
-# Quantiles Sketches
+## Quantiles Sketches
 
-## [Quantiles Sketches]({{site.docs_dir}}/Quantiles/QuantilesOverview.html): 
Estimating Distributions from a Stream of Values
+### [Quantiles Sketches]({{site.docs_dir}}/Quantiles/QuantilesOverview.html): 
Estimating Distributions from a Stream of Values
 There are many situations where is valuable to understand the distribution of 
values in a stream. For example, from a stream of web-page time-spent values, 
it would be useful to know arbitrary quantiles of the distribution, such as the 
25th percentile value, the median value and the 75th percentile value. The 
[Quantiles Sketches]({{site.docs_dir}}/Quantiles/QuantilesOverview.html) solve 
this problem and enable the inverse functions such as the Probability Mass 
Function (PMF) and the Cumu [...]
 
 <img class="doc-img-full" 
src="{{site.docs_img_dir}}/quantiles/TimeSpentHistogram.png" 
alt="TimeSpentHistogram" />
@@ -65,27 +106,27 @@ There are two different families of quantiles sketches, 
the original [quantiles/
 
 Later we developed the 
[kll/KllFloatsSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllFloatsSketch.java)
  (Named after its authors), which is also a quantiles sketch, that achieves 
near optimal small size for a given accuracy. It is only available on-heap. It 
is available in Java, C++ and Python.
 
-# Frequent Items / Heavy Hitters Sketches
+## Frequent Items / Heavy Hitters Sketches
 
-## [Frequent Items 
Sketches]({{site.docs_dir}}/Frequency/FrequentItemsOverview.html): Finding the 
Heavy Hitter Objects from a Stream
+### [Frequent Items 
Sketches]({{site.docs_dir}}/Frequency/FrequentItemsOverview.html): Finding the 
Heavy Hitter Objects from a Stream
 It is very useful to be able to scan a stream of objects, such as song titles, 
and be able to quickly identify those items that occur most frequently.  The 
term <i>Heavy Hitter</i> is defined to be an item that occurs more frequently 
than some fractional share of the overall count of items
 in the stream including duplicates.  Suppose you have a stream of 1M song 
titles, but in that stream there are only 100K song titles that are unique. If 
any single title consumes more than 10% of the stream elements it is a Heavy 
Hitter, and the 10% is a threshold parameter we call epsilon.
 
 The accuracy of a Frequent Items Sketch is proportional to the configured size 
of the sketch, the larger the sketch, the smaller is the epsilon threshold that 
can detect Heavy Hitters. This sketch is available in two forms, as the 
[frequencies/LongsSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/frequencies/LongsSketch.java)
 used for processing a stream of tuples {*long*, weight}, and the 
[frequencies/ItemsSketch](https://gi [...]
 
-## [Frequent Distinct Tuples 
Sketch]({{site.docs_dir}}/Frequency/FrequentDistinctTuplesSketch.html): Finding 
the Heavy Hitter tuples from a Stream.
+### [Frequent Distinct Tuples 
Sketch]({{site.docs_dir}}/Frequency/FrequentDistinctTuplesSketch.html): Finding 
the Heavy Hitter tuples from a Stream.
 Suppose our data is a stream of pairs {IP address, User ID} and we want to 
identify the IP addresses that have the most distinct User IDs. Or conversely, 
we would like to identify the User IDs that have the most distinct IP 
addresses. This is a common challenge in the analysis of big data and the FDT 
sketch helps solve this problem using probabilistic techniques.
 
 More generally, given a multiset of tuples with *N* dimensions *{d1,d2, d3, …, 
dN}*, and a primary subset of dimensions *M < N*, our task is to identify the 
combinations of *M* subset dimensions that have the most frequent number of 
distinct combinations of the *N - M* non-primary dimensions.
 
 The 
[fdt/FdtSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/fdt/FdtSketch.java)
 is currently only available in Java, but because it is an extension of the 
Tuple Sketch family, it inherits many of the same properties: it can operate 
both on-heap and off-heap, includes both Union and Intersection operators, has 
both immutable compact, and updatable representations.
 
-## Frequent Directions: Distributed, mergeable Singular Value Decomposition 
+### Frequent Directions: Distributed, mergeable Singular Value Decomposition 
 Part of a new separate sketches-vector package, Frequent Directions is in many 
ways a generalization of the Frequent Items sketch to handle vector data. This 
sketch computes an approximate singular value decomposition (SVD) of a matrix, 
providing a projection matrix that can be used for dimensionality reduction. 
SVD is a key technique in many recommender systems, providing shopping 
suggestions based on a customer's past purchases compared with other similar 
customers. This sketch is stil [...]
 
-# Sampling Sketches
+## Sampling Sketches
 
-## [Sampling Sketches]({{site.docs_dir}}/Sampling/ReservoirSampling.html): 
Uniform Sampling of a Stream into a fixed size space
+### [Sampling Sketches]({{site.docs_dir}}/Sampling/ReservoirSampling.html): 
Uniform Sampling of a Stream into a fixed size space
 This family of sketches implements an enhanced version of the famous Reservoir 
sampling algorithm and extends it with the capabilities that large-scale 
distributed systems really need: mergability (even with different sized 
sketches), uses Java Generics so that the base classes can be trivially 
extended for any input type (even polymorphic types), and an extensible means 
of performing serialization and deserialization. 
 
 The 
[sampling/ReservoirLongsSketch](https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/sampling/ReservoirLongsSketch.java)
 accepts a stream of *long* values as identifiers with a weight of one, and 
produces a result Reservoir of a pre-determined size that represents a uniform 
random sample of the stream.
diff --git a/docs/downloads.md b/docs/downloads.md
index 472ffd9..25bedd7 100644
--- a/docs/downloads.md
+++ b/docs/downloads.md
@@ -21,29 +21,34 @@ layout: doc_page
 -->
 ## Downloads
 
+### Download Zip Files
 Choose the most recent release version from 
 
[incubator-datasketches-xxx](https://www.apache.org/dyn/closer.cgi?path=/incubator/datasketches).
 
-Or, clone or fork the current SNAPSHOT directly from the relevant repository.
+### Download Java Jar Files
+From [Maven 
Central](https://repository.apache.org/content/repositories/releases/org/apache/datasketches).
+
+### Download Shapshot Versions
+Clone or fork the current SNAPSHOT directly from the relevant [DataSketches 
repository](https://github.com/apache?utf8=%E2%9C%93&q=datasketches).
 
 ### Version Numbers
 Apache DataSketches uses [semantic versioning](https://semver.org/). Version 
numbers use the form major.minor.incremental and are incremented as follows:
 
-* __major version__ for incompatible API changes
-* __minor version__ for new functionality added in a backward-compatible manner
-* __incremental version__ for forward-compatible bug fixes
+* __major version__ for major new functionality and/or major API changes that 
may be incompatible with prior versions
+* __minor version__ for new functionality and scheduled bug fixes. This should 
be API compatible with prior versions
+* __incremental version__ for unscheduled bug fixes only
 
 The zip files downloaded from 
[incubator-datasketches-xxx](https://www.apache.org/dyn/closer.cgi?path=/incubator/datasketches)
 include a version number in the name, as in 
_apache-datasketches-java-1.1.0-incubating-src.zip_. 
 This same number is also in the top section of the pom.xml file.
 
-If you are developing using Maven and want to use, for example, 
incubator-datasketches-java, add the following dependencies to your pom.xml 
file:
+If you are developing using Maven and want to use, for example, 
datasketches-java, add the following dependencies to your pom.xml file:
 
 ```
 <dependency>
   <groupId>org.apache.datasketches</groupId>
   <artifactId>datasketches-java</artifactId>
-  <version>1.1.0-incubating</version>
+  <version>1.2.0-incubating</version>
 </dependency>
 ```
 
@@ -67,7 +72,7 @@ dependency.
 If you just want to run Hive and don't require direct access to the 
<i>incubator-datasketches-java</i> it is
 recommended that you download the "with-shaded-core.jar", which includes the 
Hive jar as well as 
 shaded versions of the core jar and memory jar. The shading avoids conflicts 
with other possible versions
-of core and memory that you might have in your system.
+of core Java and Memory that you might have in your system.
 
 
 #### SNAPSHOT Jars


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[incubator-datasketches-website] branch master updated: Update Downloads page and Major Sketch Families page.

Reply via email to