This is an automated email from the ASF dual-hosted git repository.

alsay pushed a commit to branch master
in repository 
https://gitbox.apache.org/repos/asf/incubator-datasketches-postgresql.git


The following commit(s) were added to refs/heads/master by this push:
     new d2436ef  updated doc and package
d2436ef is described below

commit d2436efa99476b49ae1e5107243b494c6d93ba33
Author: AlexanderSaydakov <[email protected]>
AuthorDate: Fri Jun 14 16:15:39 2019 -0700

    updated doc and package
---
 META.json            |   8 ++--
 README.md            | 126 +++++++++++++++++++++++++++++++++++++++++++++++++--
 datasketches.control |   2 +-
 3 files changed, 126 insertions(+), 10 deletions(-)

diff --git a/META.json b/META.json
index 51852f5..98a2747 100644
--- a/META.json
+++ b/META.json
@@ -1,7 +1,7 @@
 {
    "name": "datasketches",
    "abstract": "approximate algorithms for big data analysis",
-   "version": "1.1.0",
+   "version": "1.2.0",
    "maintainer": [
      "Alexander Saydakov <[email protected]>",
      "Sketches User List <[email protected]>"
@@ -31,11 +31,11 @@
    },
    "resources": {
       "bugtracker": {
-         "web": "https://github.com/DataSketches/sketches-postgres/issues";
+         "web": 
"https://github.com/apache/incubator-datasketches-postgresql/issues";
       },
       "repository": {
-        "url":  "https://github.com/DataSketches/sketches-postgres.git";,
-        "web":  "https://github.com/DataSketches/sketches-postgres";,
+        "url":  
"https://github.com/apache/incubator-datasketches-postgresql.git";,
+        "web":  "https://github.com/apache/incubator-datasketches-postgresql";,
         "type": "git"
       }
    },
diff --git a/README.md b/README.md
index cedf49d..ed9bf43 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,12 @@
-Module for PostgreSQL to support approximate algorithms based on the 
Datasketches core library sketches-core-cpp.
-See https://datasketches.github.io/ for details.
+Module for PostgreSQL to support approximate algorithms based on the 
Datasketches core library datasketches-cpp.
+See [Datasketches documentation](https://datasketches.github.io/) for details.
 
-This module currently supports two sketches:
+This module currently supports the following sketches:
 
 - CPC (Compressed Probabilistic Counting) sketch - very compact (when 
serialized) distinct-counting sketch
+- Theta sketch - distinct counting with set operations (intersection, a-not-b)
 - KLL float quantiles sketch - for estimating distributions: quantile, rank, 
PMF (histogram), CDF
+- Frequent strings sketch - capture the heaviest items (strings) by count or 
by some other weight
 
 <h1>Examples</h1>
 
@@ -14,7 +16,7 @@ Suppose 100 million random integer values uniformly 
distributed in the range fro
 
 Exact count distinct:
 
-       $ time psql test -c "select count(distinct id) from random_ints_100m;"
+       $ time psql test -c "select count(distinct id) from random_ints_100m"
          count
        ----------
         63208457
@@ -24,7 +26,7 @@ Exact count distinct:
 
 Approximate count distinct:
 
-       $ time psql test -c "select cpc_sketch_distinct(id) from 
random_ints_100m;"
+       $ time psql test -c "select cpc_sketch_distinct(id) from 
random_ints_100m"
         cpc_sketch_distinct 
        ---------------------
            63423695.9451363
@@ -45,6 +47,60 @@ Merging sketches:
        -------------------------
                3.00024414612919
 
+<h2>Distinct counting with Theta sketch</h2>
+
+See above for the exact distinct count of 100 million random integers
+
+Approximate distinct count:
+
+       $ time psql test -c "select theta_sketch_distinct(id) from 
random_ints_100m"
+        theta_sketch_distinct 
+       -----------------------
+             64593262.4373193
+       (1 row)
+
+       real    0m19.701s
+
+Note that the above one-off distinct count is just to show the basic usage. 
Most importantly, the sketch can be used as an "additive" distinct count metric 
in a data cube.
+
+Aggregate union:
+
+       create table theta_sketch_test(sketch theta_sketch);
+       insert into theta_sketch_test select theta_sketch_build(1);
+       insert into theta_sketch_test select theta_sketch_build(2);
+       insert into theta_sketch_test select theta_sketch_build(3);
+       select theta_sketch_get_estimate(theta_sketch_union(sketch)) from 
theta_sketch_test;
+        theta_sketch_get_estimate 
+       ---------------------------
+                                3
+
+Non-aggregate set operations:
+
+       create table theta_set_op_test(sketch1 theta_sketch, sketch2 
theta_sketch);
+       insert into theta_set_op_test select theta_sketch_build(1), 
theta_sketch_build(1);
+       insert into theta_set_op_test select theta_sketch_build(1), 
theta_sketch_build(2);
+
+       select theta_sketch_get_estimate(theta_sketch_union(sketch1, sketch2)) 
from theta_set_op_test;
+        theta_sketch_get_estimate 
+       ---------------------------
+                                1
+                                2
+       (2 rows)
+
+       select theta_sketch_get_estimate(theta_sketch_intersection(sketch1, 
sketch2)) from theta_set_op_test;
+        theta_sketch_get_estimate 
+       ---------------------------
+                                1
+                                0
+       (2 rows)
+
+       select theta_sketch_get_estimate(theta_sketch_a_not_b(sketch1, 
sketch2)) from theta_set_op_test;
+        theta_sketch_get_estimate 
+       ---------------------------
+                                0
+                                1
+       (2 rows)
+
 <h2>Estimating quantiles, ranks and histograms with KLL sketch</h2>
 
 Table "normal" has 1 million values from the normal distribution with mean=0 
and stddev=1.
@@ -91,3 +147,63 @@ The ARRAY[-2, -1, 0, 1, 2] of 5 split points defines 6 
intervals (bins): (-inf,-
 In this simple example we know the value of N since we constructed this 
sketch, but in a general case sketches are merged across dimensions of data 
hypercube, so the value of N is not known in advance.
 
 Note that the normal distribution was used just to show the basic usage. The 
sketch does not make any assumptions about the distribution.
+
+<h2>Frequent strings</h2>
+
+Consider a numeric Zipfian distribution with parameter alpha=1.1 (high skew)
+and range of 2<sup>13</sup>, so that the number 1 has the highest frequency,
+the number 2 appears substantially less frequently and so on.
+Suppose zipf\_1p1\_8k\_100m table has 100 million random values drawn from such
+a distribution, and the values are converted to strings.
+
+Suppose the goal is to get the most frequent strings from this table. In
+terms of the frequent items sketch we have to choose a threshold. Let's try
+to capture values that repeat more than 1 million times, or more than 1% of
+the 100 million entries in the table. According to the [error 
table](https://datasketches.github.io/docs/Frequency/FrequentItemsErrorTable.html),
+frequent items sketch of size 2<sup>9</sup> must capture all values more
+frequent than about 0.7% of the input.
+
+The following query is to build a sketch with lg_k=9 and get results with
+estimated weight above 1 million using "no false negatives" policy.
+The output format is: value, estimate, lower bound, upper bound.
+
+       $ time psql test -c "select 
frequent_strings_sketch_result_no_false_negatives(frequent_strings_sketch_build(9,
 value), 1000000) from zipf_1p1_8k_100m"
+        frequent_strings_sketch_result_no_false_negatives 
+       ---------------------------------------------------
+        (1,15328953,15209002,15328953)
+        (2,7156065,7036114,7156065)
+        (3,4578361,4458410,4578361)
+        (4,3334808,3214857,3334808)
+        (5,2608563,2488612,2608563)
+        (6,2135715,2015764,2135715)
+        (7,1801961,1682010,1801961)
+        (8,1557433,1437482,1557433)
+        (9,1368446,1248495,1368446)
+        (10,1216532,1096581,1216532)
+        (11,1098304,978353,1098304)
+       (11 rows)
+       
+       real    0m38.178s
+
+Here is an equivalent exact computation:
+
+       $ time psql test -c "select value, weight from (select value, count(*) 
as weight from zipf_1p1_8k_100m group by value) t where weight > 1000000 order 
by weight desc"
+        value |  weight  
+       -------+----------
+        1     | 15328953
+        2     |  7156065
+        3     |  4578361
+        4     |  3334808
+        5     |  2608563
+        6     |  2135715
+        7     |  1801961
+        8     |  1557433
+        9     |  1368446
+        10    |  1216532
+        11    |  1098304
+       (11 rows)
+       
+       real    0m18.362s
+
+In this particular case the exact computation happens to be faster. This is
+just to show the basic usage. Most importantly, the sketch can be used as an 
"additive" metric in a data cube, and can be easily merged across dimensions.
\ No newline at end of file
diff --git a/datasketches.control b/datasketches.control
index 72634af..6fdc30f 100644
--- a/datasketches.control
+++ b/datasketches.control
@@ -1,6 +1,6 @@
 # Datasketches module
 comment = 'Aggregation functions and data types for approximate algorithms.'
-default_version = '1.0.0'
+default_version = '1.2.0'
 relocatable = true
 
 module_pathname = '$libdir/datasketches'


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to