This is an automated email from the ASF dual-hosted git repository. leerho pushed a commit to branch check_in_readme2 in repository https://gitbox.apache.org/repos/asf/datasketches-bigquery.git
commit c4694f7ab416d3142bcdce214277cce78a62b173 Author: Lee Rhodes <[email protected]> AuthorDate: Wed Sep 11 17:39:10 2024 -0700 check in README2 --- README2.md | 305 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 305 insertions(+) diff --git a/README2.md b/README2.md new file mode 100644 index 0000000..e05dc61 --- /dev/null +++ b/README2.md @@ -0,0 +1,305 @@ +# Supporting Datasketches in BigQuery + +## Contents +<!-- TOC --> +* [Introduction](#introduction) +* [Featured Sketches in this Integration](#featured-sketches-in-this-integration) +* [Solution Approach](#solution-approach) +* [Theta Sketch](#theta-sketch) + * [Lg_k - Precision parameter](#lgk---precision-parameter) + * [Examples](#examples) +* [KLL Sketch](#kll-sketch) + * [K - Precision Parameter](#k---precision-parameter) + * [Examples](#examples-1) +* [Tuple Sketch](#tuple-sketch) + * [Lg_k - Precision parameter](#lgk---precision-parameter-1) + * [Examples](#examples-2) +<!-- TOC --> + +<a id="introduction"></a> +## Introduction +This project enhances Google BigQuery by integrating a suite of powerful sketch functions from [Apache DataSketches](https://datasketches.apache.org/), enabling efficient probabilistic data analysis on massive datasets. + +Apache Datasketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods. + +<a id="featured-sketches-in-this-integration"></a> +## Featured Sketches in this Integration + +This project focuses on incorporating the following sketch types + +1. [**Theta Sketch**](#theta-sketch): A sketch ideal for cardinality estimation and set operations (union, intersection, difference). +2. [**KLL Sketch**](#kll-sketch): A sketch designed for quantile estimation. +3. [**Tuple Sketch**](#tuple-sketch): An extension of the Theta Sketch that supports associating values with the estimated unique items. + +<a id="solution-approach"></a> +## Solution Approach + +Custom sketch C++ implementation using Apache Datasketches [C++ core library](https://github.com/apache/datasketches-cpp) is compiled to [WebAssembly](https://webassembly.org/) (WASM) libraries using [emscripten](https://emscripten.org/) toolchain and loaded in BigQuery [JS UDAFs](https://cloud.google.com/bigquery/docs/user-defined-aggregates) and [JS UDFs](https://cloud.google.com/bigquery/docs/user-defined-functions#javascript-udf-structure) + +Note: +- All the functions defined below are deployed in `bqutil.fn` and `bqutil.fn_<bq_region>` public datasets +- You can also deploy these functions in your own dataset. Refer to this [README](../README.md). + +<a id="theta-sketch"></a> +## Theta Sketch +A [Theta Sketch](https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html) is a data structure and algorithm used to perform approximate count discount calculations on a large dataset without having to store them all individually. More details can be found in this [Apache Datasketch](https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html) public doc. + +| Type | Function Spec | +|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Aggregate | **FunctionName**: [theta_sketch_int64(id_col, lg_k)](../community/theta_sketch_int64.sqlx) <br> **Input**: Id_col -> INT64, lg_k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates id_col, log_k args and returns a theta sketch. | +| Aggregate | **FunctionName**: [theta_sketch_bytes(bytes_col, lg_k)](../community/theta_sketch_bytes.sqlx) <br> **Input**: Bytes_col -> BYTES, lg_k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates bytes_col, log_k args and returns a theta sketch. <br> **Note**: This function can also be used for initializing theta sketch for string cols. Just CAST( string_col AS BYTES FORMAT 'UTF-8') and pass to this function | +| Aggregate | **FunctionName**: [theta_sketch_union(theta_sketch, lg_k)](../community/theta_sketch_union.sqlx) <br> **Input**: Sketch Bytes, lg_k -> INT64 (constant) <br> **Output**: sketch Bytes<br> **Description**: Aggregates multiple theta sketches, performs a union op and returns a merged theta sketch | +| Aggregate | **FunctionName**: [theta_sketch_intersection(theta_sketch)](../community/theta_sketch_intersection.sqlx) <br> **Input**: Sketch Bytes <br> **Output**: sketch Bytes<br> **Description**: Aggregates multiple theta sketches, performs an intersection op and returns a merged theta sketch | +| Scalar | **FunctionName**: [theta_sketch_a_not_b(theta_sketch_a, theta_sketch_b)](../community/theta_sketch_a_not_b.sqlx) <br> **Input**: sketch_a -> BYTES, sketch_b -> BYTES <br> **Output**: sketch Bytes<br> **Description**: Takes in 2 theta sketches, performs a difference op / a_not_b op (i.e SetA - SetB) and returns a theta_sketch | +| Scalar | **FunctionName**: [theta_sketch_extract(theta_sketch)](../community/theta_sketch_extract.sqlx) <br> **Input**: theta_sketch -> Bytes <br> **Output**: FLOAT64 <br> **Description**: Takes in a theta sketch, returns approx distinct count of entries of the id_col used to create the theta sketch + + +<a id="lgk---precision-parameter"></a> +### Lg_k - Precision parameter + +Lg_k is one of the parameters in select theta sketch functions. +Choice of lg_k int64 parameter is important as it has direct correlation between your query slot usage, theta_sketch size and relative error. +- Lg_k vs K vs Relative error comparison can be found in [this public doc](https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) +- theta_sketch size vs K comparison can be found in [this public doc](https://datasketches.apache.org/docs/Theta/ThetaSize.html) + +<a id="examples"></a> +### Examples + +1. Generates 2 tables with sample data ( treated as SetA, setB) + +```sql +-- Generate ints from [1, 5M] +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.set_a AS +SELECT + CONCAT("group_key_", CAST(RAND()*10 AS INT64)) as group_key, + 1000000*x2 + x1 as x +from UNNEST(GENERATE_ARRAY(1,1000000)) as x1, + UNNEST(GENERATE_ARRAY(0,4)) as x2; + +-- Generate ints from [4M, 7M] +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.set_b AS +SELECT + CONCAT("group_key_", CAST(RAND()*10 AS INT64)) as group_key, + 1000000*x2 + x1 as x +from UNNEST(GENERATE_ARRAY(1,1000000)) as x1, + UNNEST(GENERATE_ARRAY(4,6)) as x2; +``` +2.Groups on a certain key and aggregates theta_sketches per key +```sql +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.theta_agg_by_key_set_a AS +select + group_key, + count(*) as total_count, + bqutil.fn.theta_sketch_int64(x,14) as theta_sketch +from `$BQ_PROJECT.$BQ_DATASET`.set_a +group by group_key; + +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.theta_agg_by_key_set_b AS +select + group_key, + count(*) as total_count, + bqutil.fn.theta_sketch_int64(x,14) as theta_sketch +from `$BQ_PROJECT.$BQ_DATASET`.set_b +group by group_key; +``` + +4. Merges those grouped theta_sketches, find unique counts for Union, Intersection, AnotB, BnotA set operations. +```sql +WITH + union_data_set_a AS ( + SELECT + bqutil.fn.theta_sketch_union(theta_sketch,14) as merged_theta_sketch, + SUM(total_count) as total_count + from `$BQ_PROJECT.$BQ_DATASET`.theta_agg_by_key_set_a + ), + union_data_set_b AS ( + SELECT + bqutil.fn.theta_sketch_union(theta_sketch,14) as merged_theta_sketch, + SUM(total_count) as total_count + from `$BQ_PROJECT.$BQ_DATASET`.theta_agg_by_key_set_b + ), + agg_data_set_a_b AS ( + SELECT + bqutil.fn.theta_sketch_intersection(merged_theta_sketch) as intersected_theta_sketch, + sum(total_count) as total_count, + bqutil.fn.theta_sketch_union(merged_theta_sketch, 14) as union_theta_sketch + from ( + SELECT + merged_theta_sketch, + total_count + FROM union_data_set_b + UNION ALL + SELECT + merged_theta_sketch, + total_count + FROM union_data_set_a + ) + ), + set_difference AS ( + SELECT + bqutil.fn.theta_sketch_a_not_b(a.merged_theta_sketch, b.merged_theta_sketch) as a_not_b_theta_sketch, + bqutil.fn.theta_sketch_a_not_b(b.merged_theta_sketch, a.merged_theta_sketch) as b_not_a_theta_sketch, + a.total_count as set_a_count, + b.total_count as set_b_count + from union_data_set_a a, union_data_set_b b + ) +select + "SetA = [1, 5M] \nSetB = (4M, 7M]" as set_info, + bqutil.fn.theta_sketch_extract(intersected_theta_sketch) as intersect_uniq_count, + bqutil.fn.theta_sketch_extract(union_theta_sketch) as union_uniq_count, + bqutil.fn.theta_sketch_extract(a_not_b_theta_sketch) as a_not_b_uniq_count, + bqutil.fn.theta_sketch_extract(b_not_a_theta_sketch) as b_not_a_uniq_count, + set_a_count, + set_b_count, + total_count, + intersected_theta_sketch, + union_theta_sketch, + a_not_b_theta_sketch, + b_not_a_theta_sketch, +FROM agg_data_set_a_b, set_difference; +``` + +<a id="kll-sketch"></a> +## KLL Sketch +[KLL sketch](https://datasketches.apache.org/docs/KLL/KLLSketch.html) is a quantile type mergeable streaming sketch algorithm to estimate the distribution of values, and approximately answer queries about quantiles (median, min, max, 95th percentile and such). + +| Type | Function Spec | +|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Aggregate | **FunctionName**: [kll_sketch_int64(id_col, k)](../community/kll_sketch_int64.sqlx) <br> **Input**: Id_col -> INT64, k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates id_col, k args and returns a KLL sketch. | +| Aggregate | **FunctionName**: [kll_sketch_float64(id_col, k)](../community/kll_sketch_float64.sqlx) <br> **Input**: Id_col -> FLOAT64, k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates id_col, k args and returns a KLL sketch. | +| Aggregate | **FunctionName**: [kll_sketch_merge(kll_sketch, k)](../community/kll_sketch_merge.sqlx) <br> **Input**: Bytes_col -> BYTES, k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates KLL sketches, k arg, performs a union op and returns a merged KLL sketch. | +| Scalar | **FunctionName**: [kll_sketch_quantile(kll_sketch, rank)](../community/kll_sketch_quantile.sqlx) <br> **Input**: Sketch Bytes, rank -> FLOAT64 in range [0,1] <br> **Output**: FLOAT64 <br> **Description**: Takes in KLL sketch and rank value and returns quantile value. eg. a rank of 0.5 will return median, rank = 1 returns max | + +<a id="k---precision-parameter"></a> +### K - Precision Parameter +k is one of the arguments in select KLL sketch functions. Choice of k int64 is important as it has direct correlation between your query runtime, slot usage, kll_sketch size and relative error. +- K vs Relative error vs KLL-sketch size comparison can be found in [this public doc](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html) + +<a id="examples-1"></a> +### Examples + +1. Creating sample data with 10B records ( 1 through 10B) split in 100 nearly equal sized groups of 100M values +Note: This sample data generation query might run for a long time ( > 30 mins ) to generate 10B records. Try reducing the size for a faster outcome. +```sql +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.sample_data_10B AS +SELECT +CONCAT("group_key_", CAST(RAND()*100 AS INT64)) as group_key, +1000000*x2 + x1 as x +from UNNEST(GENERATE_ARRAY(1,1000000)) as x1, +UNNEST(GENERATE_ARRAY(0,9999)) as x2; +``` + +2. Creating KLL merge sketches for a group key + +```sql +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.agg_sample_data_10B AS +select +group_key, +count(*) as total_count, +bqutil.fn.kll_sketch_int64(x,250) as kll_sketch +from `$BQ_PROJECT.$BQ_DATASET`.sample_data_10B +group by group_key; +``` + +3. Merge group based sketches into a single sketch and then get approx quantiles + +```sql +WITH +agg_data AS ( +SELECT +bqutil.fn.kll_sketch_merge(kll_sketch,250) as merged_kll_sketch, +SUM(total_count) as total_count +from `$BQ_PROJECT.$BQ_DATASET`.agg_sample_data_10B +) +select +bqutil.fn.kll_sketch_quantile(merged_kll_sketch, 0.0) as mininum, +bqutil.fn.kll_sketch_quantile(merged_kll_sketch, 0.5) as p50, +bqutil.fn.kll_sketch_quantile(merged_kll_sketch, 0.75) as p75, +bqutil.fn.kll_sketch_quantile(merged_kll_sketch, 0.95) as p95, +bqutil.fn.kll_sketch_quantile(merged_kll_sketch, 1.0) as maximum, +merged_kll_sketch, +total_count +FROM agg_data; +``` +<a id="tuple-sketch"></a> +## Tuple Sketch +A [Tuple Sketch](https://datasketches.apache.org/docs/Tuple/TupleOverview.html) is an extension of the [Theta Sketch](https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). Tuple sketches store an additional summary value with each retained entry which makes the sketch ideal for summarizing attributes such as impressions or clicks. Tuple sketches enable set operations over a stream of data, and can also be used for cardinality estimation. + +| Type | Function Spec | +|-----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Aggregate | **FunctionName**: [tuple_sketch_int64(id_col, value_col, lg_k)](../community/tuple_sketch_int64.sqlx) <br> **Input**: Id_col -> INT64, value_col -> INT64, lg_k -> INT64(constant) <br> **Output:** Sketch Bytes <br> **Description:** Aggregates id_col, value_col and log_k args and returns a tuple sketch. | +| Aggregate | **FunctionName**: [tuple_sketch_union(tuple_sketch, lg_k)](../community/tuple_sketch_union.sqlx) <br> **Input**: Sketch Bytes, lg_k -> INT64 (constant) <br> **Output**: sketch Bytes<br> **Description**: Aggregates multiple tuple sketches, performs a union op and returns a merged tuple sketch | +| Scalar | **FunctionName**: [tuple_sketch_extract_count(tuple_sketch)](../community/tuple_sketch_extract_count.sqlx) <br> **Input**: tuple_sketch -> Bytes <br> **Output**: INT64 <br> **Description**: Takes in a tuple sketch, returns approx distinct count of entries of the id_col used to create the tuple sketch | +| Scalar | **FunctionName**: [tuple_sketch_extract_sum(tuple_sketch)](../community/tuple_sketch_extract_sum.sqlx) <br> **Input**: tuple_sketch -> Bytes <br> **Output**: INT64 <br> **Description**: Takes in a tuple sketch, combines the summary values of the value_col (using sum) from the random sample of id_col stored within the Tuple sketch, calculates an estimate that applies to the entire dataset and returns the sum. | +| Scalar | **FunctionName**: [tuple_sketch_extract_avg(tuple_sketch)](../community/tuple_sketch_extract_avg.sqlx) <br> **Input**: tuple_sketch -> Bytes <br> **Output**: INT64 <br> **Description**: Takes in a tuple sketch, combines the summary values of the value_col (using sum) from the random sample of id_col stored within the Tuple sketch, calculates an estimate that applies to the entire dataset and returns average. | +| Scalar | **FunctionName**: [tuple_sketch_extract_summary(tuple_sketch)](../community/tuple_sketch_extract_summary.sqlx) <br> **Input**: tuple_sketch -> Bytes <br> **Output**: STRUCT<key_distinct_count INT64, value_sum INT64, value_avg INT64> <br> **Description**: Takes in a tuple sketch and returns summary of key and value cols i.e struct<uniq_count, sum, avg>. This function combines output of the other 3 scalar functions above. | + +<a id="lgk---precision-parameter-1"></a> +### Lg_k - Precision parameter + +Lg_k is one of the parameters in select tuple sketch functions. +Choice of lg_k int64 parameter is important as it has direct correlation between your query slot usage, tuple_sketch size and relative error. +- Lg_k vs K vs Relative error comparison can be found in [this public doc](https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html) +- theta_sketch size vs K comparison can be found in [this public doc](https://datasketches.apache.org/docs/Theta/ThetaSize.html) + +Note: As there is no dedicated tuple_sketch size vs lg_k table, You can assume tuple_sketch size will be nearly double of the theta_sketch size due to additional summary col maintained per hash key in the sketch. + +<a id="examples-2"></a> +### Examples + +1. Creating sample data with 100M records ( 1 through 100M) split in 10 nearly equal sized groups of 10M values + +```sql +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.sample_data_100M AS +SELECT + CONCAT("group_key_", CAST(RAND()*10 AS INT64)) as group_key, + 1000000*x2 + x1 as user_id, + X2 as clicks +from UNNEST(GENERATE_ARRAY(1,1000000)) as x1, + UNNEST(GENERATE_ARRAY(0,99)) as x2 +``` + +2. Creating Tuple sketches for a group key + +```sql +CREATE OR REPLACE TABLE `$BQ_PROJECT.$BQ_DATASET`.agg_sample_data_100M AS +select + group_key, + count(distinct user_id) as exact_uniq_users_ct, + sum(clicks) as exact_clicks_ct, + bqutil.fn.tuple_sketch_int64(user_id,clicks,14) as tuple_sketch +from `$BQ_PROJECT.$BQ_DATASET`.sample_data_100M +group by group_key; +``` + +3. Calculating tuple_sketch_summary for every grouped tuple_sketch and comparing with exact metrics + +```sql +select + group_key, + exact_uniq_users_ct, + exact_clicks_ct, + FLOOR(exact_clicks_ct/exact_uniq_users_ct) as clicks_avg, + bqutil.fn.tuple_sketch_extract_summary(tuple_sketch) as tuple_sketch_summary +from `$BQ_PROJECT.$BQ_DATASET`.agg_sample_data_100M +``` + +4. Merge group based sketches into a single sketch and then extract relevant metric aggregations like sum, avg and distinct count + +```sql +WITH +agg_data AS ( + SELECT + bqutil.fn.tuple_sketch_union(tuple_sketch,14) as merged_tuple_sketch, + SUM(exact_uniq_users_ct) as total_uniq_users_ct, + from `$BQ_PROJECT.$BQ_DATASET`.agg_sample_data_100M +) +Select + total_uniq_users_ct, + bqutil.fn.tuple_sketch_extract_summary(merged_tuple_sketch) as summary, + bqutil.fn.tuple_sketch_extract_count(merged_tuple_sketch) as approx_dist_user_ct, + bqutil.fn.tuple_sketch_extract_sum(merged_tuple_sketch) as approx_clicks_ct, + bqutil.fn.tuple_sketch_extract_avg(merged_tuple_sketch) as approx_clicks_avg, + merged_tuple_sketch +FROM agg_data; +``` --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
