This is an automated email from the ASF dual-hosted git repository. charlie pushed a commit to branch lib-comparison in repository https://gitbox.apache.org/repos/asf/datasketches-python.git
commit 8dfa66eaae05589cad2b90c8c8d9ef9a88958fb4 Author: Charlie Dickens <[email protected]> AuthorDate: Thu Nov 16 16:07:54 2023 +0000 Initial error comparison --- jupyter/comparison-to-datasketch/README.md | 37 + .../comparison-to-datasketch/api-comparison.ipynb | 922 +++++++++++++++++++++ .../cardinality_error_experiment.py | 172 ++++ .../single-sketch-accuracy.ipynb | 762 +++++++++++++++++ jupyter/comparison-to-datasketch/utils.py | 37 + 5 files changed, 1930 insertions(+) diff --git a/jupyter/comparison-to-datasketch/README.md b/jupyter/comparison-to-datasketch/README.md new file mode 100644 index 0000000..ed03e61 --- /dev/null +++ b/jupyter/comparison-to-datasketch/README.md @@ -0,0 +1,37 @@ +# A Comparison of Python Sketching Libraries + +These notebooks aim to provide a fair comparison between the Python support for +[Apache Software Foundation (ASF) DataSketches](https://pypi.org/project/datasketches/) and the [datasketch](https://pypi.org/project/datasketch/) +library. +The notebooks are not an attempt to fully characterize either library's implementation, rather to highlight differences that manifest as a result +of design choices for each library. + +## Summary + +As of the time of writing, November 2023, the versions under comparison are: +``` +Name: datasketches +Version: 4.1.0 + +Name: datasketch +Version: 1.6.4 +``` + +## 1. Distinct Counting + +### 1a. HyperLogLog +Each library implements various algorithms that fall under the HyperLogLog umbrella. +They can be summarized as follows + +| | ASF | datasketch | | +|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------------------------------------------------------| +| | HyperLogLog | HyperLogLog | HyperLogLog++ | +| Hash functions | 64 bit MurmurHash | SHA1 32 bit. Others are possible. | SHA1 64 bit. Others are possible. | +| Bucket sizes (bits) | 4, 6, 8 | 4 | 8 | +| Estimator | Exact mode for small cardinalities Historic Inverse Probability (HIP) for a single sketch; Harmonic mean after merging multiple sketches | No exact mode. Harmonic mean for all sketches | Harmonic mean for all sketches (single or merged) with bias corrections at small and large cardinalities | + + +Since `datasketch.hyperloglog.HyperLogLog` uses only $32$ bit hash functions, we do not regard this as an appropriate solution for +industry-applications because the error estimates will vary wildly when the input is large enough. +To ensure the fairest comparison between the sketches, we will only compare the ASF implementations with `datasketch.hyperloglog.HyperLogLogPlusPlus` +with the MurmurHash function over $64$ bits using `mmh3.hash64(x, signed=False)[0]`. \ No newline at end of file diff --git a/jupyter/comparison-to-datasketch/api-comparison.ipynb b/jupyter/comparison-to-datasketch/api-comparison.ipynb new file mode 100644 index 0000000..3d6dabb --- /dev/null +++ b/jupyter/comparison-to-datasketch/api-comparison.ipynb @@ -0,0 +1,922 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "547cbc07", + "metadata": {}, + "source": [ + "# Implementation Comparison ASF DataSketches vs datasketch\n", + "\n", + "We compare implementations of the HyperLogLog algorithm from the [Apache Software Foundation DataSketches](https://github.com/apache/datasketches-cpp/tree/master/python) library and the open-source python [datasketch](https://github.com/ekzhu/datasketch). Both libraries present python implementations or bindings for common \"sketching\" algorithms.\n", + "In the writing we abbreviate the two implementations as `ASF:HLL` for the ASF DataSketches \n", + "and `datasketch:HLL` for the other implementation.\n", + "\n", + "This notebook needs easily available libraries that are available on PyPi." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "7e9b5d90", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "import numpy as np\n", + "import datasketches as asf\n", + "import datasketch as ds\n", + "import mmh3" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "33a4124b", + "metadata": {}, + "outputs": [], + "source": [ + "# Plotting parameters\n", + "method_plot_params = {\n", + " \"asf\" : {\"color\": \"C0\", \"marker\": '.'},\n", + " \"datasketch\" : {\"color\": \"C1\", \"marker\": \"^\"}\n", + "}\n", + "asf_color = method_plot_params[\"asf\"][\"color\"]\n", + "ds__color = method_plot_params[\"datasketch\"][\"color\"]\n", + "q90_ls = \"--\"\n", + "\n", + "params = {'legend.fontsize': 'x-large',\n", + " 'axes.labelsize': 'x-large',\n", + " 'axes.titlesize':'x-large',\n", + " 'xtick.labelsize':'x-large',\n", + " 'ytick.labelsize':'x-large',\n", + " \"lines.linewidth\": 2.5}\n", + "plt.rcParams.update(params)" + ] + }, + { + "cell_type": "markdown", + "id": "11f9a14b", + "metadata": {}, + "source": [ + "## HyperLogLog API\n", + "The two libraries implement HyperLogLog (HLL) slightly differently. We aim to find a baseline that highlights these differences. The following tables summarize the differences.\n", + "\n", + "| Feature | ASF DataSketches |\n", + "| --- | --- |\n", + "| **Type of hash functions** | 64-bit Murmurhash only |\n", + "| **Bucket sizes (bits)** | $4,6$ or $8$ bits. Bucket size does not affect accuracy, only sketch size in bits. | \n", + "|**Estimator** | Smoothly transitions across various estimators for greatest accuracy. Implements a higher accuracy estimator for the \"single\" sketch setting [cite here] |\n", + "\n", + "| Feature | datasketch |\n", + "| --- | --- |\n", + "| **Type of hash functions** | `hashlib` SHA1 by default but any other callable hash functions (e.g. 32 or 64 bit Murmurhash or xxhash) |\n", + "| **Bucket sizes (bits)** | $4$ for $32$ bit hash, $8$ for $64$ bit hash. | \n", + "|**Estimator** | Only the \"standard\" HLL estimators. |\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "9e0d0c7a", + "metadata": {}, + "source": [ + "### <a name=\"error_vs_cardinality\"></a>1. Error vs Cardinality \n", + "We study the error behaviour as the input cardinality increases. We provide an initial comparison to highlight the differences in the estimators.\n", + "These plots are crucial to understanding error distributions of sketches and are not presented in the `datasketch` library documentation.\n", + "\n", + "To generate the synthetic data, we use \"Fibonacci Hashing\" as a cheap way to generate a pseudorandom sequence. This process starts with an initial selection of a $64$-bit integer. Then, for every new item that must be generated, we add the full $64$ bit range scaled by the integer golden ratio so that every other update _intentionally_ overflows and maps once more back into the $64$ bit range.\n", + "\n", + "#### 1a. Single sketch estimation\n", + "\n", + "From the project root run\n", + "\n", + "```./cardinality_error_experiment.py -lgk 14 -lgt 7 -lgN 20``` \n", + "\n", + "which generates $8$ \"plot points\" between every power of $2$ not exceeding $N = 2^{21}$. We fix `-lgt 7` for $128$ trials and use a sketch size of with $2^{14}$. For every trial, an fresh sketch is initialised.\n", + "\n", + "*ASF DataSketches Sketch Setup*\n", + "- `hll = asf.hll_sketch(self.sketch_lgk)`\n", + "\n", + "*datasketch sketch setup*\n", + "- `h = ds.HyperLogLogPlusPlus(p=self.sketch_lgk, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])`\n", + "\n", + "This setting uses the default target type for the datasketches implementation which is $8$ bits per bucket.\n", + "Does HLL bucket size only affect the serialization?" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "1a237439", + "metadata": {}, + "outputs": [], + "source": [ + "asf_errors = pd.read_csv(\"../hll_accuracy_profile_20230504/DataSketches_hll_1516lgK_14_lgT_8trials_256.csv\", index_col=0)\n", + "ds__errors = pd.read_csv(\"../hll_accuracy_profile_20230504/datasketch_hll_1516lgK_14_lgT_8trials_256.csv\", index_col=0)\n", + "\n", + "lgk = 14" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "a542b13c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 0, 'Input cardinality $n$')" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAABCIAAAMKCAYAAABQieWGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdvA4d+m90aHBEJJ6L33XpUioAiIIF2KYvmsrwIqtlfQABYQCUVfRIr03iF0kV4CIRBqAgkphPTM98ewLbubRkjjua8rF5mzZ2bObGaHnWfOeY5GURQFIYQQQgghhBBCiHxgVdANEEIIIYQQQgghxLNDAhFCCCGEEEIIIYTINxKIEEIIIYQQQgghRL6RQIQQQgghhBBCCCHyjQQihBBCCCGEEEIIkW8kECGEEEIIIYQQQoh8I4EIIYQQQgghhBBC5Bubgm6AEMJUSkoKaWlpBd0MIYQQQggh [...] + "text/plain": [ + "<Figure size 1200x800 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig, ax = plt.subplots(figsize=(12,8))\n", + "\n", + "methods = [\"ASF\", \"datasketch\"]\n", + "\n", + "for i, (method, colour, df) in enumerate(zip(methods, [asf_color, ds__color], [asf_errors, ds__errors])):\n", + " xn = df.index \n", + " median = df.median(axis=1)\n", + " q95 = df.quantile(q=0.977725, axis=1)\n", + " q05 = df.quantile(q=0.022275, axis=1) # df.mean(axis=1) - df.std(axis=1)\n", + " ax.plot(xn, median,\n", + " color=colour, label=method+\": median\")\n", + " ax.plot(xn, q95,\n", + " color=colour, linestyle=q90_ls)\n", + " ax.plot(xn, q05,\n", + " color=colour, linestyle=q90_ls, label=method+\": 90% CI\")\n", + "\n", + "ax.plot(xn, 2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n", + " color=\"red\", lw=3., linestyle=\":\", label=\"q = 0.97725\")\n", + "ax.plot(xn, -2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n", + " color=\"red\", lw=3., linestyle=\"-.\", label=\"q=0.2275\")\n", + "\n", + "ax.set_xscale('log', base=10)\n", + "ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),\n", + " ncol=3, fancybox=True)\n", + "ax.grid()\n", + "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n", + "ax.set_xlabel(r\"Input cardinality $n$\")" + ] + }, + { + "cell_type": "markdown", + "id": "ce0cc56d", + "metadata": {}, + "source": [ + "**Experiment Summary**\n", + "\n", + "The plotted red lines represent $\\pm 2 \\sigma$ where, for HLL, the standard error of the estimator is $\\sigma = 1.04 / \\sqrt{k}$. \n", + "Hence, the area between the red lines is a confidence interval. We have chosen the quantiles \n", + "$q = 0.022275, 0.977725$ so that the area between the red lines is approximately a $90\\%$ confidence interval.\n", + "This test uses a number of buckets $k = 2^{14}$. Changing the number of buckets in the sketch alters the standard error so the red lines, and the behaviour of the error curves at quantile level $q$, would change appropriately.\n", + "\n", + "\n", + "Both implementations have a median that is centered about an error of $0$, suggesting that they are unbiased estimators. At small cardinalities, `ASF:HLL` is vastly better than the `datasketch:HLL`. This is because it transitions through various estimators to ensure small error. All of the curves for `ASF:HLL` have zero error until about $n = 10^3$ because it is in sparse mode and a different estimator can be used. Without this behaviour and resorting to the standard change of e [...] + "\n", + "At large cardinalities, _for a single sketch_, `ASF:HLL` uses the HIP estimator which mildly improves on the standard large cardinality estimator. However, when we need to merge sketches later on, this behaviour will revert to the standard HLL estimator.\n", + "\n", + "### 2. Single sketch estimation: Real data\n", + "Now let's see what happens when we want sketch some real data. \n", + "We will download an opensource dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 135, + "id": "91403f05", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " % Total % Received % Xferd Average Speed Time Time Time Current\n", + " Dload Upload Total Spent Left Speed\n", + "100 110M 100 110M 0 0 5252k 0 0:00:21 0:00:21 --:--:-- 6128k258k 0 0:07:18 0:00:01 0:07:17 260k3:51 0:00:02 0:03:49 491k40k 0 0:00:22 0:00:16 0:00:06 6121k\n" + ] + } + ], + "source": [ + "!curl -o bitcoin.zip \"https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip\"" + ] + }, + { + "cell_type": "code", + "execution_count": 137, + "id": "b1ffc601", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Archive: bitcoin.zip\n", + " inflating: BitcoinHeistData.csv \n" + ] + } + ], + "source": [ + "!unzip bitcoin.zip" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "6b139532", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>address</th>\n", + " <th>year</th>\n", + " <th>day</th>\n", + " <th>length</th>\n", + " <th>weight</th>\n", + " <th>count</th>\n", + " <th>looped</th>\n", + " <th>neighbors</th>\n", + " <th>income</th>\n", + " <th>label</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>111K8kZAEnJg245r2cM6y9zgJGHZtJPy6</td>\n", + " <td>2017</td>\n", + " <td>11</td>\n", + " <td>18</td>\n", + " <td>0.008333</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>100050000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>1123pJv8jzeFQaCV4w644pzQJzVWay2zcA</td>\n", + " <td>2016</td>\n", + " <td>132</td>\n", + " <td>44</td>\n", + " <td>0.000244</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>100000000.0</td>\n", + " <td>princetonLocky</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>112536im7hy6wtKbpH1qYDWtTyMRAcA2p7</td>\n", + " <td>2016</td>\n", + " <td>246</td>\n", + " <td>0</td>\n", + " <td>1.000000</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>200000000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7</td>\n", + " <td>2016</td>\n", + " <td>322</td>\n", + " <td>72</td>\n", + " <td>0.003906</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>71200000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>1129TSjKtx65E35GiUo4AYVeyo48twbrGX</td>\n", + " <td>2016</td>\n", + " <td>238</td>\n", + " <td>144</td>\n", + " <td>0.072848</td>\n", + " <td>456</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>200000000.0</td>\n", + " <td>princetonLocky</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " address year day length weight count \n", + "0 111K8kZAEnJg245r2cM6y9zgJGHZtJPy6 2017 11 18 0.008333 1 \\\n", + "1 1123pJv8jzeFQaCV4w644pzQJzVWay2zcA 2016 132 44 0.000244 1 \n", + "2 112536im7hy6wtKbpH1qYDWtTyMRAcA2p7 2016 246 0 1.000000 1 \n", + "3 1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7 2016 322 72 0.003906 1 \n", + "4 1129TSjKtx65E35GiUo4AYVeyo48twbrGX 2016 238 144 0.072848 456 \n", + "\n", + " looped neighbors income label \n", + "0 0 2 100050000.0 princetonCerber \n", + "1 0 1 100000000.0 princetonLocky \n", + "2 0 2 200000000.0 princetonCerber \n", + "3 0 2 71200000.0 princetonCerber \n", + "4 0 1 200000000.0 princetonLocky " + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bitcoin_df = pd.read_csv(\"BitcoinHeistData.csv\", header=0)\n", + "bitcoin_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "e84320f6", + "metadata": {}, + "source": [ + "Let's focus on the simple task of counting how many unique addresses are present in the dataset.\n", + "With native pandas functionality, we see that there are about $2.6$ million unique addresses.\n", + "We will use HLL sketches to estimate this count." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "00a8f7bf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 1.15 s, sys: 89.9 ms, total: 1.24 s\n", + "Wall time: 1.26 s\n" + ] + } + ], + "source": [ + "%%time\n", + "true_count = bitcoin_df[\"address\"].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "30521d71", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "There are 2631095 unique addresses\n" + ] + } + ], + "source": [ + "print(f\"There are {true_count} unique addresses\")" + ] + }, + { + "cell_type": "markdown", + "id": "318fef9e", + "metadata": {}, + "source": [ + "Now define equivalent sketches from both libraries.\n", + "We use $2^{14}$ buckets and $8$-bit HyperLogLog sketches for each implementation. The `datsketch:HLL` uses the MurmurHash library so that we have equivalent sketches for comparison." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "a5ee01e2", + "metadata": {}, + "outputs": [], + "source": [ + "asf_hll = asf.hll_sketch(14, asf.HLL_8)\n", + "ds_hll = ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "3652c114", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 3.93 s, sys: 32 ms, total: 3.96 s\n", + "Wall time: 4.01 s\n" + ] + } + ], + "source": [ + "%%time\n", + "for ad in bitcoin_df[\"address\"]:\n", + " asf_hll.update(ad)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5b59567b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 9.33 s, sys: 36.8 ms, total: 9.37 s\n", + "Wall time: 9.52 s\n" + ] + } + ], + "source": [ + "%%time\n", + "for ad in bitcoin_df[\"address\"]:\n", + " ds_hll.update(ad)" + ] + }, + { + "cell_type": "markdown", + "id": "aedd1cbe", + "metadata": {}, + "source": [ + "On this simple example we see that the datasketches implementation takes about $4$ seconds compared to about $10$ for datasketch. Note that these times are longer than for the native Pandas call to nunique; this is not a problem because, unlike `pd.nunique(.)` the sketches are designed for large datasets not entirely held in memory.\n", + "\n", + "For estimation, we have the following behaviour for the single sketches. Since we have the true count, we can also evaluate the error." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "e2c2fb5c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ASF estimate: 2650083.4660\n", + "ASF error: 0.7217 %\n" + ] + } + ], + "source": [ + "# DataSketches\n", + "print(f\"ASF estimate: {asf_hll.get_estimate():.4f}\")\n", + "print(f\"ASF error: {100*(asf_hll.get_estimate()-true_count)/true_count:.4f} %\")" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "04607338", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "datasketch estimate: 2646133.7361\n", + "datasketch error: 0.5716 %\n" + ] + } + ], + "source": [ + "# Datasketch\n", + "print(f\"datasketch estimate: {ds_hll.count():.4f}\")\n", + "print(f\"datasketch error: {100*(ds_hll.count() - true_count)/true_count:.4f} %\")" + ] + }, + { + "cell_type": "markdown", + "id": "d88fc63f", + "metadata": {}, + "source": [ + "On this example, the datasketch implementation has a lower error than the ASF method. However, this was a single sketch so we cannot draw any strong conclusions. Rather, we would have to study the error distribution as previously done in [Section 1](error_vs_cardinality). \n", + "\n", + "We run $25$ independent trials of each algorithm, each trial with a fresh sketch.\n", + "Since HLL is deterministic given the hash seed, with no change to the input we would obtain the same output every time. To avoid this, we prepend the trial number to every incoming string so that the number of unique items remains the same but the streams are superficially different." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "4e7b941c", + "metadata": {}, + "outputs": [], + "source": [ + "lgk = 14\n", + "num_trials = 25\n", + "\n", + "all_asf_hll = [asf.hll_sketch(14, asf.HLL_8) for _ in range(num_trials)]\n", + "all_ds_hll = [ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0]) for _ in range(num_trials)]\n", + "\n", + "asf_hll_estimates = np.zeros((num_trials,), dtype=float)\n", + "asf_hll_errors = np.zeros_like(asf_hll_estimates)\n", + "ds__hll_estimates = np.zeros_like(asf_hll_estimates)\n", + "ds__hll_errors = np.zeros_like(asf_hll_estimates)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "ec3fb80b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 2min 2s, sys: 1.34 s, total: 2min 3s\n", + "Wall time: 2min 9s\n" + ] + } + ], + "source": [ + "%%time\n", + "for trial in range(num_trials):\n", + " for ad in bitcoin_df[\"address\"]:\n", + " all_asf_hll[trial].update(str(trial) + ad)\n", + " asf_hll_estimates[trial] = all_asf_hll[trial].get_estimate()" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "68f3e787", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 4min 18s, sys: 806 ms, total: 4min 19s\n", + "Wall time: 4min 21s\n" + ] + } + ], + "source": [ + "%%time\n", + "for trial in range(num_trials):\n", + " for i, ad in enumerate(bitcoin_df[\"address\"]):\n", + " all_ds_hll[trial].update(str(trial) + ad)\n", + " ds__hll_estimates[trial] = all_ds_hll[trial].count()" + ] + }, + { + "cell_type": "markdown", + "id": "254e2d14", + "metadata": {}, + "source": [ + "The ASF HLL runs in about half of the time as the datasketch implementation.\n", + "However, we are also interested in the distribution of errors for each sketch implementation.\n", + "Since we have fewer trials than in Section 1, we plot a box and whisker diagram which is still useful in understanding the error distribution, despite being less informative about the full error distribution than the pitchfork plots from Section 1. The plot can be interpreted as a cross-section of the pitchfork plot at the vertical line $n = 2631095$." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "a00a0f70", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnwAAAGiCAYAAAB9MRuUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABz8klEQVR4nO3deVzU1f4/8NewLwqCaIIgCIpSmvsGiYoX3EotSL1XLTVLLVeqn9oiYi7VLU1Ls7TQ1G+ZaIBXI7ioqKAILmk3cYxFRTQXdAZZB/j8/vjIyDggAwwMM/N6Ph48Zuac8znnDePIm/P5fM6RCIIggIiIiIgMlomuAyAiIiKixsWEj4iIiMjAMeEjIiIiMnBM+IiIiIgMHBM+IiIiIgPHhI+IiIjIwDHhIyIiIjJwTPiIiIiIDJyZrgOg5qGiogK5ublo2bIlJBKJrsMhIiIiDQiCgPz8fLi4uMDEpOZ5 [...] + "text/plain": [ + "<Figure size 640x480 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "asf_hll_errors = (asf_hll_estimates - true_count) / true_count\n", + "datasketch_hll_errors = (ds__hll_estimates - true_count) / true_count\n", + "\n", + "for arr in [asf_hll_errors, datasketch_hll_errors]:\n", + " arr.sort()\n", + "\n", + "fig, ax = plt.subplots()\n", + "error_data = {\"ASF\": asf_hll_errors, \"datasketch\": datasketch_hll_errors}\n", + "\n", + "#\n", + "box_plot = ax.boxplot(list(error_data.values()), vert=True)\n", + "for median in box_plot['medians']:\n", + " median.set_color('black')\n", + " \n", + "\n", + "\n", + "ax.scatter(np.ones_like(asf_hll_errors), asf_hll_errors, marker=\".\")\n", + "ax.scatter(2*np.ones_like(datasketch_hll_errors), datasketch_hll_errors, marker=\"x\")\n", + "ax.set_xticks([1,2], list(error_data.keys()))\n", + "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n", + "ax.grid()" + ] + }, + { + "cell_type": "markdown", + "id": "3755cc1f", + "metadata": {}, + "source": [ + "The error statistics from the experiment are as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "a8b6c959", + "metadata": {}, + "outputs": [], + "source": [ + "experiment_error_statistics = {\n", + " \"ASF\": {\"median\" : None, \"IQR\" : None}, \n", + " \"datasketch\": {\"median\" : None, \"IQR\" : None}\n", + "}\n", + "key_list = list(experiment_error_statistics.keys())\n", + "\n", + "for i, m in enumerate(box_plot[\"medians\"]):\n", + " method_median = m.get_data()[0][0]\n", + " experiment_error_statistics[key_list[i]][\"median\"] = method_median\n", + "\n", + "for i, line in enumerate(box_plot[\"boxes\"]):\n", + " liney = line.get_ydata()\n", + " iqr = liney.max() - liney.min()\n", + " experiment_error_statistics[key_list[i]][\"IQR\"] = iqr" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "35b55d5f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ASF': {'median': 0.925, 'IQR': 0.009980189698406661},\n", + " 'datasketch': {'median': 1.925, 'IQR': 0.015456189289630978}}" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "experiment_error_statistics" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "27d5b8ae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Library Median IQR \n", + "------------------------------------\n", + "ASF 0.9250 0.0100 \n", + "datasketch 1.9250 0.0155 \n", + "\n", + "\n", + "\n", + "xChange Median IQR \n", + "------------------------------------\n", + " 0.4805 0.6457 \n" + ] + } + ], + "source": [ + "print(\"{:<12}{:<12}{:<12}\".format(\"Library\", \"Median\", \"IQR\"))\n", + "print(\"-\"*36)\n", + "for k,vd in experiment_error_statistics.items():\n", + " print(\"{:<12}{:<12.4f}{:<12.4f}\".format(k, vd[\"median\"], vd[\"IQR\"]))\n", + " \n", + "print(\"\\n\"*2)\n", + "print(\"{:<12}{:<12}{:<12}\".format(\"xChange\", \"Median\", \"IQR\"))\n", + "print(\"-\"*36)\n", + "median_factor = experiment_error_statistics[\"ASF\"][\"median\"] / experiment_error_statistics[\"datasketch\"][\"median\"]\n", + "iqr_factor = experiment_error_statistics[\"ASF\"][\"IQR\"] / experiment_error_statistics[\"datasketch\"][\"IQR\"]\n", + "print(\"{:<12}{:<12.4f}{:<12.4f}\".format(\"\", median_factor, iqr_factor))" + ] + }, + { + "cell_type": "markdown", + "id": "3352f4ed", + "metadata": {}, + "source": [ + "These tables show that, in this example, the median reported solution using ASF DataSketches HLL is about $50\\%$ closer to the true input cardinality. Meanwhile, the interquartile range is about $65\\%$ smaller when using `ASF:HLL` rather than `datasketch:HLL`. In other words, the error distribution is more tightly concentrated about the true answer." + ] + }, + { + "cell_type": "markdown", + "id": "8a615abf", + "metadata": {}, + "source": [ + "## 3. Other key differences\n", + "\n", + "There are other crucial differences in each library's API of which users should be aware.\n", + "\n", + "1. The `update()` method for `ASF:HLL` accepts inputs as integers, strings, bytes, and floats. On the other hand, `datasketch:HLL` library only accepts byte and string type inputs." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "60df51ec", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "4.000000029802323" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Datasketches HLL can accept multiple inputs\n", + "# These are treated as different items in a single sketch.\n", + "asf_hll_types = asf.hll_sketch(14, asf.HLL_8)\n", + "asf_hll_types.update(1)\n", + "asf_hll_types.update(1.0)\n", + "asf_hll_types.update(str(1))\n", + "\n", + "xx = 1\n", + "xx_bytes = xx.to_bytes(64, \"little\")\n", + "asf_hll_types.update(xx_bytes)\n", + "\n", + "asf_hll_types.get_estimate()" + ] + }, + { + "cell_type": "code", + "execution_count": 344, + "id": "410a3bd4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Exception on integer input\n", + "Exception on float input\n", + "Accepts bytes input\n", + "Accepts string input\n", + "2.000122080247517\n" + ] + } + ], + "source": [ + "# datasketch HLL needs bytes\n", + "dhll_type = ds.HyperLogLogPlusPlus(14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])\n", + "try:\n", + " dhll_type.update(1)\n", + "except:\n", + " print(\"Exception on integer input\")\n", + " \n", + "try:\n", + " dhll_type.update(1.0)\n", + "except:\n", + " print(\"Exception on float input\")\n", + " \n", + "try:\n", + " dhll_type.update(xx_bytes)\n", + " print(\"Accepts bytes input\")\n", + "except:\n", + " print(\"Exception on string input\")\n", + " \n", + "try:\n", + " dhll_type.update(str(1))\n", + " print(\"Accepts string input\")\n", + "except:\n", + " print(\"Exception on string input\")\n", + " \n", + "print(dhll_type.count()) # only two distinct items inserted into the sketch." + ] + }, + { + "cell_type": "markdown", + "id": "08c695a5", + "metadata": {}, + "source": [ + "2. The ASF HLL implementation comes with `get_upper_bound()` and `get_lower_bound()` functions. These enable the user to understand with what confidence. On the other hand, the `datasketch` implementation returns only the estimated count.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "64b95f07", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Lower bound (1 std. dev) as % of true value: 99.8909\n", + "ASF HyperLogLog estimate as % of true value: 100.5406\n", + "Upper bound (1 std. dev) as % of true value: 101.1989\n" + ] + } + ], + "source": [ + "asf_hll_sketch = all_asf_hll[0]\n", + "print(f\"Lower bound (1 std. dev) as % of true value: {(100*asf_hll_sketch.get_lower_bound(1) / true_count):.4f}\")\n", + "print(f\"ASF HyperLogLog estimate as % of true value: {(100*asf_hll_sketch.get_estimate() / true_count):.4f}\")\n", + "print(f\"Upper bound (1 std. dev) as % of true value: {(100*asf_hll_sketch.get_upper_bound(1) / true_count):.4f}\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "9a55fa78", + "metadata": {}, + "source": [ + "3. In Section 2 that the `ASF:HLL` is substantially faster. We will study this more thoroughly in a later notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "62724771", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/jupyter/comparison-to-datasketch/cardinality_error_experiment.py b/jupyter/comparison-to-datasketch/cardinality_error_experiment.py new file mode 100644 index 0000000..114d49c --- /dev/null +++ b/jupyter/comparison-to-datasketch/cardinality_error_experiment.py @@ -0,0 +1,172 @@ +#!/usr/bin/env python + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import sys +import argparse +from datetime import datetime +import pandas as pd +import numpy as np +from utils import distinct_number_sequence +import datasketches as ds +import datasketch as d +import mmh3 +import os +import matplotlib.pyplot as plt + +class ErrorCardinalityProfile: + """Generates an experiment comparing the error over different cardinalities""" + def __init__(self, sketch_lgk:int, lg_trials:int, max_lgN:int): + self.sketch_lgk = sketch_lgk + self.num_trials = 2**lg_trials + self.max_lgN = max_lgN + self.max_num_distincts = np.uint64(2 ** self.max_lgN) + self.directory_name = "hll_accuracy_profile_" + datetime.today().strftime('%Y%m%d') + if not os.path.exists(self.directory_name): + os.mkdir(self.directory_name) + self.file_extension = "_" + datetime.today().strftime('%H%M') + f"lgK_{self.sketch_lgk}_lgT_{lg_trials}" + + # Need to remove repeated items for the program logic in self.run() + self.plot_points = self._generate_plot_points() + self.plot_points.extend(self._generate_plot_points()) + self.plot_points = list(set(self.plot_points)) + self.plot_points.sort() + print(self.plot_points) + + # Initialise the data structures for results + self.DataSketches_results_arr = np.zeros((len(self.plot_points), self.num_trials), dtype=float) + self.datasketch_results_arr = np.zeros_like(self.DataSketches_results_arr) + self.DataSketches_results_df = pd.DataFrame(index=self.plot_points, columns=None) + self.datasketch_results_df = pd.DataFrame(index=self.plot_points, columns=None) + self.data = np.random.randn(len(self.plot_points), self.num_trials) + print("Data shape: ", self.data.shape) + + def _generate_plot_points(self) -> list: + """ + Generates the standard sequence defining the input cardinalites for the experiment + This is just two points at each power of 2 + """ + all_plot_points = [] + for lgk in range(1, self.max_lgN+1): + points = np.unique(np.logspace(start=lgk, stop=lgk+1, num=8, endpoint=False, base=2, dtype=np.uint64)) + all_plot_points.extend(points) + all_plot_points.sort() + return all_plot_points + + def _bad_hll_range(self) -> list: + """Generates 16 logspaced points in the n ≈ 2.5k range.""" + log2_sketch_threshold = np.log2(2.5* (2**self.sketch_lgk)) + start = np.floor(log2_sketch_threshold).astype(np.uint64) + stop = np.ceil(log2_sketch_threshold).astype(np.uint64)+1 + points = np.logspace(start, stop, num=50, endpoint=False, base=2, dtype=np.uint64)[1:] + return points + + def _is_power_of_two(self, a): + """Bitwise operations to check value a is a power of two""" + return (a & (a-1) == 0) and a != 0 + + def _results_to_df(self, start_:int, end_:int, arr:np.array, df:pd.DataFrame): + """Concatenates the array between columns start_,...end_ - 1 to the dataframe""" + new_df = pd.DataFrame(arr[:, start_:end_], index=df.index, columns=np.arange(start_, end_).tolist()) + print("concatenating: ", new_df) + concat_df = pd.concat([df, new_df], axis=1) + return concat_df + + def run(self): + """Runs the experiment""" + seq_start = np.uint64(2345234) + distinct_number = np.uint64(3462) + previous_log_trial_index = 0 + ds_all_results = np.zeros((self.num_trials, len(self.plot_points))) + d_all_results = np.zeros_like(ds_all_results) + + for trial in range(1, self.num_trials+1): + #print(f"Trial = {trial}\t{self._is_power_of_two(trial)}") + # Initialise the sketches + hll = ds.hll_sketch(self.sketch_lgk, ds.HLL_8) + h = d.HyperLogLogPlusPlus(p=self.sketch_lgk, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0]) + plot_point_index = 0 # Return to the start of the plot points list to generate the data + plot_point_value = self.plot_points[plot_point_index] + total_updates = 0 + seq_start += distinct_number # Start a new input sequence + + # Temporary result data structure + ds_results = np.zeros((len(self.plot_points),)) + d_results = np.zeros_like(ds_results) + + + for new_number in distinct_number_sequence(seq_start): + hll.update(new_number) + h.update(str(new_number).encode('utf8')) + total_updates += 1 + #print(f"Trial: {trial:<5} updates: {total_updates:<5}Index: {plot_point_index:<5} Value: {plot_point_value:<5}\n") + if total_updates == plot_point_value: + ds_results[plot_point_index] = (hll.get_estimate() - plot_point_value) / plot_point_value + d_results[plot_point_index] = (h.count() - plot_point_value) / plot_point_value + plot_point_index += 1 + #print(f"PPI:{plot_point_index:<3}PPV:{plot_point_value}") + + if plot_point_index < len(self.plot_points): + plot_point_value = self.plot_points[plot_point_index] + else: + #print("Already at the end") + break + + # After the break statement, control returns here. Now need to decide whether to write or continue. + ds_all_results[trial-1, :] = ds_results # subtract 1 as we use 1-based indexing for the trial count. + d_all_results[trial - 1, :] = d_results # subtract 1 as we use 1-based indexing for the trial count. + if self._is_power_of_two(trial) and trial > 1: + # write the array only a logarithmic number of times + temporary_ds_results = ds_all_results[0:trial, : ] + temporary_d_results = d_all_results[0:trial, :] + print(f"#################### PARTIAL RESULTS FOR {trial} TRIALS: DATASKETCHES ####################") + previous_log_trial_index = trial + self.DataSketches_results_df = pd.DataFrame(temporary_ds_results.T, index=self.DataSketches_results_df.index, columns=np.arange(trial).tolist()) + self.DataSketches_results_df.to_csv( + self.directory_name + "/DataSketches_hll" + self.file_extension + f"trials_{trial}.csv", + index_label="n") + self.datasketch_results_df = pd.DataFrame(temporary_d_results.T, + index=self.datasketch_results_df.index, + columns=np.arange(trial).tolist()) + self.datasketch_results_df.to_csv( + self.directory_name + "/datasketch_hll" + self.file_extension + f"trials_{trial}.csv", + index_label="n" + ) + print(self.DataSketches_results_df) + + print(f"#################### PARTIAL RESULTS FOR {trial} TRIALS: datasketch ####################") + print(self.datasketch_results_df) + + +def main(argv): + assert len(argv) == 3 + SKETCH_LGK = argv['lgk'] + LG_TRIALS = argv['lgt'] + MAX_LG_N = argv['lgN'] # FOR SETTING NUMBER OF DISTINCTS + experiment = ErrorCardinalityProfile(SKETCH_LGK, LG_TRIALS, MAX_LG_N) + experiment.run() + + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='Error-cardinality profile of HyperLogLog') + parser.add_argument('-lgk', '--lgk', help='Log2(k) value for the number of buckets in the sketch.',type=int, required=True) + parser.add_argument('-lgt', '--lgt', help='Log2(T) value for number of trials t.',type=int, required=True) + parser.add_argument('-lgN', '--lgN', help='Largest permissible log2(.) value for input size.',type=int, required=True) + args = vars(parser.parse_args()) + main(args) diff --git a/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb b/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb new file mode 100644 index 0000000..b752c7f --- /dev/null +++ b/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb @@ -0,0 +1,762 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b0bac2b9-77d1-435d-bd66-93c986ccddb6", + "metadata": {}, + "source": [ + "# Single Sketch Error: ASF DataSketches vs datasketch\n", + "\n", + "We compare implementations of the HyperLogLog algorithm from the [Apache Software Foundation DataSketches](https://github.com/apache/datasketches-cpp/tree/master/python) library and the open-source python [datasketch](https://github.com/ekzhu/datasketch). Both libraries present python implementations or bindings for common \"sketching\" algorithms.\n", + "In the writing we abbreviate the two implementations as `ASF:HLL` for the ASF DataSketches \n", + "and `datasketch:HLL` for the other implementation.\n", + "\n", + "This notebook needs easily available libraries that are available on PyPi." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "66bf431c-04bd-412b-8930-3741560ba0a7", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "import numpy as np\n", + "import datasketches as asf\n", + "import datasketch as ds\n", + "import mmh3" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "3d43ab6d-1774-4a37-a91b-817fb498c5a0", + "metadata": {}, + "outputs": [], + "source": [ + "# Plotting parameters\n", + "method_plot_params = {\n", + " \"asf\" : {\"color\": \"C0\", \"marker\": '.'},\n", + " \"datasketch\" : {\"color\": \"C1\", \"marker\": \"^\"}\n", + "}\n", + "asf_color = method_plot_params[\"asf\"][\"color\"]\n", + "ds__color = method_plot_params[\"datasketch\"][\"color\"]\n", + "q90_ls = \"--\"\n", + "\n", + "params = {'legend.fontsize': 'x-large',\n", + " 'axes.labelsize': 'x-large',\n", + " 'axes.titlesize':'x-large',\n", + " 'xtick.labelsize':'x-large',\n", + " 'ytick.labelsize':'x-large',\n", + " \"lines.linewidth\": 2.5}\n", + "plt.rcParams.update(params)" + ] + }, + { + "cell_type": "markdown", + "id": "a1f86685-363c-44f9-888a-6f3d4a6459cf", + "metadata": {}, + "source": [ + "### <a name=\"error_vs_cardinality\"></a>1. Error vs Cardinality \n", + "We study the error behaviour as the input cardinality increases. We provide an initial comparison to highlight the differences in the estimators.\n", + "These plots are crucial to understanding error distributions of sketches.\n", + "\n", + "To generate the synthetic data, we use \"Fibonacci Hashing\" as a cheap way to generate a pseudorandom sequence. This process starts with an initial selection of a $64$-bit integer. Then, for every new item that must be generated, we add the full $64$ bit range scaled by the integer golden ratio so that every other update _intentionally_ overflows and maps once more back into the $64$ bit range.\n", + "\n", + "#### 1a. Single sketch estimation\n", + "\n", + "From the same directory as this notebook, run\n", + "\n", + "```./cardinality_error_experiment.py -lgk 14 -lgt 7 -lgN 20``` \n", + "\n", + "which generates $8$ \"plot points\" between every power of $2$ not exceeding $N = 2^{21}$. We fix `-lgt 7` for $128$ trials and use a sketch size of with $2^{14}$. For every trial, a fresh sketch is initialised. The results are saved in a directory `hll_accuracy_profile_yyymmdd/...`\n", + "\n", + "The two sketches we compare are HyperLogLog sketches with $2^{14}$ buckets:\n", + "\n", + "*ASF DataSketches Sketch Setup*\n", + "- `hll = asf.hll_sketch(14)`\n", + "\n", + "*datasketch sketch setup*\n", + "- `h = ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])`\n" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "433f73f7-7844-448c-b7fd-16122a3a8b4c", + "metadata": {}, + "outputs": [], + "source": [ + "asf_errors = pd.read_csv(\"hll_accuracy_profile_20230504/DataSketches_hll_1516lgK_14_lgT_8trials_256.csv\", index_col=0)\n", + "ds__errors = pd.read_csv(\"hll_accuracy_profile_20230504/datasketch_hll_1516lgK_14_lgT_8trials_256.csv\", index_col=0)\n", + "\n", + "lgk = 14" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "6e8fb853-1e79-4d2a-93a2-5e98c9e3c1b3", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Text(0.5, 0, 'Input cardinality $n$')" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAABCIAAAMKCAYAAABQieWGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/SrBM8AAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdvA4d+m90aHBEJJ6L33XpUioAiIIF2KYvmsrwIqtlfQABYQCUVfRIr03iF0kV4CIRBqAgkphPTM98ewLbubRkjjua8rF5mzZ2bObGaHnWfOeY5GURQFIYQQQgghhBBCiHxgVdANEEIIIYQQQgghxLNDAhFCCCGEEEIIIYTINxKIEEIIIYQQQgghRL6RQIQQQgghhBBCCCHyjQQihBBCCCGEEEIIkW8kECGEEEIIIYQQQoh8I4EIIYQQQgghhBBC5Bubgm6AEMJUSkoKaWlpBd0MIYQQQggh [...] + "text/plain": [ + "<Figure size 1200x800 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "fig, ax = plt.subplots(figsize=(12,8))\n", + "\n", + "methods = [\"ASF\", \"datasketch\"]\n", + "\n", + "for i, (method, colour, df) in enumerate(zip(methods, [asf_color, ds__color], [asf_errors, ds__errors])):\n", + " xn = df.index \n", + " median = df.median(axis=1)\n", + " q95 = df.quantile(q=0.977725, axis=1)\n", + " q05 = df.quantile(q=0.022275, axis=1) # df.mean(axis=1) - df.std(axis=1)\n", + " ax.plot(xn, median,\n", + " color=colour, label=method+\": median\")\n", + " ax.plot(xn, q95,\n", + " color=colour, linestyle=q90_ls)\n", + " ax.plot(xn, q05,\n", + " color=colour, linestyle=q90_ls, label=method+\": 90% CI\")\n", + "\n", + "ax.plot(xn, 2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n", + " color=\"red\", lw=3., linestyle=\":\", label=\"q = 0.97725\")\n", + "ax.plot(xn, -2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n", + " color=\"red\", lw=3., linestyle=\"-.\", label=\"q=0.2275\")\n", + "\n", + "ax.set_xscale('log', base=10)\n", + "ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),\n", + " ncol=3, fancybox=True)\n", + "ax.grid()\n", + "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n", + "ax.set_xlabel(r\"Input cardinality $n$\")" + ] + }, + { + "cell_type": "markdown", + "id": "c253d24a-9a21-4471-ae25-5c494a1d4dd6", + "metadata": {}, + "source": [ + "**Experiment Summary**\n", + "\n", + "The plotted red lines represent $\\pm 2 \\sigma$ where, for HLL, the standard error of the estimator is $\\sigma = 1.04 / \\sqrt{k}$. \n", + "Hence, the area between the red lines is a confidence interval. We have chosen the quantiles \n", + "$q = 0.022275, 0.977725$ so that the area between the red lines is approximately a $95\\%$ confidence interval.\n", + "This test uses a number of buckets $k = 2^{14}$. Changing the number of buckets in the sketch alters the standard error so the red lines, and the behaviour of the error curves at quantile level $q$, would change appropriately.\n", + "\n", + "\n", + "Both implementations have a median that is centered about an error of $0$, suggesting that they return unbiased estimators of the true cardinality. At small cardinalities, `asf.hll_sketch` is vastly better than the `datasketch.HyperLogLogPlusPlus`. This is because it transitions through various estimators to ensure small error. All of the curves for `ASF:HLL` have zero error until about $n = 10^3$ is in sparse mode and an exact estimator can be used. Without this behavior and re [...] + "\n", + "At large cardinalities, _for a single sketch_, `asf.hll_sketch` uses the HIP estimator which mildly improves on the standard large cardinality estimator. However, when we need to merge sketches later on, this behaviour will revert to the standard HLL estimator.\n", + "\n", + "### 2. Single sketch estimation: Real data\n", + "Now let's see what happens when we want sketch some real data. \n", + "We will download an opensource dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "1131e64e-cd98-4524-a33a-fd97eea1816d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " % Total % Received % Xferd Average Speed Time Time Time Current\n", + " Dload Upload Total Spent Left Speed\n", + "100 110M 0 110M 0 0 1150k 0 --:--:-- 0:01:38 --:--:-- 1231k8k 0 --:--:-- 0:00:23 --:--:-- 1164k 0 0 913k 0 --:--:-- 0:00:30 --:--:-- 1065k 1193k 0 --:--:-- 0:01:09 --:--:-- 1544k\n" + ] + } + ], + "source": [ + "!curl -o bitcoin.zip \"https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip\"" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "f2ee8d0e-cad9-4d61-8ff1-4e0a7db66aae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Archive: bitcoin.zip\n", + " inflating: BitcoinHeistData.csv \n" + ] + } + ], + "source": [ + "!unzip bitcoin.zip" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "033e8d4b-4ac9-4260-a74c-6aaafd4b0552", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>address</th>\n", + " <th>year</th>\n", + " <th>day</th>\n", + " <th>length</th>\n", + " <th>weight</th>\n", + " <th>count</th>\n", + " <th>looped</th>\n", + " <th>neighbors</th>\n", + " <th>income</th>\n", + " <th>label</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>111K8kZAEnJg245r2cM6y9zgJGHZtJPy6</td>\n", + " <td>2017</td>\n", + " <td>11</td>\n", + " <td>18</td>\n", + " <td>0.008333</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>100050000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>1123pJv8jzeFQaCV4w644pzQJzVWay2zcA</td>\n", + " <td>2016</td>\n", + " <td>132</td>\n", + " <td>44</td>\n", + " <td>0.000244</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>100000000.0</td>\n", + " <td>princetonLocky</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>112536im7hy6wtKbpH1qYDWtTyMRAcA2p7</td>\n", + " <td>2016</td>\n", + " <td>246</td>\n", + " <td>0</td>\n", + " <td>1.000000</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>200000000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7</td>\n", + " <td>2016</td>\n", + " <td>322</td>\n", + " <td>72</td>\n", + " <td>0.003906</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>71200000.0</td>\n", + " <td>princetonCerber</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>1129TSjKtx65E35GiUo4AYVeyo48twbrGX</td>\n", + " <td>2016</td>\n", + " <td>238</td>\n", + " <td>144</td>\n", + " <td>0.072848</td>\n", + " <td>456</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>200000000.0</td>\n", + " <td>princetonLocky</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " address year day length weight count \\\n", + "0 111K8kZAEnJg245r2cM6y9zgJGHZtJPy6 2017 11 18 0.008333 1 \n", + "1 1123pJv8jzeFQaCV4w644pzQJzVWay2zcA 2016 132 44 0.000244 1 \n", + "2 112536im7hy6wtKbpH1qYDWtTyMRAcA2p7 2016 246 0 1.000000 1 \n", + "3 1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7 2016 322 72 0.003906 1 \n", + "4 1129TSjKtx65E35GiUo4AYVeyo48twbrGX 2016 238 144 0.072848 456 \n", + "\n", + " looped neighbors income label \n", + "0 0 2 100050000.0 princetonCerber \n", + "1 0 1 100000000.0 princetonLocky \n", + "2 0 2 200000000.0 princetonCerber \n", + "3 0 2 71200000.0 princetonCerber \n", + "4 0 1 200000000.0 princetonLocky " + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bitcoin_df = pd.read_csv(\"BitcoinHeistData.csv\", header=0)\n", + "bitcoin_df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ba7fc55f-7214-4b20-a6c7-822207bb5922", + "metadata": {}, + "source": [ + "Let's focus on the simple task of counting how many unique addresses are present in the dataset.\n", + "With native pandas functionality, we see that there are about $2.6$ million unique addresses.\n", + "We will use HLL sketches to estimate this count." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "52a71365-d54d-4295-b714-4ce6201be6c6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 811 ms, sys: 33.8 ms, total: 845 ms\n", + "Wall time: 846 ms\n" + ] + } + ], + "source": [ + "%%time\n", + "true_count = bitcoin_df[\"address\"].nunique()" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "07041390-365d-4dff-91cd-7f20bc50273c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "There are 2631095 unique addresses\n" + ] + } + ], + "source": [ + "print(f\"There are {true_count} unique addresses\")" + ] + }, + { + "cell_type": "markdown", + "id": "8cded8f4-1d0c-4dcc-b4cd-73015394a925", + "metadata": {}, + "source": [ + "Now define equivalent sketches from both libraries.\n", + "We use $2^{14}$ buckets and $8$-bit HyperLogLog sketches for each implementation. The `datsketch:HLL` uses the MurmurHash library so that we have equivalent sketches for comparison. " + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "2bd9ab89-a705-4040-b14a-014a18fff9bb", + "metadata": {}, + "outputs": [], + "source": [ + "asf_hll = asf.hll_sketch(14, asf.HLL_8)\n", + "ds_hll = ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "356e152b-47f1-48d9-ae2f-bc5bd17ad244", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 2.08 s, sys: 8.87 ms, total: 2.08 s\n", + "Wall time: 2.09 s\n" + ] + } + ], + "source": [ + "%%time\n", + "for ad in bitcoin_df[\"address\"]:\n", + " asf_hll.update(ad)" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "3540529f-e8ed-4f20-98c3-1f5dabf8ef12", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 3.65 s, sys: 10.4 ms, total: 3.66 s\n", + "Wall time: 3.66 s\n" + ] + } + ], + "source": [ + "%%time\n", + "for ad in bitcoin_df[\"address\"]:\n", + " ds_hll.update(ad)" + ] + }, + { + "cell_type": "markdown", + "id": "c2d3e4f3-b878-4945-a3dd-5fe513a86a6a", + "metadata": {}, + "source": [ + "On this simple example we see that the datasketches implementation takes about $2$ seconds compared to almost $4$ for `datasketch`. Note that these times are longer than for the native Pandas call to nunique; this is not a problem because, unlike `pd.nunique(.)` the sketches are designed for large datasets not entirely held in memory. The figures quoted are on a $2023$ Apple Macbook Pro with Apple M1 Pro and absolute numbers may differ slightly.\n", + "\n", + "For estimation accuracy, we have the following behaviour for the single sketches. Since we have the true count, we can also evaluate the error." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "d529da07-3752-4685-8203-7dce6dbfa892", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ASF estimate: 2650083.4660\n", + "ASF error: 0.7217 %\n" + ] + } + ], + "source": [ + "# DataSketches\n", + "print(f\"ASF estimate: {asf_hll.get_estimate():.4f}\")\n", + "print(f\"ASF error: {100*(asf_hll.get_estimate()-true_count)/true_count:.4f} %\")" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "71531fc9-8191-4f0d-83a3-e2ccee38d56c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "datasketch estimate: 2646133.7361\n", + "datasketch error: 0.5716 %\n" + ] + } + ], + "source": [ + "# Datasketch\n", + "print(f\"datasketch estimate: {ds_hll.count():.4f}\")\n", + "print(f\"datasketch error: {100*(ds_hll.count() - true_count)/true_count:.4f} %\")" + ] + }, + { + "cell_type": "markdown", + "id": "c2b08dd1-00c1-4f15-93bb-51ba60f23a69", + "metadata": {}, + "source": [ + "On this example, the datasketch implementation has a lower error than the ASF method. However, this was a single sketch so we cannot draw any strong conclusions. Rather, we would have to study the error distribution as previously done in [Section 1](error_vs_cardinality). \n", + "\n", + "We run $25$ independent trials of each algorithm, each trial with a fresh sketch.\n", + "Since HLL is deterministic given the hash seed, with no change to the input we would obtain the same output every time. To avoid this, we prepend the trial number to every incoming string so that the number of unique items remains the same but the streams are superficially different." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "1c5a7c42-cdf7-4632-a96d-c23a7bccc95e", + "metadata": {}, + "outputs": [], + "source": [ + "lgk = 14\n", + "num_trials = 25\n", + "\n", + "all_asf_hll = [asf.hll_sketch(14, asf.HLL_8) for _ in range(num_trials)]\n", + "all_ds_hll = [ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, signed=False)[0]) for _ in range(num_trials)]\n", + "\n", + "asf_hll_estimates = np.zeros((num_trials,), dtype=float)\n", + "asf_hll_errors = np.zeros_like(asf_hll_estimates)\n", + "ds__hll_estimates = np.zeros_like(asf_hll_estimates)\n", + "ds__hll_errors = np.zeros_like(asf_hll_estimates)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "e51ee247-6f96-4418-a732-3f2f7d77dc87", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 57.9 s, sys: 131 ms, total: 58 s\n", + "Wall time: 58.2 s\n" + ] + } + ], + "source": [ + "%%time\n", + "for trial in range(num_trials):\n", + " for ad in bitcoin_df[\"address\"]:\n", + " all_asf_hll[trial].update(str(trial) + ad)\n", + " asf_hll_estimates[trial] = all_asf_hll[trial].get_estimate()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "768248dc-43d9-41d6-a379-7c9a7b60eb6d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 1min 41s, sys: 266 ms, total: 1min 41s\n", + "Wall time: 1min 41s\n" + ] + } + ], + "source": [ + "%%time\n", + "for trial in range(num_trials):\n", + " for i, ad in enumerate(bitcoin_df[\"address\"]):\n", + " all_ds_hll[trial].update(str(trial) + ad)\n", + " ds__hll_estimates[trial] = all_ds_hll[trial].count()" + ] + }, + { + "cell_type": "markdown", + "id": "c1303e34-cc8f-4975-ade1-4f68b0e006d7", + "metadata": {}, + "source": [ + "The ASF HLL runs in about half of the time as the datasketch implementation.\n", + "However, we are also interested in the distribution of errors for each sketch implementation.\n", + "Since we have fewer trials than in Section 1, we plot a box and whisker diagram which is still useful in understanding the error distribution, despite being less informative about the full error distribution than the pitchfork plots from Section 1. The plot can be interpreted as a cross-section of the pitchfork plot at the vertical line $n = 2631095$." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "33a089c7-f8b5-4320-a58e-17f6e590b1a8", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAnwAAAGiCAYAAAB9MRuUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/SrBM8AAAACXBIWXMAAA9hAAAPYQGoP6dpAABzy0lEQVR4nO3deVxU1f8/8NewLwqCaIIgCIpSmvsGiYofcCs1IbWPWmqWWq5UP7VFxFyqT2lamqWFpn7LRAP8aAQfVFBQBJe0EsdYVERzQWeQdYD7++PKyDggAwzLzLyejwePmTnn3HPeMI68OffecySCIAggIiIiIr1l1NQBEBEREVHDYsJHREREpOeY8BERERHpOSZ8RERERHqOCR8RERGRnmPCR0RERKTnmPARERER6TkmfERERER6zqSpA6Dmoby8HDk5OWjZsiUkEklTh0NEREQaEAQBeXl5cHJygpFR9fN4 [...] + "text/plain": [ + "<Figure size 640x480 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "asf_hll_errors = (asf_hll_estimates - true_count) / true_count\n", + "datasketch_hll_errors = (ds__hll_estimates - true_count) / true_count\n", + "\n", + "for arr in [asf_hll_errors, datasketch_hll_errors]:\n", + " arr.sort()\n", + "\n", + "fig, ax = plt.subplots()\n", + "error_data = {\"ASF\": asf_hll_errors, \"datasketch\": datasketch_hll_errors}\n", + "\n", + "#\n", + "box_plot = ax.boxplot(list(error_data.values()), vert=True)\n", + "for median in box_plot['medians']:\n", + " median.set_color('black')\n", + " \n", + "\n", + "\n", + "ax.scatter(np.ones_like(asf_hll_errors), asf_hll_errors, marker=\".\")\n", + "ax.scatter(2*np.ones_like(datasketch_hll_errors), datasketch_hll_errors, marker=\"x\")\n", + "ax.set_xticks([1,2], list(error_data.keys()))\n", + "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n", + "ax.grid()" + ] + }, + { + "cell_type": "markdown", + "id": "6611561e-75ce-4a2e-a6a1-00b8cdef81f5", + "metadata": {}, + "source": [ + "The error statistics from the experiment are as follows." + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "350a4171-0505-48e1-ac2c-98afa05387b6", + "metadata": {}, + "outputs": [], + "source": [ + "experiment_error_statistics = {\n", + " \"ASF\": {\"median\" : None, \"IQR\" : None}, \n", + " \"datasketch\": {\"median\" : None, \"IQR\" : None}\n", + "}\n", + "key_list = list(experiment_error_statistics.keys())\n", + "\n", + "for i, m in enumerate(box_plot[\"medians\"]):\n", + " method_median = m.get_data()[1][0]\n", + " experiment_error_statistics[key_list[i]][\"median\"] = method_median\n", + "\n", + "for i, line in enumerate(box_plot[\"boxes\"]):\n", + " liney = line.get_ydata()\n", + " iqr = liney.max() - liney.min()\n", + " experiment_error_statistics[key_list[i]][\"IQR\"] = iqr" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "09f7c6eb-ff32-4679-ac0e-2d81bcd95cf1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'ASF': {'median': 0.0015215127060363715, 'IQR': 0.009980189698406661},\n", + " 'datasketch': {'median': 0.0025901816561035075, 'IQR': 0.015456189289630978}}" + ] + }, + "execution_count": 61, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "experiment_error_statistics" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "id": "c3cf31cb-9769-4e2b-9645-10a502ad5d6b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Library Median Error IQR\n", + "----------------------------------------\n", + "ASF 0.0015 0.0100\n", + "datasketch 0.0026 0.0155\n", + "\n", + "\n", + "\n", + "ErrorxChange Median Error IQR\n", + "----------------------------------------\n", + " 0.5874 0.6457\n" + ] + } + ], + "source": [ + "print(\"{:<12}{:>14}{:>14}\".format(\"Library\", \"Median Error\", \"IQR\"))\n", + "print(\"-\"*40)\n", + "for k,vd in experiment_error_statistics.items():\n", + " print(\"{:<12}{:>14.4f}{:>14.4f}\".format(k, vd[\"median\"], vd[\"IQR\"]))\n", + " \n", + "print(\"\\n\"*2)\n", + "print(\"{:<12}{:>14}{:>14}\".format(\"ErrorxChange\", \"Median Error\", \"IQR\"))\n", + "print(\"-\"*40)\n", + "median_factor = experiment_error_statistics[\"ASF\"][\"median\"] / experiment_error_statistics[\"datasketch\"][\"median\"]\n", + "iqr_factor = experiment_error_statistics[\"ASF\"][\"IQR\"] / experiment_error_statistics[\"datasketch\"][\"IQR\"]\n", + "print(\"{:<12}{:>14.4f}{:>14.4f}\".format(\"\", median_factor, iqr_factor))" + ] + }, + { + "cell_type": "markdown", + "id": "cbb58e63-cbba-43ac-a015-05b9471b6212", + "metadata": {}, + "source": [ + "The columns in the second table above are found by taking the ratio of the smaller ASF results and the larger datasketch results. \n", + "They indicate that the median error in using the ASF library is about $58\\%$ of that incurred when using the datasketch library and the interquartile range is tighter, being about $65\\%$ as large as that from the `datasketch` implementation." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/jupyter/comparison-to-datasketch/utils.py b/jupyter/comparison-to-datasketch/utils.py new file mode 100644 index 0000000..d69e892 --- /dev/null +++ b/jupyter/comparison-to-datasketch/utils.py @@ -0,0 +1,37 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import numpy as np + +def distinct_number_sequence(start:np.uint64=0) -> np.uint64: + """Generator function to make 64-bit numbers that are distinct.""" + assert isinstance(start, np.uint64) + num = start + golden_ratio = np.uint64(0x9e3779b97f4a7c13) + while True: + yield num + num += np.uint64(golden_ratio) + +if __name__ == '__main__': + start = np.uint64(2345234635) + ndistincts = np.uint64(10) + count = 0 + for i in distinct_number_sequence(start): + print(i, end="\n") + count += 1 + if count == ndistincts: + break --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
