(datasketches-python) 01/01: Initial error comparison

charlie Thu, 16 Nov 2023 08:10:53 -0800

This is an automated email from the ASF dual-hosted git repository.

charlie pushed a commit to branch lib-comparison
in repository https://gitbox.apache.org/repos/asf/datasketches-python.git


commit 8dfa66eaae05589cad2b90c8c8d9ef9a88958fb4
Author: Charlie Dickens <[email protected]>
AuthorDate: Thu Nov 16 16:07:54 2023 +0000

    Initial error comparison
---
 jupyter/comparison-to-datasketch/README.md         |  37 +
 .../comparison-to-datasketch/api-comparison.ipynb  | 922 +++++++++++++++++++++
 .../cardinality_error_experiment.py                | 172 ++++
 .../single-sketch-accuracy.ipynb                   | 762 +++++++++++++++++
 jupyter/comparison-to-datasketch/utils.py          |  37 +
 5 files changed, 1930 insertions(+)

diff --git a/jupyter/comparison-to-datasketch/README.md 
b/jupyter/comparison-to-datasketch/README.md
new file mode 100644
index 0000000..ed03e61
--- /dev/null
+++ b/jupyter/comparison-to-datasketch/README.md
@@ -0,0 +1,37 @@
+# A Comparison of Python Sketching Libraries
+
+These notebooks aim to provide a fair comparison between the Python support 
for 
+[Apache Software Foundation (ASF) 
DataSketches](https://pypi.org/project/datasketches/) and the 
[datasketch](https://pypi.org/project/datasketch/)
+library.    
+The notebooks are not an attempt to fully characterize either library's 
implementation, rather to highlight differences that manifest as a result 
+of design choices for each library.
+
+## Summary
+
+As of the time of writing, November 2023, the versions under comparison are:
+```
+Name: datasketches
+Version: 4.1.0
+
+Name: datasketch
+Version: 1.6.4
+```
+
+## 1. Distinct Counting 
+
+### 1a. HyperLogLog
+Each library implements various algorithms that fall under the HyperLogLog 
umbrella. 
+They can be summarized as follows
+
+|                       |   ASF                                                
                                                                                
      |   datasketch                                   |                        
                                                                                
   |
+|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
+|                       |   HyperLogLog                                        
                                                                                
      |   HyperLogLog                                  |   HyperLogLog++        
                                                                                
   |
+|   Hash functions      |   64 bit MurmurHash                                  
                                                                                
      |   SHA1 32 bit.  Others are possible.           |   SHA1 64 bit.  Others 
are possible.                                                                   
   |
+|   Bucket sizes (bits) |   4, 6, 8                                            
                                                                                
      |   4                                            |   8                    
                                                                                
   |
+|   Estimator           | Exact mode for small cardinalities  Historic Inverse 
Probability (HIP) for a single sketch; Harmonic mean after merging multiple 
sketches  | No exact mode.  Harmonic mean for all sketches | Harmonic mean for 
all sketches (single or merged) with bias corrections at small and large 
cardinalities  |
+
+
+Since `datasketch.hyperloglog.HyperLogLog` uses only $32$ bit hash functions, 
we do not regard this as an appropriate solution for 
+industry-applications because the error estimates will vary wildly when the 
input is large enough.
+To ensure the fairest comparison between the sketches, we will only compare 
the ASF implementations with `datasketch.hyperloglog.HyperLogLogPlusPlus`
+with the MurmurHash function over $64$ bits using `mmh3.hash64(x, 
signed=False)[0]`.
\ No newline at end of file
diff --git a/jupyter/comparison-to-datasketch/api-comparison.ipynb 
b/jupyter/comparison-to-datasketch/api-comparison.ipynb
new file mode 100644
index 0000000..3d6dabb
--- /dev/null
+++ b/jupyter/comparison-to-datasketch/api-comparison.ipynb
@@ -0,0 +1,922 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "547cbc07",
+   "metadata": {},
+   "source": [
+    "# Implementation Comparison ASF DataSketches vs datasketch\n",
+    "\n",
+    "We compare implementations of the HyperLogLog algorithm from the [Apache 
Software Foundation 
DataSketches](https://github.com/apache/datasketches-cpp/tree/master/python) 
library and the open-source python 
[datasketch](https://github.com/ekzhu/datasketch).  Both libraries present 
python implementations or bindings for common \"sketching\" algorithms.\n",
+    "In the writing we abbreviate the two implementations as `ASF:HLL` for the 
ASF DataSketches \n",
+    "and `datasketch:HLL` for the other implementation.\n",
+    "\n",
+    "This notebook needs easily available libraries that are available on 
PyPi."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "7e9b5d90",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "import numpy as np\n",
+    "import datasketches as asf\n",
+    "import datasketch as ds\n",
+    "import mmh3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "33a4124b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Plotting parameters\n",
+    "method_plot_params = {\n",
+    "    \"asf\" : {\"color\": \"C0\", \"marker\": '.'},\n",
+    "    \"datasketch\" : {\"color\": \"C1\", \"marker\": \"^\"}\n",
+    "}\n",
+    "asf_color = method_plot_params[\"asf\"][\"color\"]\n",
+    "ds__color = method_plot_params[\"datasketch\"][\"color\"]\n",
+    "q90_ls = \"--\"\n",
+    "\n",
+    "params = {'legend.fontsize': 'x-large',\n",
+    "     'axes.labelsize': 'x-large',\n",
+    "     'axes.titlesize':'x-large',\n",
+    "     'xtick.labelsize':'x-large',\n",
+    "     'ytick.labelsize':'x-large',\n",
+    "      \"lines.linewidth\": 2.5}\n",
+    "plt.rcParams.update(params)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11f9a14b",
+   "metadata": {},
+   "source": [
+    "## HyperLogLog API\n",
+    "The two libraries implement HyperLogLog (HLL) slightly differently.  We 
aim to find a baseline that highlights these differences.  The following tables 
summarize the differences.\n",
+    "\n",
+    "| Feature | ASF DataSketches |\n",
+    "| --- | --- |\n",
+    "| **Type of hash functions** | 64-bit Murmurhash only   |\n",
+    "| **Bucket sizes (bits)** |  $4,6$ or $8$ bits.  Bucket size does not 
affect accuracy, only sketch size in bits. | \n",
+    "|**Estimator** | Smoothly transitions across various estimators for 
greatest accuracy.  Implements a higher accuracy estimator for the \"single\" 
sketch setting [cite here] |\n",
+    "\n",
+    "| Feature | datasketch |\n",
+    "| --- | --- |\n",
+    "| **Type of hash functions** |  `hashlib` SHA1 by default but any other 
callable hash functions (e.g. 32 or 64 bit Murmurhash or xxhash) |\n",
+    "| **Bucket sizes (bits)** | $4$ for $32$ bit hash, $8$ for $64$ bit hash. 
| \n",
+    "|**Estimator** | Only the \"standard\" HLL estimators. |\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9e0d0c7a",
+   "metadata": {},
+   "source": [
+    "### <a name=\"error_vs_cardinality\"></a>1. Error vs Cardinality \n",
+    "We study the error behaviour as the input cardinality increases.  We 
provide an initial comparison to highlight the differences in the 
estimators.\n",
+    "These plots are crucial to understanding error distributions of sketches 
and are not presented in the `datasketch` library documentation.\n",
+    "\n",
+    "To generate the synthetic data, we use \"Fibonacci Hashing\" as a cheap 
way to generate a pseudorandom sequence.  This process starts with an initial 
selection of a $64$-bit integer.  Then, for every new item that must be 
generated, we add the full $64$ bit range scaled by the integer golden ratio so 
that every other update _intentionally_ overflows and maps once more back into 
the $64$ bit range.\n",
+    "\n",
+    "#### 1a. Single sketch estimation\n",
+    "\n",
+    "From the project root run\n",
+    "\n",
+    "```./cardinality_error_experiment.py -lgk 14 -lgt 7  -lgN 20``` \n",
+    "\n",
+    "which generates $8$ \"plot points\" between every power of $2$ not 
exceeding $N = 2^{21}$.  We fix `-lgt 7` for $128$ trials and use a sketch size 
of with $2^{14}$.  For every trial, an fresh sketch is initialised.\n",
+    "\n",
+    "*ASF DataSketches Sketch Setup*\n",
+    "- `hll = asf.hll_sketch(self.sketch_lgk)`\n",
+    "\n",
+    "*datasketch sketch setup*\n",
+    "- `h = ds.HyperLogLogPlusPlus(p=self.sketch_lgk, hashfunc=lambda x: 
mmh3.hash64(x, signed=False)[0])`\n",
+    "\n",
+    "This setting uses the default target type for the datasketches 
implementation which is $8$ bits per bucket.\n",
+    "Does HLL bucket size only affect the serialization?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "1a237439",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "asf_errors = 
pd.read_csv(\"../hll_accuracy_profile_20230504/DataSketches_hll_1516lgK_14_lgT_8trials_256.csv\",
 index_col=0)\n",
+    "ds__errors = 
pd.read_csv(\"../hll_accuracy_profile_20230504/datasketch_hll_1516lgK_14_lgT_8trials_256.csv\",
 index_col=0)\n",
+    "\n",
+    "lgk = 14"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "a542b13c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Text(0.5, 0, 'Input cardinality $n$')"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABCIAAAMKCAYAAABQieWGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdvA4d+m90aHBEJJ6L33XpUioAiIIF2KYvmsrwIqtlfQABYQCUVfRIr03iF0kV4CIRBqAgkphPTM98ewLbubRkjjua8rF5mzZ2bObGaHnWfOeY5GURQFIYQQQgghhBBCiHxgVdANEEIIIYQQQgghxLNDAhFCCCGEEEIIIYTINxKIEEIIIYQQQgghRL6RQIQQQgghhBBCCCHyjQQihBBCCCGEEEIIkW8kECGEEEIIIYQQQoh8I4EIIYQQQgghhBBC5Bubgm6AEMJUSkoKaWlpBd0MIYQQQggh
 [...]
+      "text/plain": [
+       "<Figure size 1200x800 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(12,8))\n",
+    "\n",
+    "methods = [\"ASF\", \"datasketch\"]\n",
+    "\n",
+    "for i, (method, colour, df) in enumerate(zip(methods, [asf_color, 
ds__color], [asf_errors, ds__errors])):\n",
+    "    xn = df.index \n",
+    "    median = df.median(axis=1)\n",
+    "    q95 = df.quantile(q=0.977725, axis=1)\n",
+    "    q05 = df.quantile(q=0.022275, axis=1) # df.mean(axis=1) - 
df.std(axis=1)\n",
+    "    ax.plot(xn, median,\n",
+    "           color=colour, label=method+\": median\")\n",
+    "    ax.plot(xn, q95,\n",
+    "           color=colour, linestyle=q90_ls)\n",
+    "    ax.plot(xn, q05,\n",
+    "           color=colour, linestyle=q90_ls, label=method+\": 90% CI\")\n",
+    "\n",
+    "ax.plot(xn, 2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n",
+    "        color=\"red\", lw=3., linestyle=\":\", label=\"q = 0.97725\")\n",
+    "ax.plot(xn, -2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n",
+    "        color=\"red\", lw=3., linestyle=\"-.\",  label=\"q=0.2275\")\n",
+    "\n",
+    "ax.set_xscale('log', base=10)\n",
+    "ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),\n",
+    "          ncol=3, fancybox=True)\n",
+    "ax.grid()\n",
+    "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n",
+    "ax.set_xlabel(r\"Input cardinality $n$\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce0cc56d",
+   "metadata": {},
+   "source": [
+    "**Experiment Summary**\n",
+    "\n",
+    "The plotted red lines represent $\\pm 2 \\sigma$ where, for HLL, the 
standard error of the estimator is $\\sigma = 1.04 / \\sqrt{k}$. \n",
+    "Hence, the area between the red lines is a confidence interval.  We have 
chosen the quantiles \n",
+    "$q = 0.022275, 0.977725$ so that the area between the red lines is 
approximately a $90\\%$ confidence interval.\n",
+    "This test uses a number of buckets $k = 2^{14}$.  Changing the number of 
buckets in the sketch alters the standard error so the red lines, and the 
behaviour of the error curves at quantile level $q$, would change 
appropriately.\n",
+    "\n",
+    "\n",
+    "Both implementations have a median that is centered about an error of 
$0$, suggesting that they are unbiased estimators.  At small cardinalities, 
`ASF:HLL` is vastly better than the `datasketch:HLL`.  This is because it 
transitions through various estimators to ensure small error.  All of the 
curves for `ASF:HLL` have zero error until about $n = 10^3$ because it is in 
sparse mode and a different estimator can be used.  Without this behaviour and 
resorting to the standard change of e [...]
+    "\n",
+    "At large cardinalities, _for a single sketch_, `ASF:HLL` uses the HIP 
estimator which mildly improves on the standard large cardinality estimator.  
However, when we need to merge sketches later on, this behaviour will revert to 
the standard HLL estimator.\n",
+    "\n",
+    "### 2.  Single sketch estimation: Real data\n",
+    "Now let's see what happens when we want sketch some real data.  \n",
+    "We will download an opensource dataset from the [UCI Machine Learning 
Repository](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 135,
+   "id": "91403f05",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current\n",
+      "                                 Dload  Upload   Total   Spent    Left  
Speed\n",
+      "100  110M  100  110M    0     0  5252k      0  0:00:21  0:00:21 
--:--:-- 6128k258k      0  0:07:18  0:00:01  0:07:17  260k3:51  0:00:02  
0:03:49  491k40k      0  0:00:22  0:00:16  0:00:06 6121k\n"
+     ]
+    }
+   ],
+   "source": [
+    "!curl -o bitcoin.zip 
\"https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip\"";
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 137,
+   "id": "b1ffc601",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Archive:  bitcoin.zip\n",
+      "  inflating: BitcoinHeistData.csv    \n"
+     ]
+    }
+   ],
+   "source": [
+    "!unzip bitcoin.zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "6b139532",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>address</th>\n",
+       "      <th>year</th>\n",
+       "      <th>day</th>\n",
+       "      <th>length</th>\n",
+       "      <th>weight</th>\n",
+       "      <th>count</th>\n",
+       "      <th>looped</th>\n",
+       "      <th>neighbors</th>\n",
+       "      <th>income</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111K8kZAEnJg245r2cM6y9zgJGHZtJPy6</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>11</td>\n",
+       "      <td>18</td>\n",
+       "      <td>0.008333</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>100050000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1123pJv8jzeFQaCV4w644pzQJzVWay2zcA</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>132</td>\n",
+       "      <td>44</td>\n",
+       "      <td>0.000244</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>100000000.0</td>\n",
+       "      <td>princetonLocky</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>112536im7hy6wtKbpH1qYDWtTyMRAcA2p7</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>246</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1.000000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>200000000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>322</td>\n",
+       "      <td>72</td>\n",
+       "      <td>0.003906</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>71200000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>1129TSjKtx65E35GiUo4AYVeyo48twbrGX</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>238</td>\n",
+       "      <td>144</td>\n",
+       "      <td>0.072848</td>\n",
+       "      <td>456</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>200000000.0</td>\n",
+       "      <td>princetonLocky</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                              address  year  day  length    weight  
count   \n",
+       "0   111K8kZAEnJg245r2cM6y9zgJGHZtJPy6  2017   11      18  0.008333     
 1  \\\n",
+       "1  1123pJv8jzeFQaCV4w644pzQJzVWay2zcA  2016  132      44  0.000244     
 1   \n",
+       "2  112536im7hy6wtKbpH1qYDWtTyMRAcA2p7  2016  246       0  1.000000     
 1   \n",
+       "3  1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7  2016  322      72  0.003906     
 1   \n",
+       "4  1129TSjKtx65E35GiUo4AYVeyo48twbrGX  2016  238     144  0.072848    
456   \n",
+       "\n",
+       "   looped  neighbors       income            label  \n",
+       "0       0          2  100050000.0  princetonCerber  \n",
+       "1       0          1  100000000.0   princetonLocky  \n",
+       "2       0          2  200000000.0  princetonCerber  \n",
+       "3       0          2   71200000.0  princetonCerber  \n",
+       "4       0          1  200000000.0   princetonLocky  "
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bitcoin_df = pd.read_csv(\"BitcoinHeistData.csv\", header=0)\n",
+    "bitcoin_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e84320f6",
+   "metadata": {},
+   "source": [
+    "Let's focus on the simple task of counting how many unique addresses are 
present in the dataset.\n",
+    "With native pandas functionality, we see that there are about $2.6$ 
million unique addresses.\n",
+    "We will use HLL sketches to estimate this count."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "00a8f7bf",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 1.15 s, sys: 89.9 ms, total: 1.24 s\n",
+      "Wall time: 1.26 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "true_count = bitcoin_df[\"address\"].nunique()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "30521d71",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "There are 2631095 unique addresses\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"There are {true_count} unique addresses\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "318fef9e",
+   "metadata": {},
+   "source": [
+    "Now define equivalent sketches from both libraries.\n",
+    "We use $2^{14}$ buckets and $8$-bit HyperLogLog sketches for each 
implementation.  The `datsketch:HLL` uses the MurmurHash library so that we 
have equivalent sketches for comparison."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "a5ee01e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "asf_hll = asf.hll_sketch(14, asf.HLL_8)\n",
+    "ds_hll = ds.HyperLogLogPlusPlus(p=14,  hashfunc=lambda x: mmh3.hash64(x, 
signed=False)[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "3652c114",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 3.93 s, sys: 32 ms, total: 3.96 s\n",
+      "Wall time: 4.01 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for ad in bitcoin_df[\"address\"]:\n",
+    "    asf_hll.update(ad)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "5b59567b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 9.33 s, sys: 36.8 ms, total: 9.37 s\n",
+      "Wall time: 9.52 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for ad in bitcoin_df[\"address\"]:\n",
+    "    ds_hll.update(ad)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aedd1cbe",
+   "metadata": {},
+   "source": [
+    "On this simple example we see that the datasketches implementation takes 
about $4$ seconds compared to about $10$ for datasketch.  Note that these times 
are longer than for the native Pandas call to nunique; this is not a problem 
because, unlike `pd.nunique(.)` the sketches are designed for large datasets 
not entirely held in memory.\n",
+    "\n",
+    "For estimation, we have the following behaviour for the single sketches.  
Since we have the true count, we can also evaluate the error."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "e2c2fb5c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "ASF estimate: 2650083.4660\n",
+      "ASF error: 0.7217  %\n"
+     ]
+    }
+   ],
+   "source": [
+    "# DataSketches\n",
+    "print(f\"ASF estimate: {asf_hll.get_estimate():.4f}\")\n",
+    "print(f\"ASF error: 
{100*(asf_hll.get_estimate()-true_count)/true_count:.4f}  %\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "04607338",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "datasketch estimate: 2646133.7361\n",
+      "datasketch error: 0.5716  %\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Datasketch\n",
+    "print(f\"datasketch estimate: {ds_hll.count():.4f}\")\n",
+    "print(f\"datasketch error: {100*(ds_hll.count() - 
true_count)/true_count:.4f}  %\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d88fc63f",
+   "metadata": {},
+   "source": [
+    "On this example, the datasketch implementation has a lower error than the 
ASF method.  However, this was a single sketch so we cannot draw any strong 
conclusions.  Rather, we would have to study the error distribution as 
previously done in [Section 1](error_vs_cardinality).  \n",
+    "\n",
+    "We run $25$ independent trials of each algorithm, each trial with a fresh 
sketch.\n",
+    "Since HLL is deterministic given the hash seed, with no change to the 
input we would obtain the same output every time.  To avoid this, we prepend 
the trial number to every incoming string so that the number of unique items 
remains the same but the streams are superficially different."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "4e7b941c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lgk = 14\n",
+    "num_trials = 25\n",
+    "\n",
+    "all_asf_hll = [asf.hll_sketch(14, asf.HLL_8) for _ in 
range(num_trials)]\n",
+    "all_ds_hll = [ds.HyperLogLogPlusPlus(p=14,  hashfunc=lambda x: 
mmh3.hash64(x, signed=False)[0]) for _ in range(num_trials)]\n",
+    "\n",
+    "asf_hll_estimates = np.zeros((num_trials,), dtype=float)\n",
+    "asf_hll_errors = np.zeros_like(asf_hll_estimates)\n",
+    "ds__hll_estimates = np.zeros_like(asf_hll_estimates)\n",
+    "ds__hll_errors = np.zeros_like(asf_hll_estimates)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "ec3fb80b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 2min 2s, sys: 1.34 s, total: 2min 3s\n",
+      "Wall time: 2min 9s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for trial in range(num_trials):\n",
+    "    for ad in bitcoin_df[\"address\"]:\n",
+    "        all_asf_hll[trial].update(str(trial) + ad)\n",
+    "    asf_hll_estimates[trial] = all_asf_hll[trial].get_estimate()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "68f3e787",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 4min 18s, sys: 806 ms, total: 4min 19s\n",
+      "Wall time: 4min 21s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for trial in range(num_trials):\n",
+    "    for i, ad in enumerate(bitcoin_df[\"address\"]):\n",
+    "        all_ds_hll[trial].update(str(trial) + ad)\n",
+    "    ds__hll_estimates[trial] = all_ds_hll[trial].count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "254e2d14",
+   "metadata": {},
+   "source": [
+    "The ASF HLL runs in about half of the time as the datasketch 
implementation.\n",
+    "However, we are also interested in the distribution of errors for each 
sketch implementation.\n",
+    "Since we have fewer trials than in Section 1, we plot a box and whisker 
diagram which is still useful in understanding the error distribution, despite 
being less informative about the full error distribution than the pitchfork 
plots from Section 1.  The plot can be interpreted as a cross-section of the 
pitchfork plot at the vertical line $n = 2631095$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "a00a0f70",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAnwAAAGiCAYAAAB9MRuUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABz8klEQVR4nO3deVzU1f4/8NewLwqCaIIgCIpSmvsGiYoX3EotSL1XLTVLLVeqn9oiYi7VLU1Ls7TQ1G+ZaIBXI7ioqKAILmk3cYxFRTQXdAZZB/j8/vjIyDggAwwMM/N6Ph48Zuac8znnDePIm/P5fM6RCIIggIiIiIgMlomuAyAiIiKixsWEj4iIiMjAMeEjIiIiMnBM+IiIiIgMHBM+IiIiIgPHhI+IiIjIwDHhIyIiIjJwTPiIiIiIDJyZrgOg5qGiogK5ublo2bIlJBKJrsMhIiIiDQiCgPz8fLi4uMDEpOZ5
 [...]
+      "text/plain": [
+       "<Figure size 640x480 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "asf_hll_errors = (asf_hll_estimates - true_count) / true_count\n",
+    "datasketch_hll_errors = (ds__hll_estimates - true_count) / true_count\n",
+    "\n",
+    "for arr in [asf_hll_errors, datasketch_hll_errors]:\n",
+    "    arr.sort()\n",
+    "\n",
+    "fig, ax = plt.subplots()\n",
+    "error_data = {\"ASF\": asf_hll_errors, \"datasketch\": 
datasketch_hll_errors}\n",
+    "\n",
+    "#\n",
+    "box_plot = ax.boxplot(list(error_data.values()), vert=True)\n",
+    "for median in box_plot['medians']:\n",
+    "    median.set_color('black')\n",
+    "    \n",
+    "\n",
+    "\n",
+    "ax.scatter(np.ones_like(asf_hll_errors), asf_hll_errors,  
marker=\".\")\n",
+    "ax.scatter(2*np.ones_like(datasketch_hll_errors), datasketch_hll_errors, 
marker=\"x\")\n",
+    "ax.set_xticks([1,2], list(error_data.keys()))\n",
+    "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n",
+    "ax.grid()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3755cc1f",
+   "metadata": {},
+   "source": [
+    "The error statistics from the experiment are as follows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "a8b6c959",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "experiment_error_statistics = {\n",
+    "    \"ASF\": {\"median\" : None, \"IQR\" : None}, \n",
+    "    \"datasketch\": {\"median\" : None, \"IQR\" : None}\n",
+    "}\n",
+    "key_list = list(experiment_error_statistics.keys())\n",
+    "\n",
+    "for i, m in enumerate(box_plot[\"medians\"]):\n",
+    "    method_median = m.get_data()[0][0]\n",
+    "    experiment_error_statistics[key_list[i]][\"median\"] = 
method_median\n",
+    "\n",
+    "for i, line in enumerate(box_plot[\"boxes\"]):\n",
+    "    liney = line.get_ydata()\n",
+    "    iqr = liney.max() - liney.min()\n",
+    "    experiment_error_statistics[key_list[i]][\"IQR\"] = iqr"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "35b55d5f",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'ASF': {'median': 0.925, 'IQR': 0.009980189698406661},\n",
+       " 'datasketch': {'median': 1.925, 'IQR': 0.015456189289630978}}"
+      ]
+     },
+     "execution_count": 41,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "experiment_error_statistics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "27d5b8ae",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Library     Median      IQR         \n",
+      "------------------------------------\n",
+      "ASF         0.9250      0.0100      \n",
+      "datasketch  1.9250      0.0155      \n",
+      "\n",
+      "\n",
+      "\n",
+      "xChange     Median      IQR         \n",
+      "------------------------------------\n",
+      "            0.4805      0.6457      \n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"{:<12}{:<12}{:<12}\".format(\"Library\", \"Median\", \"IQR\"))\n",
+    "print(\"-\"*36)\n",
+    "for k,vd in experiment_error_statistics.items():\n",
+    "    print(\"{:<12}{:<12.4f}{:<12.4f}\".format(k, vd[\"median\"], 
vd[\"IQR\"]))\n",
+    "    \n",
+    "print(\"\\n\"*2)\n",
+    "print(\"{:<12}{:<12}{:<12}\".format(\"xChange\", \"Median\", \"IQR\"))\n",
+    "print(\"-\"*36)\n",
+    "median_factor = experiment_error_statistics[\"ASF\"][\"median\"] / 
experiment_error_statistics[\"datasketch\"][\"median\"]\n",
+    "iqr_factor = experiment_error_statistics[\"ASF\"][\"IQR\"] / 
experiment_error_statistics[\"datasketch\"][\"IQR\"]\n",
+    "print(\"{:<12}{:<12.4f}{:<12.4f}\".format(\"\", median_factor, 
iqr_factor))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3352f4ed",
+   "metadata": {},
+   "source": [
+    "These tables show that, in this example, the median reported solution 
using ASF DataSketches HLL is about $50\\%$ closer to the true input 
cardinality.  Meanwhile, the interquartile range is about $65\\%$ smaller when 
using `ASF:HLL` rather than `datasketch:HLL`. In other words, the error 
distribution is more tightly concentrated about the true answer."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a615abf",
+   "metadata": {},
+   "source": [
+    "## 3. Other key differences\n",
+    "\n",
+    "There are other crucial differences in each library's API of which users 
should be aware.\n",
+    "\n",
+    "1. The `update()` method for `ASF:HLL` accepts inputs as integers, 
strings, bytes, and floats. On the other hand, `datasketch:HLL` library only 
accepts byte and string type inputs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "60df51ec",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "4.000000029802323"
+      ]
+     },
+     "execution_count": 50,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Datasketches HLL can accept multiple inputs\n",
+    "# These are treated as different items in a single sketch.\n",
+    "asf_hll_types = asf.hll_sketch(14, asf.HLL_8)\n",
+    "asf_hll_types.update(1)\n",
+    "asf_hll_types.update(1.0)\n",
+    "asf_hll_types.update(str(1))\n",
+    "\n",
+    "xx = 1\n",
+    "xx_bytes = xx.to_bytes(64, \"little\")\n",
+    "asf_hll_types.update(xx_bytes)\n",
+    "\n",
+    "asf_hll_types.get_estimate()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 344,
+   "id": "410a3bd4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Exception on integer input\n",
+      "Exception on float input\n",
+      "Accepts bytes input\n",
+      "Accepts string input\n",
+      "2.000122080247517\n"
+     ]
+    }
+   ],
+   "source": [
+    "# datasketch HLL needs bytes\n",
+    "dhll_type = ds.HyperLogLogPlusPlus(14,  hashfunc=lambda x: mmh3.hash64(x, 
signed=False)[0])\n",
+    "try:\n",
+    "    dhll_type.update(1)\n",
+    "except:\n",
+    "    print(\"Exception on integer input\")\n",
+    "    \n",
+    "try:\n",
+    "    dhll_type.update(1.0)\n",
+    "except:\n",
+    "    print(\"Exception on float input\")\n",
+    "    \n",
+    "try:\n",
+    "    dhll_type.update(xx_bytes)\n",
+    "    print(\"Accepts bytes input\")\n",
+    "except:\n",
+    "    print(\"Exception on string input\")\n",
+    "    \n",
+    "try:\n",
+    "    dhll_type.update(str(1))\n",
+    "    print(\"Accepts string input\")\n",
+    "except:\n",
+    "    print(\"Exception on string input\")\n",
+    "    \n",
+    "print(dhll_type.count()) # only two distinct items inserted into the 
sketch."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08c695a5",
+   "metadata": {},
+   "source": [
+    "2. The ASF HLL implementation comes with `get_upper_bound()` and 
`get_lower_bound()` functions.  These enable the user to understand with what 
confidence.  On the other hand, the `datasketch` implementation returns only 
the estimated count.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "64b95f07",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Lower bound (1 std. dev) as % of true value: 99.8909\n",
+      "ASF HyperLogLog estimate as % of true value: 100.5406\n",
+      "Upper bound (1 std. dev) as % of true value: 101.1989\n"
+     ]
+    }
+   ],
+   "source": [
+    "asf_hll_sketch = all_asf_hll[0]\n",
+    "print(f\"Lower bound (1 std. dev) as % of true value: 
{(100*asf_hll_sketch.get_lower_bound(1) / true_count):.4f}\")\n",
+    "print(f\"ASF HyperLogLog estimate as % of true value: 
{(100*asf_hll_sketch.get_estimate() / true_count):.4f}\")\n",
+    "print(f\"Upper bound (1 std. dev) as % of true value: 
{(100*asf_hll_sketch.get_upper_bound(1) / true_count):.4f}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a55fa78",
+   "metadata": {},
+   "source": [
+    "3. In Section 2 that the `ASF:HLL` is substantially faster.  We will 
study this more thoroughly in a later notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "62724771",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/jupyter/comparison-to-datasketch/cardinality_error_experiment.py 
b/jupyter/comparison-to-datasketch/cardinality_error_experiment.py
new file mode 100644
index 0000000..114d49c
--- /dev/null
+++ b/jupyter/comparison-to-datasketch/cardinality_error_experiment.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import sys
+import argparse
+from datetime import datetime
+import pandas as pd
+import numpy as np
+from utils import distinct_number_sequence
+import datasketches as ds
+import datasketch as d
+import mmh3
+import os
+import matplotlib.pyplot as plt
+
+class ErrorCardinalityProfile:
+    """Generates an experiment comparing the error over different 
cardinalities"""
+    def __init__(self, sketch_lgk:int, lg_trials:int, max_lgN:int):
+        self.sketch_lgk = sketch_lgk
+        self.num_trials = 2**lg_trials
+        self.max_lgN = max_lgN
+        self.max_num_distincts = np.uint64(2 ** self.max_lgN)
+        self.directory_name = "hll_accuracy_profile_" + 
datetime.today().strftime('%Y%m%d')
+        if not os.path.exists(self.directory_name):
+            os.mkdir(self.directory_name)
+        self.file_extension = "_" + datetime.today().strftime('%H%M') + 
f"lgK_{self.sketch_lgk}_lgT_{lg_trials}"
+
+        # Need to remove repeated items for the program logic in self.run()
+        self.plot_points = self._generate_plot_points()
+        self.plot_points.extend(self._generate_plot_points())
+        self.plot_points = list(set(self.plot_points))
+        self.plot_points.sort()
+        print(self.plot_points)
+
+        # Initialise the data structures for results
+        self.DataSketches_results_arr = np.zeros((len(self.plot_points), 
self.num_trials), dtype=float)
+        self.datasketch_results_arr = 
np.zeros_like(self.DataSketches_results_arr)
+        self.DataSketches_results_df = pd.DataFrame(index=self.plot_points, 
columns=None)
+        self.datasketch_results_df = pd.DataFrame(index=self.plot_points, 
columns=None)
+        self.data = np.random.randn(len(self.plot_points), self.num_trials)
+        print("Data shape: ", self.data.shape)
+
+    def _generate_plot_points(self) -> list:
+        """
+        Generates the standard sequence defining the input cardinalites for 
the experiment
+        This is just two points at each power of 2
+        """
+        all_plot_points = []
+        for lgk in range(1, self.max_lgN+1):
+            points = np.unique(np.logspace(start=lgk, stop=lgk+1, num=8, 
endpoint=False, base=2, dtype=np.uint64))
+            all_plot_points.extend(points)
+        all_plot_points.sort()
+        return all_plot_points
+
+    def _bad_hll_range(self) -> list:
+        """Generates 16 logspaced points in the n ≈ 2.5k range."""
+        log2_sketch_threshold = np.log2(2.5* (2**self.sketch_lgk))
+        start = np.floor(log2_sketch_threshold).astype(np.uint64)
+        stop = np.ceil(log2_sketch_threshold).astype(np.uint64)+1
+        points = np.logspace(start, stop, num=50, endpoint=False, base=2, 
dtype=np.uint64)[1:]
+        return points
+
+    def _is_power_of_two(self, a):
+        """Bitwise operations to check value a is a power of two"""
+        return (a & (a-1) == 0) and a != 0
+
+    def _results_to_df(self, start_:int, end_:int, arr:np.array, 
df:pd.DataFrame):
+        """Concatenates the array between columns start_,...end_ - 1 to the 
dataframe"""
+        new_df = pd.DataFrame(arr[:, start_:end_], index=df.index, 
columns=np.arange(start_, end_).tolist())
+        print("concatenating: ", new_df)
+        concat_df = pd.concat([df, new_df], axis=1)
+        return concat_df
+
+    def run(self):
+        """Runs the experiment"""
+        seq_start = np.uint64(2345234)
+        distinct_number = np.uint64(3462)
+        previous_log_trial_index = 0
+        ds_all_results = np.zeros((self.num_trials, len(self.plot_points)))
+        d_all_results = np.zeros_like(ds_all_results)
+
+        for trial in range(1, self.num_trials+1):
+            #print(f"Trial = {trial}\t{self._is_power_of_two(trial)}")
+            # Initialise the sketches
+            hll = ds.hll_sketch(self.sketch_lgk, ds.HLL_8)
+            h = d.HyperLogLogPlusPlus(p=self.sketch_lgk, hashfunc=lambda x: 
mmh3.hash64(x, signed=False)[0])
+            plot_point_index = 0  # Return to the start of the plot points 
list to generate the data
+            plot_point_value = self.plot_points[plot_point_index]
+            total_updates = 0
+            seq_start += distinct_number  # Start a new input sequence
+
+            # Temporary result data structure
+            ds_results = np.zeros((len(self.plot_points),))
+            d_results = np.zeros_like(ds_results)
+
+
+            for new_number in distinct_number_sequence(seq_start):
+                hll.update(new_number)
+                h.update(str(new_number).encode('utf8'))
+                total_updates += 1
+                #print(f"Trial: {trial:<5} updates: {total_updates:<5}Index: 
{plot_point_index:<5} Value: {plot_point_value:<5}\n")
+                if total_updates == plot_point_value:
+                    ds_results[plot_point_index] = (hll.get_estimate() - 
plot_point_value) / plot_point_value
+                    d_results[plot_point_index] = (h.count() - 
plot_point_value) / plot_point_value
+                    plot_point_index += 1
+                    #print(f"PPI:{plot_point_index:<3}PPV:{plot_point_value}")
+
+                    if plot_point_index < len(self.plot_points):
+                        plot_point_value = self.plot_points[plot_point_index]
+                    else:
+                        #print("Already at the end")
+                        break
+
+            # After the break statement, control returns here.  Now need to 
decide whether to write or continue.
+            ds_all_results[trial-1, :] = ds_results # subtract 1 as we use 
1-based indexing for the trial count.
+            d_all_results[trial - 1, :] = d_results  # subtract 1 as we use 
1-based indexing for the trial count.
+            if self._is_power_of_two(trial) and trial > 1:
+                # write the array only a logarithmic number of times
+                temporary_ds_results = ds_all_results[0:trial, : ]
+                temporary_d_results = d_all_results[0:trial, :]
+                print(f"#################### PARTIAL RESULTS FOR {trial} 
TRIALS: DATASKETCHES ####################")
+                previous_log_trial_index = trial
+                self.DataSketches_results_df = 
pd.DataFrame(temporary_ds_results.T, index=self.DataSketches_results_df.index, 
columns=np.arange(trial).tolist())
+                self.DataSketches_results_df.to_csv(
+                    self.directory_name + "/DataSketches_hll" + 
self.file_extension + f"trials_{trial}.csv",
+                    index_label="n")
+                self.datasketch_results_df = 
pd.DataFrame(temporary_d_results.T,
+                                                            
index=self.datasketch_results_df.index,
+                                                            
columns=np.arange(trial).tolist())
+                self.datasketch_results_df.to_csv(
+                    self.directory_name + "/datasketch_hll" + 
self.file_extension + f"trials_{trial}.csv",
+                    index_label="n"
+                )
+                print(self.DataSketches_results_df)
+
+                print(f"#################### PARTIAL RESULTS FOR {trial} 
TRIALS: datasketch ####################")
+                print(self.datasketch_results_df)
+
+
+def main(argv):
+    assert len(argv) == 3
+    SKETCH_LGK = argv['lgk']
+    LG_TRIALS = argv['lgt']
+    MAX_LG_N = argv['lgN']  # FOR SETTING NUMBER OF DISTINCTS
+    experiment = ErrorCardinalityProfile(SKETCH_LGK, LG_TRIALS, MAX_LG_N)
+    experiment.run()
+
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Error-cardinality profile of 
HyperLogLog')
+    parser.add_argument('-lgk', '--lgk', help='Log2(k) value for the number of 
buckets in the sketch.',type=int, required=True)
+    parser.add_argument('-lgt', '--lgt', help='Log2(T) value for number of 
trials t.',type=int, required=True)
+    parser.add_argument('-lgN', '--lgN', help='Largest permissible log2(.) 
value for input size.',type=int, required=True)
+    args = vars(parser.parse_args())
+    main(args)
diff --git a/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb 
b/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb
new file mode 100644
index 0000000..b752c7f
--- /dev/null
+++ b/jupyter/comparison-to-datasketch/single-sketch-accuracy.ipynb
@@ -0,0 +1,762 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b0bac2b9-77d1-435d-bd66-93c986ccddb6",
+   "metadata": {},
+   "source": [
+    "# Single Sketch Error: ASF DataSketches vs datasketch\n",
+    "\n",
+    "We compare implementations of the HyperLogLog algorithm from the [Apache 
Software Foundation 
DataSketches](https://github.com/apache/datasketches-cpp/tree/master/python) 
library and the open-source python 
[datasketch](https://github.com/ekzhu/datasketch).  Both libraries present 
python implementations or bindings for common \"sketching\" algorithms.\n",
+    "In the writing we abbreviate the two implementations as `ASF:HLL` for the 
ASF DataSketches \n",
+    "and `datasketch:HLL` for the other implementation.\n",
+    "\n",
+    "This notebook needs easily available libraries that are available on 
PyPi."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "66bf431c-04bd-412b-8930-3741560ba0a7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "import numpy as np\n",
+    "import datasketches as asf\n",
+    "import datasketch as ds\n",
+    "import mmh3"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "id": "3d43ab6d-1774-4a37-a91b-817fb498c5a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Plotting parameters\n",
+    "method_plot_params = {\n",
+    "    \"asf\" : {\"color\": \"C0\", \"marker\": '.'},\n",
+    "    \"datasketch\" : {\"color\": \"C1\", \"marker\": \"^\"}\n",
+    "}\n",
+    "asf_color = method_plot_params[\"asf\"][\"color\"]\n",
+    "ds__color = method_plot_params[\"datasketch\"][\"color\"]\n",
+    "q90_ls = \"--\"\n",
+    "\n",
+    "params = {'legend.fontsize': 'x-large',\n",
+    "     'axes.labelsize': 'x-large',\n",
+    "     'axes.titlesize':'x-large',\n",
+    "     'xtick.labelsize':'x-large',\n",
+    "     'ytick.labelsize':'x-large',\n",
+    "      \"lines.linewidth\": 2.5}\n",
+    "plt.rcParams.update(params)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1f86685-363c-44f9-888a-6f3d4a6459cf",
+   "metadata": {},
+   "source": [
+    "### <a name=\"error_vs_cardinality\"></a>1. Error vs Cardinality \n",
+    "We study the error behaviour as the input cardinality increases.  We 
provide an initial comparison to highlight the differences in the 
estimators.\n",
+    "These plots are crucial to understanding error distributions of 
sketches.\n",
+    "\n",
+    "To generate the synthetic data, we use \"Fibonacci Hashing\" as a cheap 
way to generate a pseudorandom sequence.  This process starts with an initial 
selection of a $64$-bit integer.  Then, for every new item that must be 
generated, we add the full $64$ bit range scaled by the integer golden ratio so 
that every other update _intentionally_ overflows and maps once more back into 
the $64$ bit range.\n",
+    "\n",
+    "#### 1a. Single sketch estimation\n",
+    "\n",
+    "From the same directory as this notebook, run\n",
+    "\n",
+    "```./cardinality_error_experiment.py -lgk 14 -lgt 7  -lgN 20``` \n",
+    "\n",
+    "which generates $8$ \"plot points\" between every power of $2$ not 
exceeding $N = 2^{21}$.  We fix `-lgt 7` for $128$ trials and use a sketch size 
of with $2^{14}$.  For every trial, a fresh sketch is initialised.  The results 
are saved in a directory `hll_accuracy_profile_yyymmdd/...`\n",
+    "\n",
+    "The two sketches we compare are HyperLogLog sketches with $2^{14}$ 
buckets:\n",
+    "\n",
+    "*ASF DataSketches Sketch Setup*\n",
+    "- `hll = asf.hll_sketch(14)`\n",
+    "\n",
+    "*datasketch sketch setup*\n",
+    "- `h = ds.HyperLogLogPlusPlus(p=14, hashfunc=lambda x: mmh3.hash64(x, 
signed=False)[0])`\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "433f73f7-7844-448c-b7fd-16122a3a8b4c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "asf_errors = 
pd.read_csv(\"hll_accuracy_profile_20230504/DataSketches_hll_1516lgK_14_lgT_8trials_256.csv\",
 index_col=0)\n",
+    "ds__errors = 
pd.read_csv(\"hll_accuracy_profile_20230504/datasketch_hll_1516lgK_14_lgT_8trials_256.csv\",
 index_col=0)\n",
+    "\n",
+    "lgk = 14"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 41,
+   "id": "6e8fb853-1e79-4d2a-93a2-5e98c9e3c1b3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Text(0.5, 0, 'Input cardinality $n$')"
+      ]
+     },
+     "execution_count": 41,
+     "metadata": {},
+     "output_type": "execute_result"
+    },
+    {
+     "data": {
+      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAABCIAAAMKCAYAAABQieWGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/SrBM8AAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3gUVdvA4d+m90aHBEJJ6L33XpUioAiIIF2KYvmsrwIqtlfQABYQCUVfRIr03iF0kV4CIRBqAgkphPTM98ewLbubRkjjua8rF5mzZ2bObGaHnWfOeY5GURQFIYQQQgghhBBCiHxgVdANEEIIIYQQQgghxLNDAhFCCCGEEEIIIYTINxKIEEIIIYQQQgghRL6RQIQQQgghhBBCCCHyjQQihBBCCCGEEEIIkW8kECGEEEIIIYQQQoh8I4EIIYQQQgghhBBC5Bubgm6AEMJUSkoKaWlpBd0MIYQQQggh
 [...]
+      "text/plain": [
+       "<Figure size 1200x800 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "fig, ax = plt.subplots(figsize=(12,8))\n",
+    "\n",
+    "methods = [\"ASF\", \"datasketch\"]\n",
+    "\n",
+    "for i, (method, colour, df) in enumerate(zip(methods, [asf_color, 
ds__color], [asf_errors, ds__errors])):\n",
+    "    xn = df.index \n",
+    "    median = df.median(axis=1)\n",
+    "    q95 = df.quantile(q=0.977725, axis=1)\n",
+    "    q05 = df.quantile(q=0.022275, axis=1) # df.mean(axis=1) - 
df.std(axis=1)\n",
+    "    ax.plot(xn, median,\n",
+    "           color=colour, label=method+\": median\")\n",
+    "    ax.plot(xn, q95,\n",
+    "           color=colour, linestyle=q90_ls)\n",
+    "    ax.plot(xn, q05,\n",
+    "           color=colour, linestyle=q90_ls, label=method+\": 90% CI\")\n",
+    "\n",
+    "ax.plot(xn, 2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n",
+    "        color=\"red\", lw=3., linestyle=\":\", label=\"q = 0.97725\")\n",
+    "ax.plot(xn, -2.04/np.sqrt(1<<lgk)*np.ones_like(xn), \n",
+    "        color=\"red\", lw=3., linestyle=\"-.\",  label=\"q=0.2275\")\n",
+    "\n",
+    "ax.set_xscale('log', base=10)\n",
+    "ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),\n",
+    "          ncol=3, fancybox=True)\n",
+    "ax.grid()\n",
+    "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n",
+    "ax.set_xlabel(r\"Input cardinality $n$\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c253d24a-9a21-4471-ae25-5c494a1d4dd6",
+   "metadata": {},
+   "source": [
+    "**Experiment Summary**\n",
+    "\n",
+    "The plotted red lines represent $\\pm 2 \\sigma$ where, for HLL, the 
standard error of the estimator is $\\sigma = 1.04 / \\sqrt{k}$. \n",
+    "Hence, the area between the red lines is a confidence interval.  We have 
chosen the quantiles \n",
+    "$q = 0.022275, 0.977725$ so that the area between the red lines is 
approximately a $95\\%$ confidence interval.\n",
+    "This test uses a number of buckets $k = 2^{14}$.  Changing the number of 
buckets in the sketch alters the standard error so the red lines, and the 
behaviour of the error curves at quantile level $q$, would change 
appropriately.\n",
+    "\n",
+    "\n",
+    "Both implementations have a median that is centered about an error of 
$0$, suggesting that they return unbiased estimators of the true cardinality.  
At small cardinalities, `asf.hll_sketch` is vastly better than the 
`datasketch.HyperLogLogPlusPlus`.  This is because it transitions through 
various estimators to ensure small error.  All of the curves for `ASF:HLL` have 
zero error until about $n = 10^3$ is in sparse mode and an exact estimator can 
be used.  Without this behavior and re [...]
+    "\n",
+    "At large cardinalities, _for a single sketch_, `asf.hll_sketch` uses the 
HIP estimator which mildly improves on the standard large cardinality 
estimator.  However, when we need to merge sketches later on, this behaviour 
will revert to the standard HLL estimator.\n",
+    "\n",
+    "### 2.  Single sketch estimation: Real data\n",
+    "Now let's see what happens when we want sketch some real data.  \n",
+    "We will download an opensource dataset from the [UCI Machine Learning 
Repository](https://archive.ics.uci.edu/ml/datasets/BitcoinHeistRansomwareAddressDataset#).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "id": "1131e64e-cd98-4524-a33a-fd97eea1816d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  
Current\n",
+      "                                 Dload  Upload   Total   Spent    Left  
Speed\n",
+      "100  110M    0  110M    0     0  1150k      0 --:--:--  0:01:38 
--:--:-- 1231k8k      0 --:--:--  0:00:23 --:--:-- 1164k 0     0   913k      0 
--:--:--  0:00:30 --:--:-- 1065k 1193k      0 --:--:--  0:01:09 --:--:-- 
1544k\n"
+     ]
+    }
+   ],
+   "source": [
+    "!curl -o bitcoin.zip 
\"https://archive.ics.uci.edu/ml/machine-learning-databases/00526/data.zip\"";
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "id": "f2ee8d0e-cad9-4d61-8ff1-4e0a7db66aae",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Archive:  bitcoin.zip\n",
+      "  inflating: BitcoinHeistData.csv    \n"
+     ]
+    }
+   ],
+   "source": [
+    "!unzip bitcoin.zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "033e8d4b-4ac9-4260-a74c-6aaafd4b0552",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>address</th>\n",
+       "      <th>year</th>\n",
+       "      <th>day</th>\n",
+       "      <th>length</th>\n",
+       "      <th>weight</th>\n",
+       "      <th>count</th>\n",
+       "      <th>looped</th>\n",
+       "      <th>neighbors</th>\n",
+       "      <th>income</th>\n",
+       "      <th>label</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>111K8kZAEnJg245r2cM6y9zgJGHZtJPy6</td>\n",
+       "      <td>2017</td>\n",
+       "      <td>11</td>\n",
+       "      <td>18</td>\n",
+       "      <td>0.008333</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>100050000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>1123pJv8jzeFQaCV4w644pzQJzVWay2zcA</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>132</td>\n",
+       "      <td>44</td>\n",
+       "      <td>0.000244</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>100000000.0</td>\n",
+       "      <td>princetonLocky</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>112536im7hy6wtKbpH1qYDWtTyMRAcA2p7</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>246</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1.000000</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>200000000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>322</td>\n",
+       "      <td>72</td>\n",
+       "      <td>0.003906</td>\n",
+       "      <td>1</td>\n",
+       "      <td>0</td>\n",
+       "      <td>2</td>\n",
+       "      <td>71200000.0</td>\n",
+       "      <td>princetonCerber</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>1129TSjKtx65E35GiUo4AYVeyo48twbrGX</td>\n",
+       "      <td>2016</td>\n",
+       "      <td>238</td>\n",
+       "      <td>144</td>\n",
+       "      <td>0.072848</td>\n",
+       "      <td>456</td>\n",
+       "      <td>0</td>\n",
+       "      <td>1</td>\n",
+       "      <td>200000000.0</td>\n",
+       "      <td>princetonLocky</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                              address  year  day  length    weight  
count  \\\n",
+       "0   111K8kZAEnJg245r2cM6y9zgJGHZtJPy6  2017   11      18  0.008333     
 1   \n",
+       "1  1123pJv8jzeFQaCV4w644pzQJzVWay2zcA  2016  132      44  0.000244     
 1   \n",
+       "2  112536im7hy6wtKbpH1qYDWtTyMRAcA2p7  2016  246       0  1.000000     
 1   \n",
+       "3  1126eDRw2wqSkWosjTCre8cjjQW8sSeWH7  2016  322      72  0.003906     
 1   \n",
+       "4  1129TSjKtx65E35GiUo4AYVeyo48twbrGX  2016  238     144  0.072848    
456   \n",
+       "\n",
+       "   looped  neighbors       income            label  \n",
+       "0       0          2  100050000.0  princetonCerber  \n",
+       "1       0          1  100000000.0   princetonLocky  \n",
+       "2       0          2  200000000.0  princetonCerber  \n",
+       "3       0          2   71200000.0  princetonCerber  \n",
+       "4       0          1  200000000.0   princetonLocky  "
+      ]
+     },
+     "execution_count": 44,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bitcoin_df = pd.read_csv(\"BitcoinHeistData.csv\", header=0)\n",
+    "bitcoin_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba7fc55f-7214-4b20-a6c7-822207bb5922",
+   "metadata": {},
+   "source": [
+    "Let's focus on the simple task of counting how many unique addresses are 
present in the dataset.\n",
+    "With native pandas functionality, we see that there are about $2.6$ 
million unique addresses.\n",
+    "We will use HLL sketches to estimate this count."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "52a71365-d54d-4295-b714-4ce6201be6c6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 811 ms, sys: 33.8 ms, total: 845 ms\n",
+      "Wall time: 846 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "true_count = bitcoin_df[\"address\"].nunique()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "07041390-365d-4dff-91cd-7f20bc50273c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "There are 2631095 unique addresses\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"There are {true_count} unique addresses\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8cded8f4-1d0c-4dcc-b4cd-73015394a925",
+   "metadata": {},
+   "source": [
+    "Now define equivalent sketches from both libraries.\n",
+    "We use $2^{14}$ buckets and $8$-bit HyperLogLog sketches for each 
implementation.  The `datsketch:HLL` uses the MurmurHash library so that we 
have equivalent sketches for comparison. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "2bd9ab89-a705-4040-b14a-014a18fff9bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "asf_hll = asf.hll_sketch(14, asf.HLL_8)\n",
+    "ds_hll = ds.HyperLogLogPlusPlus(p=14,  hashfunc=lambda x: mmh3.hash64(x, 
signed=False)[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "356e152b-47f1-48d9-ae2f-bc5bd17ad244",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 2.08 s, sys: 8.87 ms, total: 2.08 s\n",
+      "Wall time: 2.09 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for ad in bitcoin_df[\"address\"]:\n",
+    "    asf_hll.update(ad)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "id": "3540529f-e8ed-4f20-98c3-1f5dabf8ef12",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 3.65 s, sys: 10.4 ms, total: 3.66 s\n",
+      "Wall time: 3.66 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for ad in bitcoin_df[\"address\"]:\n",
+    "    ds_hll.update(ad)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2d3e4f3-b878-4945-a3dd-5fe513a86a6a",
+   "metadata": {},
+   "source": [
+    "On this simple example we see that the datasketches implementation takes 
about $2$ seconds compared to almost $4$ for `datasketch`.  Note that these 
times are longer than for the native Pandas call to nunique; this is not a 
problem because, unlike `pd.nunique(.)` the sketches are designed for large 
datasets not entirely held in memory.  The figures quoted are on a $2023$ Apple 
Macbook Pro with Apple M1 Pro and absolute numbers may differ slightly.\n",
+    "\n",
+    "For estimation accuracy, we have the following behaviour for the single 
sketches.  Since we have the true count, we can also evaluate the error."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "id": "d529da07-3752-4685-8203-7dce6dbfa892",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "ASF estimate: 2650083.4660\n",
+      "ASF error: 0.7217  %\n"
+     ]
+    }
+   ],
+   "source": [
+    "# DataSketches\n",
+    "print(f\"ASF estimate: {asf_hll.get_estimate():.4f}\")\n",
+    "print(f\"ASF error: 
{100*(asf_hll.get_estimate()-true_count)/true_count:.4f}  %\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 51,
+   "id": "71531fc9-8191-4f0d-83a3-e2ccee38d56c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "datasketch estimate: 2646133.7361\n",
+      "datasketch error: 0.5716  %\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Datasketch\n",
+    "print(f\"datasketch estimate: {ds_hll.count():.4f}\")\n",
+    "print(f\"datasketch error: {100*(ds_hll.count() - 
true_count)/true_count:.4f}  %\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c2b08dd1-00c1-4f15-93bb-51ba60f23a69",
+   "metadata": {},
+   "source": [
+    "On this example, the datasketch implementation has a lower error than the 
ASF method.  However, this was a single sketch so we cannot draw any strong 
conclusions.  Rather, we would have to study the error distribution as 
previously done in [Section 1](error_vs_cardinality).  \n",
+    "\n",
+    "We run $25$ independent trials of each algorithm, each trial with a fresh 
sketch.\n",
+    "Since HLL is deterministic given the hash seed, with no change to the 
input we would obtain the same output every time.  To avoid this, we prepend 
the trial number to every incoming string so that the number of unique items 
remains the same but the streams are superficially different."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 52,
+   "id": "1c5a7c42-cdf7-4632-a96d-c23a7bccc95e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lgk = 14\n",
+    "num_trials = 25\n",
+    "\n",
+    "all_asf_hll = [asf.hll_sketch(14, asf.HLL_8) for _ in 
range(num_trials)]\n",
+    "all_ds_hll = [ds.HyperLogLogPlusPlus(p=14,  hashfunc=lambda x: 
mmh3.hash64(x, signed=False)[0]) for _ in range(num_trials)]\n",
+    "\n",
+    "asf_hll_estimates = np.zeros((num_trials,), dtype=float)\n",
+    "asf_hll_errors = np.zeros_like(asf_hll_estimates)\n",
+    "ds__hll_estimates = np.zeros_like(asf_hll_estimates)\n",
+    "ds__hll_errors = np.zeros_like(asf_hll_estimates)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "id": "e51ee247-6f96-4418-a732-3f2f7d77dc87",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 57.9 s, sys: 131 ms, total: 58 s\n",
+      "Wall time: 58.2 s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for trial in range(num_trials):\n",
+    "    for ad in bitcoin_df[\"address\"]:\n",
+    "        all_asf_hll[trial].update(str(trial) + ad)\n",
+    "    asf_hll_estimates[trial] = all_asf_hll[trial].get_estimate()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 54,
+   "id": "768248dc-43d9-41d6-a379-7c9a7b60eb6d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 1min 41s, sys: 266 ms, total: 1min 41s\n",
+      "Wall time: 1min 41s\n"
+     ]
+    }
+   ],
+   "source": [
+    "%%time\n",
+    "for trial in range(num_trials):\n",
+    "    for i, ad in enumerate(bitcoin_df[\"address\"]):\n",
+    "        all_ds_hll[trial].update(str(trial) + ad)\n",
+    "    ds__hll_estimates[trial] = all_ds_hll[trial].count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1303e34-cc8f-4975-ade1-4f68b0e006d7",
+   "metadata": {},
+   "source": [
+    "The ASF HLL runs in about half of the time as the datasketch 
implementation.\n",
+    "However, we are also interested in the distribution of errors for each 
sketch implementation.\n",
+    "Since we have fewer trials than in Section 1, we plot a box and whisker 
diagram which is still useful in understanding the error distribution, despite 
being less informative about the full error distribution than the pitchfork 
plots from Section 1.  The plot can be interpreted as a cross-section of the 
pitchfork plot at the vertical line $n = 2631095$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 55,
+   "id": "33a089c7-f8b5-4320-a58e-17f6e590b1a8",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAnwAAAGiCAYAAAB9MRuUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjguMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/SrBM8AAAACXBIWXMAAA9hAAAPYQGoP6dpAABzy0lEQVR4nO3deVxU1f8/8NewLwqCaIIgCIpSmvsGiYofcCs1IbWPWmqWWq5UP7VFxFyqT2lamqWFpn7LRAP8aAQfVFBQBJe0EsdYVERzQWeQdYD7++PKyDggAwzLzLyejwePmTnn3HPeMI68OffecySCIAggIiIiIr1l1NQBEBEREVHDYsJHREREpOeY8BERERHpOSZ8RERERHqOCR8RERGRnmPCR0RERKTnmPARERER6TkmfERERER6zqSpA6Dmoby8HDk5OWjZsiUkEklTh0NEREQaEAQBeXl5cHJygpFR9fN4
 [...]
+      "text/plain": [
+       "<Figure size 640x480 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "asf_hll_errors = (asf_hll_estimates - true_count) / true_count\n",
+    "datasketch_hll_errors = (ds__hll_estimates - true_count) / true_count\n",
+    "\n",
+    "for arr in [asf_hll_errors, datasketch_hll_errors]:\n",
+    "    arr.sort()\n",
+    "\n",
+    "fig, ax = plt.subplots()\n",
+    "error_data = {\"ASF\": asf_hll_errors, \"datasketch\": 
datasketch_hll_errors}\n",
+    "\n",
+    "#\n",
+    "box_plot = ax.boxplot(list(error_data.values()), vert=True)\n",
+    "for median in box_plot['medians']:\n",
+    "    median.set_color('black')\n",
+    "    \n",
+    "\n",
+    "\n",
+    "ax.scatter(np.ones_like(asf_hll_errors), asf_hll_errors,  
marker=\".\")\n",
+    "ax.scatter(2*np.ones_like(datasketch_hll_errors), datasketch_hll_errors, 
marker=\"x\")\n",
+    "ax.set_xticks([1,2], list(error_data.keys()))\n",
+    "ax.set_ylabel(r\"Relative Error: $ \\frac{n - \\hat{n}}{n}$\")\n",
+    "ax.grid()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6611561e-75ce-4a2e-a6a1-00b8cdef81f5",
+   "metadata": {},
+   "source": [
+    "The error statistics from the experiment are as follows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 62,
+   "id": "350a4171-0505-48e1-ac2c-98afa05387b6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "experiment_error_statistics = {\n",
+    "    \"ASF\": {\"median\" : None, \"IQR\" : None}, \n",
+    "    \"datasketch\": {\"median\" : None, \"IQR\" : None}\n",
+    "}\n",
+    "key_list = list(experiment_error_statistics.keys())\n",
+    "\n",
+    "for i, m in enumerate(box_plot[\"medians\"]):\n",
+    "    method_median = m.get_data()[1][0]\n",
+    "    experiment_error_statistics[key_list[i]][\"median\"] = 
method_median\n",
+    "\n",
+    "for i, line in enumerate(box_plot[\"boxes\"]):\n",
+    "    liney = line.get_ydata()\n",
+    "    iqr = liney.max() - liney.min()\n",
+    "    experiment_error_statistics[key_list[i]][\"IQR\"] = iqr"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 61,
+   "id": "09f7c6eb-ff32-4679-ac0e-2d81bcd95cf1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'ASF': {'median': 0.0015215127060363715, 'IQR': 
0.009980189698406661},\n",
+       " 'datasketch': {'median': 0.0025901816561035075, 'IQR': 
0.015456189289630978}}"
+      ]
+     },
+     "execution_count": 61,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "experiment_error_statistics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 70,
+   "id": "c3cf31cb-9769-4e2b-9645-10a502ad5d6b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Library       Median Error           IQR\n",
+      "----------------------------------------\n",
+      "ASF                 0.0015        0.0100\n",
+      "datasketch          0.0026        0.0155\n",
+      "\n",
+      "\n",
+      "\n",
+      "ErrorxChange  Median Error           IQR\n",
+      "----------------------------------------\n",
+      "                    0.5874        0.6457\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"{:<12}{:>14}{:>14}\".format(\"Library\", \"Median Error\", 
\"IQR\"))\n",
+    "print(\"-\"*40)\n",
+    "for k,vd in experiment_error_statistics.items():\n",
+    "    print(\"{:<12}{:>14.4f}{:>14.4f}\".format(k, vd[\"median\"], 
vd[\"IQR\"]))\n",
+    "    \n",
+    "print(\"\\n\"*2)\n",
+    "print(\"{:<12}{:>14}{:>14}\".format(\"ErrorxChange\", \"Median Error\", 
\"IQR\"))\n",
+    "print(\"-\"*40)\n",
+    "median_factor = experiment_error_statistics[\"ASF\"][\"median\"] / 
experiment_error_statistics[\"datasketch\"][\"median\"]\n",
+    "iqr_factor = experiment_error_statistics[\"ASF\"][\"IQR\"] / 
experiment_error_statistics[\"datasketch\"][\"IQR\"]\n",
+    "print(\"{:<12}{:>14.4f}{:>14.4f}\".format(\"\", median_factor, 
iqr_factor))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbb58e63-cbba-43ac-a015-05b9471b6212",
+   "metadata": {},
+   "source": [
+    "The columns in the second table above are found by taking the ratio of 
the smaller ASF results and the larger datasketch results.  \n",
+    "They indicate that the median error in using the ASF library is about 
$58\\%$ of that incurred when using the datasketch library and the 
interquartile range is tighter, being about $65\\%$ as large as that from the 
`datasketch` implementation."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/jupyter/comparison-to-datasketch/utils.py 
b/jupyter/comparison-to-datasketch/utils.py
new file mode 100644
index 0000000..d69e892
--- /dev/null
+++ b/jupyter/comparison-to-datasketch/utils.py
@@ -0,0 +1,37 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import numpy as np
+
+def distinct_number_sequence(start:np.uint64=0) -> np.uint64:
+    """Generator function to make 64-bit numbers that are distinct."""
+    assert isinstance(start, np.uint64)
+    num = start
+    golden_ratio = np.uint64(0x9e3779b97f4a7c13)
+    while True:
+        yield num
+        num += np.uint64(golden_ratio)
+
+if __name__ == '__main__':
+    start = np.uint64(2345234635)
+    ndistincts = np.uint64(10)
+    count = 0
+    for i in distinct_number_sequence(start):
+        print(i, end="\n")
+        count += 1
+        if count == ndistincts:
+            break


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(datasketches-python) 01/01: Initial error comparison

Reply via email to