techdocsmith commented on code in PR #14523:
URL: https://github.com/apache/druid/pull/14523#discussion_r1256607802
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
Review Comment:
Why is this bold?
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
Review Comment:
Don't we want to just give them an API call to load this in the tutorial, as
part of the setup? That would also give us the chance to make the table name a
little shorter, easier to read, and less error prone.
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
Review Comment:
I don't think we need the INFORMATION_SCHEMA listing here.
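If the point of that last line is only to confirm connectivity, a lighter ending for the setup cell could hit the Router's `/status/health` endpoint, which returns `true` when the service is up. Sketch only; the actual request is left commented out:

```python
def health_url(druid_host: str) -> str:
    """Build the URL for Druid's /status/health endpoint."""
    return druid_host.rstrip("/") + "/status/health"

# Against a live Router, e.g. with the `requests` package:
# requests.get(health_url("http://router:8888")).json()  # -> True
print(health_url("http://router:8888"))
```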
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
+ "\n",
+ "Each data server computes its own intermediate results for merging into
the final result set. In a `COUNT(DISTINCT)` query, that means that they will
create their own count inside a representation called a [data
sketch](https://datasketches.apache.org/). These much smaller objects are then
merged together when the query results are finalized, rather than Druid having
to combine the individual lists of distinct values from each process.\n",
+ "\n",
+ "This translates into much faster query execution, especially when the
intermediate results are large – say when there are a lot of unique values in
the source data.\n",
+ "\n",
+ "It also means that the most scalable part of Druid – the individual data
servers – do much more of the work, and do so earlier, instead of leaving it to
the merge stage.\n",
+ "\n",
+ "> Approximations improve scalability, storage, and memory use - at the
cost of some error.\n",
+ "> \n",
+ "> _[Gian Merlino](https://github.com/gianm)_\n",
+ "\n",
+ "Let's run this query with all of Druid's defaults to see what the results
are like. (We can safely omit a `__time` filter thanks to the tiny size of the
example dataset.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "df = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "df.plot.bar(x='Reporting_Airline', y='Events')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f37d854-efd9-401a-8726-9949bff0c012",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) without approximation\n",
+ "\n",
+ "We can supply a query context parameter, `useApproximateCountDistinct`,
to force Druid to not use approximation. We won't get the speed boost afforded
by the sketching approach – but that's OK because the example dataset is so
small! It would be a different story if `Tail_Number` had high cardinality -
like if it was IP Addresses or User Identifiers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "652988ac-c256-46d4-a4ea-dbcf0e023991",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateCountDistinct\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "df = pd.DataFrame(resp.rows)\n",
+ "df.plot.bar(x='Reporting_Airline', y='Events')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08c91329-8d05-46eb-8c19-5eaf9043dcb6",
+ "metadata": {},
+ "source": [
+ "### Comparing approximate and non-approximate results\n",
+ "\n",
+ "On the surface, these do not _look_ different. And, in a lot of user
interfaces, that's perfectly fine!\n",
+ "\n",
+ "But let's go a bit deeper and see how the results actually differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c05a031f-a805-45dd-935b-d8af808041a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateCountDistinct\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+ "df2 = pd.DataFrame(resp.rows)\n",
+ "\n",
+ "df3 = df1.compare(df2, keep_equal=True)\n",
+ "df3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8f3320d-d4ec-460a-b1fc-59c98f85cc3a",
+ "metadata": {},
+ "source": [
+ "There are _value_ errors, as you might expect with approximation. This
therefore affects _ordering_ of results.\n",
+ "\n",
+ "_Error in sketch-based approximation is probabilistic, rather than
guaranteed. That's to say that a certain percentage of the time you can expect
the measurements you take to be within a certain distance of the true value.
Also, their size is not dependent on the data – the default size of a sketch in
Druid is just over 2000 bytes._\n",
Review Comment:
Why is this in italics?
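Formatting aside, the probabilistic error and fixed size are worth illustrating. This is a toy estimator, not what Druid ships (Druid uses HLL and Theta sketches from Apache DataSketches): a minimal k-minimum-values (KMV) sketch showing how a small, fixed-size summary yields an approximate distinct count and merges cheaply:

```python
import bisect
import hashlib

class KMVSketch:
    """K-minimum-values sketch: keep only the k smallest hash values seen.
    Estimate the distinct count as (k - 1) / (k-th smallest hash). The
    summary is bounded by k entries no matter how much data flows through
    it, and two sketches merge by taking the k smallest of their union."""

    def __init__(self, k=1024):
        self.k = k
        self.mins = []  # sorted, at most k distinct hash values in [0, 1)

    def _hash(self, item):
        digest = hashlib.md5(str(item).encode()).digest()
        return int.from_bytes(digest, "big") / 2**128  # ~uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if (len(self.mins) < self.k or h < self.mins[-1]) and h not in self.mins:
            bisect.insort(self.mins, h)
            del self.mins[self.k:]  # drop anything beyond the k smallest

    def merge(self, other):
        merged = KMVSketch(self.k)
        merged.mins = sorted(set(self.mins) | set(other.mins))[:self.k]
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct values: exact
        return (self.k - 1) / self.mins[-1]

# Two "data servers" each sketch their own (overlapping) slice of tail numbers...
s1, s2 = KMVSketch(), KMVSketch()
for i in range(6_000):
    s1.add(f"tail-{i}")
for i in range(4_000, 10_000):
    s2.add(f"tail-{i}")
# ...and only the small summaries are merged, not the full value lists.
print(round(s1.merge(s2).estimate()))  # near the true count of 10,000
```

The relative standard error of a KMV estimate is roughly 1/√k, about 3% at k = 1024, which is the "probabilistic, rather than guaranteed" behavior the quoted paragraph describes.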
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
Review Comment:
```suggestion
    "Druid automatically looks for query patterns that benefit from
approximation. In this instance, Druid identifies a match for approximate
`COUNT(DISTINCT)`.\n",
```
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
Review Comment:
We're actually hitting `router:8888`; see line 47. If they are running Druid
and Jupyter together via Docker Compose, this is right. If Druid runs on the
host machine while the notebook runs in Docker, it's a different hostname:
`host.docker.internal`.
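To make the choice explicit, the setup cell could branch on where the notebook runs. Sketch only; the `DRUID_SETUP` environment variable is hypothetical, not something the tutorial image defines:

```python
import os

def pick_druid_host(env=os.environ):
    """Choose the Router URL based on where this notebook runs
    relative to Druid (illustrative convention, not part of the notebook)."""
    if env.get("DRUID_SETUP") == "compose":
        # Jupyter and Druid share the tutorial's Docker Compose network
        return "http://router:8888"
    if env.get("DRUID_SETUP") == "docker-host":
        # Druid runs on the host machine; the notebook runs in a container
        return "http://host.docker.internal:8888"
    # Both run directly on the host machine
    return "http://localhost:8888"

druid_host = pick_druid_host()
```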
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
+ "\n",
+ "Each data server computes its own intermediate results for merging into
the final result set. In a `COUNT(DISTINCT)` query, that means that they will
create their own count inside a representation called a [data
sketch](https://datasketches.apache.org/). These much smaller objects are then
merged together when the query results are finalized, rather than Druid having
to combine the individual lists of distinct values from each process.\n",
+ "\n",
+ "This translates into much faster query execution, especially when the
intermediate results are large – say when there are a lot of unique values in
the source data.\n",
+ "\n",
+ "It also means that the most scalable part of Druid – the individual data
servers – do much more of the work, and do so earlier, instead of leaving it to
the merge stage.\n",
+ "\n",
+ "> Approximations improve scalability, storage, and memory use - at the
cost of some error.\n",
+ "> \n",
+ "> _[Gian Merlino](https://github.com/gianm)_\n",
+ "\n",
+ "Let's run this query with all of Druid's defaults to see what the results
are like. (We can safely omit a `__time` filter thanks to the tiny size of the
example dataset.)"
Review Comment:
We tend to avoid first person in favor of second person. Maybe "Try running
this query...". Also avoid the parentheses.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]