techdocsmith commented on code in PR #14523:
URL: https://github.com/apache/druid/pull/14523#discussion_r1256607802
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
Review Comment:
Why is this bold?
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
Review Comment:
Don't we want to just give them an API call to load this in the tutorial, as
part of the setup? That would also give us the chance to make the table name a
little shorter, easier to read, and less error prone.
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
Review Comment:
I don't think we need the INFORMATION_SCHEMA listing here.
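If the point of that last line is only to confirm connectivity, a lighter ending for the setup cell could hit the Router's `/status/health` endpoint, which returns `true` when the service is up. Sketch only; the actual request is left commented out:

```python
def health_url(druid_host: str) -> str:
    """Build the URL for Druid's /status/health endpoint."""
    return druid_host.rstrip("/") + "/status/health"

# Against a live Router, e.g. with the `requests` package:
# requests.get(health_url("http://router:8888")).json()  # -> True
print(health_url("http://router:8888"))
```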
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
+ "\n",
+ "Each data server computes its own intermediate results for merging into
the final result set. In a `COUNT(DISTINCT)` query, that means that they will
create their own count inside a representation called a [data
sketch](https://datasketches.apache.org/). These much smaller objects are then
merged together when the query results are finalized, rather than Druid having
to combine the individual lists of distinct values from each process.\n",
+ "\n",
+ "This translates into much faster query execution, especially when the
intermediate results are large – say when there are a lot of unique values in
the source data.\n",
+ "\n",
+ "It also means that the most scalable part of Druid – the individual data
servers – do much more of the work, and do so earlier, instead of leaving it to
the merge stage.\n",
+ "\n",
+ "> Approximations improve scalability, storage, and memory use - at the
cost of some error.\n",
+ "> \n",
+ "> _[Gian Merlino](https://github.com/gianm)_\n",
+ "\n",
+ "Let's run this query with all of Druid's defaults to see what the results
are like. (We can safely omit a `__time` filter thanks to the tiny size of the
example dataset.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "df = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "df.plot.bar(x='Reporting_Airline', y='Events')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f37d854-efd9-401a-8726-9949bff0c012",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) without approximation\n",
+ "\n",
+ "We can supply a query context parameter, `useApproximateCountDistinct`,
to force Druid to not use approximation. We won't get the speed boost afforded
by the sketching approach – but that's OK because the example dataset is so
small! It would be a different story if `Tail_Number` had high cardinality -
like if it was IP Addresses or User Identifiers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "652988ac-c256-46d4-a4ea-dbcf0e023991",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateCountDistinct\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "df = pd.DataFrame(resp.rows)\n",
+ "df.plot.bar(x='Reporting_Airline', y='Events')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "08c91329-8d05-46eb-8c19-5eaf9043dcb6",
+ "metadata": {},
+ "source": [
+ "### Comparing approximate and non-approximate results\n",
+ "\n",
+ "On the surface, these do not _look_ different. And, in a lot of user
interfaces, that's perfectly fine!\n",
+ "\n",
+ "But let's go a bit deeper and see how the results actually differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c05a031f-a805-45dd-935b-d8af808041a6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "'''\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateCountDistinct\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+ "df2 = pd.DataFrame(resp.rows)\n",
+ "\n",
+ "df3 = df1.compare(df2, keep_equal=True)\n",
+ "df3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c8f3320d-d4ec-460a-b1fc-59c98f85cc3a",
+ "metadata": {},
+ "source": [
+ "There are _value_ errors, as you might expect with approximation. This
therefore affects _ordering_ of results.\n",
+ "\n",
+ "_Error in sketch-based approximation is probabilistic, rather than
guaranteed. That's to say that a certain percentage of the time you can expect
the measurements you take to be within a certain distance of the true value.
Also, their size is not dependent on the data – the default size of a sketch in
Druid is just over 2000 bytes._\n",
Review Comment:
Why is this in italics?
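Formatting aside, the probabilistic error and fixed size are worth illustrating. This is a toy estimator, not what Druid ships (Druid uses HLL and Theta sketches from Apache DataSketches): a minimal k-minimum-values (KMV) sketch showing how a small, fixed-size summary yields an approximate distinct count and merges cheaply:

```python
import bisect
import hashlib

class KMVSketch:
    """K-minimum-values sketch: keep only the k smallest hash values seen.
    Estimate the distinct count as (k - 1) / (k-th smallest hash). The
    summary is bounded by k entries no matter how much data flows through
    it, and two sketches merge by taking the k smallest of their union."""

    def __init__(self, k=1024):
        self.k = k
        self.mins = []  # sorted, at most k distinct hash values in [0, 1)

    def _hash(self, item):
        digest = hashlib.md5(str(item).encode()).digest()
        return int.from_bytes(digest, "big") / 2**128  # ~uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if (len(self.mins) < self.k or h < self.mins[-1]) and h not in self.mins:
            bisect.insort(self.mins, h)
            del self.mins[self.k:]  # drop anything beyond the k smallest

    def merge(self, other):
        merged = KMVSketch(self.k)
        merged.mins = sorted(set(self.mins) | set(other.mins))[:self.k]
        return merged

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # fewer than k distinct values: exact
        return (self.k - 1) / self.mins[-1]

# Two "data servers" each sketch their own (overlapping) slice of tail numbers...
s1, s2 = KMVSketch(), KMVSketch()
for i in range(6_000):
    s1.add(f"tail-{i}")
for i in range(4_000, 10_000):
    s2.add(f"tail-{i}")
# ...and only the small summaries are merged, not the full value lists.
print(round(s1.merge(s2).estimate()))  # near the true count of 10,000
```

The relative standard error of a KMV estimate is roughly 1/√k, about 3% at k = 1024, which is the "probabilistic, rather than guaranteed" behavior the quoted paragraph describes.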
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
Review Comment:
```suggestion
    "Druid automatically looks for query patterns that benefit from
approximation. In this instance, Druid identifies a match for approximate
`COUNT(DISTINCT)`.\n",
```
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
Review Comment:
We're actually hitting `router:8888`; see line 47. If they are running Druid
and Jupyter together via Docker Compose, this is right. If Druid runs on the
host machine while the notebook runs in Docker, it's a different hostname:
`host.docker.internal`.
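To make the choice explicit, the setup cell could branch on where the notebook runs. Sketch only; the `DRUID_SETUP` environment variable is hypothetical, not something the tutorial image defines:

```python
import os

def pick_druid_host(env=os.environ):
    """Choose the Router URL based on where this notebook runs
    relative to Druid (illustrative convention, not part of the notebook)."""
    if env.get("DRUID_SETUP") == "compose":
        # Jupyter and Druid share the tutorial's Docker Compose network
        return "http://router:8888"
    if env.get("DRUID_SETUP") == "docker-host":
        # Druid runs on the host machine; the notebook runs in a container
        return "http://host.docker.internal:8888"
    # Both run directly on the host machine
    return "http://localhost:8888"

druid_host = pick_druid_host()
```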
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-approxCountDistinct.ipynb:
##########
@@ -0,0 +1,470 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Counting distinct values\n",
+ "\n",
+ "__It's extremely common for analysts to want to count unique occurrences
of some dimension value in data. With the Druid database's history of large
volumes of data comes an advanced computer science technique to speed up this
calculation through approximation. In this tutorial, work through some examples
and see the effect of turning it on and off, and of making it even faster by
pre-generating the objects that Druid uses to execute the query.__\n",
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "You must also have loaded the \"FlightCarrierOnTime (1 month)\" sample
data, using defaults, into the table
`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`.\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "* A running Druid instance.\n",
+ " * Update the `druid_host` variable to point to your Router endpoint.
For example, `druid_host = \"http://localhost:8888\"`.\n",
+ "* The following Python packages:\n",
+ " * `druidapi`, a Python client for Apache Druid\n",
+ "\n",
+ "To start this tutorial, run the next cell. It defines variables for two
datasources and the Druid host the tutorial uses. The quickstart deployment
configures Druid to listen on port `8888` by default, so you'll make API calls
against `http://localhost:8888`.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host\n",
+ "\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## COUNT(DISTINCT) queries on basic datasets\n",
+ "\n",
+ "Here's a very simple query to find the number of distinct Tail Numbers in
the example dataset.\n",
+ "\n",
+ "```sql\n",
+ "SELECT \"Reporting_Airline\", COUNT(DISTINCT \"Tail_Number\") AS
\"Events\"\n",
+ "FROM
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "95a8d8bf-69fa-4266-b171-cb550009e89e",
+ "metadata": {},
+ "source": [
+ "### Running COUNT(DISTINCT) with approximation\n",
+ "\n",
+ "Druid will automatically look for patterns of query that can make use of
approximation. In this instance, Druid will identify a match for approximate
`COUNT(DISTINCT)`.\n",
+ "\n",
+ "Each data server computes its own intermediate results for merging into
the final result set. In a `COUNT(DISTINCT)` query, that means that they will
create their own count inside a representation called a [data
sketch](https://datasketches.apache.org/). These much smaller objects are then
merged together when the query results are finalized, rather than Druid having
to combine the individual lists of distinct values from each process.\n",
+ "\n",
+ "This translates into much faster query execution, especially when the
intermediate results are large – say when there are a lot of unique values in
the source data.\n",
+ "\n",
+ "It also means that the most scalable part of Druid – the individual data
servers – do much more of the work, and do so earlier, instead of leaving it to
the merge stage.\n",
+ "\n",
+ "> Approximations improve scalability, storage, and memory use - at the
cost of some error.\n",
+ "> \n",
+ "> _[Gian Merlino](https://github.com/gianm)_\n",
+ "\n",
+ "Let's run this query with all of Druid's defaults to see what the results
are like. (We can safely omit a `__time` filter thanks to the tiny size of the
example dataset.)"
Review Comment:
We tend to avoid first person in favor of second person. Maybe "Try running
this query...". Also avoid the parentheses.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]