techdocsmith commented on code in PR #14726:
URL: https://github.com/apache/druid/pull/14726#discussion_r1295219171
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
Review Comment:
```suggestion
"You can optionally monitor the ingestion in the Druid console while it
runs."
```
nit
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
+ "\n",
+ "The first query in the cell below, `set1`, returns the ten first edits to
any \"fr\"-like `channel` between midday and 1pm on the 27th June 2016. The
second query repeats this but for any \"en\"-like `channel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#fr%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#en%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " )\n",
+ " \n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+ "metadata": {},
+ "source": [
+ "This is what's known as a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, `set1` was calculated, and the results of subsequent
sets were then appended.\n",
Review Comment:
```suggestion
"This is a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, Druid calculated `set1` and appended subsequent
results sets.\n",
```
nit. extra words. passive
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
+ "\n",
+ "The first query in the cell below, `set1`, returns the ten first edits to
any \"fr\"-like `channel` between midday and 1pm on the 27th June 2016. The
second query repeats this but for any \"en\"-like `channel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#fr%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#en%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " )\n",
+ " \n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+ "metadata": {},
+ "source": [
+ "This is what's known as a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, `set1` was calculated, and the results of subsequent
sets were then appended.\n",
+ "\n",
+ "Notice that these results are not in order by time – even though the
individual sets did `ORDER BY` time. Druid simply concatenated the two result
sets together.\n",
+ "\n",
+ "Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
query, and that Druid's planning process has taken time to optimize how the
query above will actually execute."
Review Comment:
```suggestion
"Optionally, run the next cell to show the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
subquery. Also, Druid's planning process optimized execution of the outer
query."
```
nit: wordy
also: is there a resource to link to to help with reading explain plans?
some of us might not be expert 😳
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
Review Comment:
Address placeholder text.
```suggestion
"Finally, run the following cell to import the Python JSON module."
```
I think JSON is pretty standard so you don't need a big description of how
you'll use it. Also @sergioferragut suggested placing imports near where
they're used, but for this PR, it is enough to clean up the placeholder.
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
+ "\n",
+ "The first query in the cell below, `set1`, returns the ten first edits to
any \"fr\"-like `channel` between midday and 1pm on the 27th June 2016. The
second query repeats this but for any \"en\"-like `channel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#fr%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#en%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " )\n",
+ " \n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+ "metadata": {},
+ "source": [
+ "This is what's known as a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, `set1` was calculated, and the results of subsequent
sets were then appended.\n",
+ "\n",
+ "Notice that these results are not in order by time – even though the
individual sets did `ORDER BY` time. Druid simply concatenated the two result
sets together.\n",
+ "\n",
+ "Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
query, and that Druid's planning process has taken time to optimize how the
query above will actually execute."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "97934da2-17d1-4c91-8ae3-926cc89185c1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']),
indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "800add1a-d459-4796-b974-b2f094db417f",
+ "metadata": {},
+ "source": [
+ "Run next cell to perform another top-level UNION ALL, this time where the
sets use `GROUP BY`.\n",
+ "\n",
+ "Notice that the aggregates have `AS` to set specific field names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e687466-74bb-4cc0-ba17-913d1807fc60",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " TIME_FLOOR(__time, 'PT1H') AS \"Period\",\n",
+ " countryName,\n",
+ " AVG(commentLength) AS \"Average Comment Size\",\n",
+ " COUNT(DISTINCT \"page\") AS \"Pages\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE countryName='China'\n",
+ " GROUP BY 1, 2\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " TIME_FLOOR(__time, 'PT1H') AS \"Episode\",\n",
+ " countryName,\n",
+ " COUNT(DISTINCT \"page\") AS \"Pages\",\n",
+ " AVG(commentLength) AS \"Average Comment Length\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE countryName='Austria'\n",
+ " GROUP BY 1, 2\n",
+ " LIMIT 10\n",
+ " )\n",
+ "\n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2c95ffc-b260-4671-bacc-c8cc3137e9c2",
+ "metadata": {},
+ "source": [
+ "Look carefully at these results - results from `set2` were appended very
simply.\n",
Review Comment:
```suggestion
"Look carefully at these results - Druid has simply appended the results
from `set2` to `set1` without introducing redundant columns.\n",
```
nit. passive + "very"
Great demo of this behavior!
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
Review Comment:
```suggestion
"Run the next cell to attempt a connection to Druid services. If
successful, the output shows the Druid version number."
```
nit:passive
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
+ "\n",
+ "The first query in the cell below, `set1`, returns the ten first edits to
any \"fr\"-like `channel` between midday and 1pm on the 27th June 2016. The
second query repeats this but for any \"en\"-like `channel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#fr%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#en%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " )\n",
+ " \n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+ "metadata": {},
+ "source": [
+ "This is what's known as a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, `set1` was calculated, and the results of subsequent
sets were then appended.\n",
+ "\n",
+ "Notice that these results are not in order by time – even though the
individual sets did `ORDER BY` time. Druid simply concatenated the two result
sets together.\n",
+ "\n",
+ "Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
query, and that Druid's planning process has taken time to optimize how the
query above will actually execute."
Review Comment:
```suggestion
"Optionally, run the next cell to show the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
subquery. Also, Druid's planning process optimized execution of the outer
query."
```
nit: wordy
also: is there a resource to link to to help with reading explain plans?
some of us might not be expert 😳
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
Review Comment:
```suggestion
"You can use `UNION ALL` to append the results of one query with
another.\n",
```
nit: favor 2nd persion "you" over 1st person "we"
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,541 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using `UNION ALL` to address multiple `TABLE`s in the same query\n",
+ "\n",
+ "While working with Druid, you may need to bring together two different
tables of results together into a single result list, or to treat multiple
tables as a single input to a query. This notebook introduces the `UNION ALL`
operator, walking through two ways in which this operator can be used to
achieve this result: top-level and table-level `UNION ALL`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run Druid without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance, with a `DRUID_HOST` local environment
variable containing the servername of your Druid router\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+ "metadata": {},
+ "source": [
+ "### Initialization\n",
+ "\n",
+ "Run the next cell to attempt a connection to Druid services. If
successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc3001a0-27e5-4f41-876a-ce6eab2acd6a",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use to X, Y, Z."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6b058d8b-2bae-4929-ab0c-5a6df1850387",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Using Top-level `UNION ALL` to concatenate result sets\n",
+ "\n",
+ "Run the following cell to ingest the wikipedia data example. Once
completed, you will see a description of the new table.\n",
+ "\n",
+ "Monitor the ingestion in the Druid console while it runs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a399196b-12db-42ff-ae24-c7232f150aba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-wikipedia-unionall\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"isRobot\",\n",
+ " \"channel\",\n",
+ " \"flags\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"page\",\n",
+ " \"diffUrl\",\n",
+ " \"added\",\n",
+ " \"comment\",\n",
+ " \"commentLength\",\n",
+ " \"isNew\",\n",
+ " \"isMinor\",\n",
+ " \"delta\",\n",
+ " \"isAnonymous\",\n",
+ " \"user\",\n",
+ " \"deltaBucket\",\n",
+ " \"deleted\",\n",
+ " \"namespace\",\n",
+ " \"cityName\",\n",
+ " \"countryName\",\n",
+ " \"regionIsoCode\",\n",
+ " \"metroCode\",\n",
+ " \"countryIsoCode\",\n",
+ " \"regionName\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-wikipedia-unionall')\n",
+ "display.table('example-wikipedia-unionall')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24b47cc3-68f5-4a73-b374-94bbfa32d91d",
+ "metadata": {},
+ "source": [
+ "With `UNION ALL`, we can append the results of one query with another.\n",
+ "\n",
+ "The first query in the cell below, `set1`, returns the ten first edits to
any \"fr\"-like `channel` between midday and 1pm on the 27th June 2016. The
second query repeats this but for any \"en\"-like `channel`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#fr%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " __time,\n",
+ " \"channel\",\n",
+ " \"page\",\n",
+ " \"isRobot\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2016-06-27 12:00:00'\n",
+ " AND channel LIKE '#en%'\n",
+ " ORDER BY __time\n",
+ " LIMIT 10\n",
+ " )\n",
+ " \n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+ "metadata": {},
+ "source": [
+ "This is what's known as a
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level)
`UNION` operation. First, `set1` was calculated, and the results of subsequent
sets were then appended.\n",
+ "\n",
+ "Notice that these results are not in order by time – even though the
individual sets did `ORDER BY` time. Druid simply concatenated the two result
sets together.\n",
+ "\n",
+ "Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for
the query. You can see there are two `query` execution plans, one for each
query, and that Druid's planning process has taken time to optimize how the
query above will actually execute."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "97934da2-17d1-4c91-8ae3-926cc89185c1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']),
indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "800add1a-d459-4796-b974-b2f094db417f",
+ "metadata": {},
+ "source": [
+ "Run next cell to perform another top-level UNION ALL, this time where the
sets use `GROUP BY`.\n",
+ "\n",
+ "Notice that the aggregates have `AS` to set specific field names."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e687466-74bb-4cc0-ba17-913d1807fc60",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "WITH\n",
+ "set1 AS (\n",
+ " SELECT\n",
+ " TIME_FLOOR(__time, 'PT1H') AS \"Period\",\n",
+ " countryName,\n",
+ " AVG(commentLength) AS \"Average Comment Size\",\n",
+ " COUNT(DISTINCT \"page\") AS \"Pages\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE countryName='China'\n",
+ " GROUP BY 1, 2\n",
+ " LIMIT 10\n",
+ " ),\n",
+ "set2 AS (\n",
+ " SELECT\n",
+ " TIME_FLOOR(__time, 'PT1H') AS \"Episode\",\n",
+ " countryName,\n",
+ " COUNT(DISTINCT \"page\") AS \"Pages\",\n",
+ " AVG(commentLength) AS \"Average Comment Length\"\n",
+ " FROM \"example-wikipedia-unionall\"\n",
+ " WHERE countryName='Austria'\n",
+ " GROUP BY 1, 2\n",
+ " LIMIT 10\n",
+ " )\n",
+ "\n",
+ "SELECT * from set1\n",
+ "UNION ALL\n",
+ "SELECT * from set2\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f2c95ffc-b260-4671-bacc-c8cc3137e9c2",
+ "metadata": {},
+ "source": [
+ "Look carefully at these results - results from `set2` were appended very
simply.\n",
+ "\n",
+ "* Column name in `set2` (`Period` versus `Episode` and `Average Comment
Size` versus `Average Comment Length`) did not result in new columns\n",
+ "* Columns with the same name (`Pages`) did not result in that aggregate
being put into same column - Austria's values are simply appended `Average
Comment Size`\n",
+ "\n",
+ "Run the next cell, which uses explicit column names at the top-level,
rather than `*`, to ensure the calculations appear in the right columns in the
final result. It also explicitly names the column names for the results by
using `AS`."
Review Comment:
```suggestion
"Run the next cell, which uses explicit column names at the top-level,
rather than `*`, to ensure the calculations appear in the right columns in the
final result. It also aliases the columns for the results by using `AS`."
```
I think this is called "aliasing", but be sure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]