petermarshallio commented on code in PR #14726:
URL: https://github.com/apache/druid/pull/14726#discussion_r1281543982


##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/04-UnionOperations.ipynb:
##########
@@ -0,0 +1,644 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+   "metadata": {},
+   "source": [
+    "# Performing set operations\n",
+    "\n",
+    "A common requirement is to concatenate the results of several queries into a single list. In this tutorial, you work through examples of the different techniques available in Druid."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf4554ae-6516-4e76-b202-d6e2fdf31603",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 26.0.0 or later.\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter` 
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more 
information, see [Docker for Jupyter Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "\n",
+    "<details><summary>    \n",
+    "<b>Run without Docker Compose</b>    \n",
+    "</summary>\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Druid instance.\n",
+    "* 
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
 a Python client for Apache Druid. Follow the instructions in the Install 
section of the README file.\n",
+    "* [matplotlib](https://matplotlib.org/), a library for creating visualizations in Python.\n",
+    "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation 
tool.\n",
+    "* Jupyter notebook or Jupyter Lab. See 
[jupyter.org](https://jupyter.org/) for installation instructions.\n",
+    "\n",
+    "</details>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee0c3171-def8-4ad9-9c56-d3a67f309631",
+   "metadata": {},
+   "source": [
+    "### Initialization\n",
+    "\n",
+    "Run the next cell to attempt a connection to Druid services. If successful, the output shows the Druid version number."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9fa4abfe-f878-4031-88f2-94c13e922279",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "if 'DRUID_HOST' not in os.environ.keys():\n",
+    "    druid_host=f\"http://localhost:8888\"\n",
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "    \n",
+    "print(f\"Opening a connection to {druid_host}.\")\n",
+    "druid = druidapi.jupyter_client(druid_host)\n",
+    "\n",
+    "display = druid.display\n",
+    "sql_client = druid.sql\n",
+    "status_client = druid.status\n",
+    "\n",
+    "status_client.version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d948743d-10cf-4df9-bff2-92a79535ec89",
+   "metadata": {},
+   "source": [
+    "### Load example flight data\n",
+    "\n",
+    "Once your Druid environment is up and running, ingest the sample data for 
this tutorial.\n",
+    "\n",
+    "Open the Druid console:\n",
+    "\n",
+    "1. Select \"Load data\".\n",
+    "2. Select \"Batch - SQL\".\n",
+    "3. Select \"Example data\".\n",
+    "4. Select \"FlightCarrierOnTime (1 month)\".\n",
+    "\n",
+    "For the purposes of this notebook, use all the defaults suggested by the 
console, including the default datasource name: \n",
+    "\n",
+    "`On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b3c7fc0-cb5c-43a0-aa53-2c62053181b0",
+   "metadata": {},
+   "source": [
+    "When the ingestion is complete, run the following cell for the final part of the initialization. It imports the libraries used in the rest of this notebook."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import matplotlib\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+   "metadata": {},
+   "source": [
+    "## Concatenating result sets with `UNION ALL`\n",
+    "\n",
+    "Execute the following cell to combine the results of two queries: one that returns 10 flights taking off from San Francisco at around 11 o'clock in the morning, and another with flights departing from Atlanta in the same hour."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "WITH\n",
+    "set1 AS (\n",
+    "  SELECT\n",
+    "  __time,\n",
+    "  \"Origin\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Flight_Number_Reporting_Airline\"\n",
+    "  FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "  WHERE Origin = 'SFO'\n",
+    "  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "  ORDER BY __time\n",
+    "  LIMIT 10\n",
+    "  ),\n",
+    "set2 AS (\n",
+    "SELECT\n",
+    "  __time,\n",
+    "  \"Origin\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Flight_Number_Reporting_Airline\"\n",
+    "  FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "  WHERE Origin = 'ATL'\n",
+    "  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "  ORDER BY __time\n",
+    "  LIMIT 10\n",
+    "  )\n",
+    "  \n",
+    "SELECT * from set1\n",
+    "UNION ALL\n",
+    "SELECT * from set2\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f5e77fa9-a60c-4955-b763-58d970d7326d",
+   "metadata": {},
+   "source": [
+    "This is what's known as a 
[top-level](https://druid.apache.org/docs/latest/querying/sql.html#top-level) 
`UNION` operation: each set of results was gathered individually, one after the 
other, and the list of results concatenated.\n",
+    "\n",
+    "Notice that these results are not in order by time – even though the 
individual sets did `ORDER BY` time. Druid has simply concatenated the two 
result sets together.\n",
+    "\n",
+    "The following cell contains another valid use of `UNION ALL`, where some filtering and `GROUP BY` operations have been added. Notice that some of this work happens in the sets themselves and some as part of the top-level `UNION`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3514ad80-cbe6-4563-8b7b-9d48f8b9a508",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "WITH\n",
+    "set1 AS (\n",
+    "    SELECT \"Reporting_Airline\", \"Distance\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "  ),\n",
+    "set2 AS (\n",
+    "    SELECT\n",
+    "        \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Frights\",\n",
+    "        MAX(Distance) AS \"Lengthiest\",\n",
+    "        MIN(Distance) AS \"Shortest\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'\n",
+    "    AND \"Reporting_Airline\" LIKE 'A%'\n",
+    "    GROUP BY 1\n",
+    "  )\n",
+    "\n",
+    "SELECT \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Frights\",\n",
+    "        MAX(Distance) AS \"Lengthiest\",\n",
+    "        MIN(Distance) AS \"Shortest\" from set1\n",
+    "  WHERE \"Reporting_Airline\" LIKE 'AA'\n",
+    "  GROUP BY 1\n",
+    "UNION ALL\n",
+    "SELECT * from set2\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1beb4ad7-93cf-4922-9f57-d7b9739fea4a",
+   "metadata": {},
+   "source": [
+    "Optionally, run the next cell – it shows the precise `EXPLAIN PLAN` for the query. You can see there are two `query` execution plans, one for each of our queries, and that Druid's planner has optimized how the query above will actually execute."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "97934da2-17d1-4c91-8ae3-926cc89185c1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), 
indent=2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "800add1a-d459-4796-b974-b2f094db417f",
+   "metadata": {},
+   "source": [
+    "The next cell defines two result sets: `set1` provides statistics for 1 November, `set2` for 2 November.\n",
+    "\n",
+    "The `UNION ALL` operation simply concatenates the two sets of results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e687466-74bb-4cc0-ba17-913d1807fc60",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "WITH\n",
+    "set1 AS (\n",
+    "    SELECT\n",
+    "        \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Flights\",\n",
+    "        MIN(Distance) AS \"Shortest\",\n",
+    "        MAX(Distance) AS \"Longest\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "    AND \"Reporting_Airline\" LIKE 'A%'\n",
+    "    GROUP BY 1\n",
+    "  ),\n",
+    "set2 AS (\n",
+    "    SELECT\n",
+    "        \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Frights\",\n",
+    "        MAX(Distance) AS \"Lengthiest\",\n",
+    "        MIN(Distance) AS \"Shortest\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'\n",
+    "    AND \"Reporting_Airline\" LIKE 'A%'\n",
+    "    GROUP BY 1\n",
+    "  )\n",
+    "\n",
+    "SELECT * from set1\n",
+    "UNION ALL\n",
+    "SELECT * from set2\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2c95ffc-b260-4671-bacc-c8cc3137e9c2",
+   "metadata": {},
+   "source": [
+    "As expected, we see a concatenated set of results - but wait! Why are the 
longest and shortest the wrong way around?!\n",
+    "\n",
+    "As the `UNION ALL` concatenated the sets, it simply appended the results for 2 November to the end of the results for 1 November. It did not take into account that the columns in `set2` were in a different order, nor did it notice _either_ of the mistakes in the field names.\n",
+    "\n",
+    "Instead, the query against `set2` ought to have been more explicit, using the correct field names (`Flights` versus `Frights`!) and matching the column order of the first result set - i.e., `set1`'s schema:\n",
+    "\n",
+    "```sql\n",
+    "SELECT \"Flights\", \"Shortest\", \"Longest\" from set1\n",
+    "UNION ALL\n",
+    "SELECT \"Frights\", \"Shortest\", \"Lengthiest\" from set2\n",
+    "```\n",
+    "\n",
+    "Run the cell below to see what difference this makes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "39f9be92-7b2e-417c-b16a-5060b8cd2c30",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "WITH\n",
+    "set1 AS (\n",
+    "    SELECT\n",
+    "        \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Flights\",\n",
+    "        MIN(Distance) AS \"Shortest\",\n",
+    "        MAX(Distance) AS \"Longest\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "    AND \"Reporting_Airline\" LIKE 'A%'\n",
+    "    GROUP BY 1\n",
+    "  ),\n",
+    "set2 AS (\n",
+    "    SELECT\n",
+    "        \"Reporting_Airline\",\n",
+    "        COUNT(*) AS \"Frights\",\n",
+    "        MAX(Distance) AS \"Lengthiest\",\n",
+    "        MIN(Distance) AS \"Shortest\"\n",
+    "    FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    WHERE DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-02 11:00:00'\n",
+    "    AND \"Reporting_Airline\" LIKE 'A%'\n",
+    "    GROUP BY 1\n",
+    "  )\n",
+    "\n",
+    "SELECT \"Flights\", \"Shortest\", \"Longest\" from set1\n",
+    "UNION ALL\n",
+    "SELECT \"Frights\", \"Shortest\", \"Lengthiest\" from set2\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f31b3244-f6bc-4ba6-bb4d-7fb4057d222e",
+   "metadata": {},
+   "source": [
+    "## Working with concatenated result sets\n",
+    "\n",
+    "So far we have seen \"[top level](https://druid.apache.org/docs/26.0.0/querying/sql.html#top-level)\" UNIONs in operation. That is, the `UNION ALL` sits at the very highest level of the query.\n",
+    "\n",
+    "When we want to apply operations to the _entire_ result set, such as using it in a subquery or a `FROM` clause, Druid changes how it plans and executes the query. Instead of a top-level `UNION ALL`, it switches to a \"[table level](https://druid.apache.org/docs/26.0.0/querying/sql.html#table-level)\" query. For example, to put the results of our very first cell into `__time` order, we might use a query that applies `ORDER BY` to a `UNION` of `set1` (San Francisco flights) and `set2` (Atlanta flights), like the following:\n",
+    "\n",
+    "```sql\n",
+    "SELECT \"Origin\",\n",
+    "    \"Tail_Number\",\n",
+    "    \"Flight_Number_Reporting_Airline\"\n",
+    "FROM (\n",
+    "    SELECT * from set1\n",
+    "    UNION ALL\n",
+    "    SELECT * from set2\n",
+    "    )\n",
+    "ORDER BY __time\n",
+    "```\n",
+    "\n",
+    "Run the cell below to see what happens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b557a62-623d-4d16-b4a1-3bd6484efe35",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "WITH\n",
+    "set1 AS (\n",
+    "  SELECT\n",
+    "  __time,\n",
+    "  \"Origin\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Flight_Number_Reporting_Airline\"\n",
+    "  FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "  WHERE Origin = 'SFO'\n",
+    "  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "  ),\n",
+    "set2 AS (\n",
+    "SELECT\n",
+    "  __time,\n",
+    "  \"Origin\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Flight_Number_Reporting_Airline\"\n",
+    "  FROM 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "  WHERE Origin = 'ATL'\n",
+    "  AND DATE_TRUNC('HOUR', __time) = TIMESTAMP '2005-11-01 11:00:00'\n",
+    "  )\n",
+    "\n",
+    "SELECT __time,\n",
+    "    \"Origin\",\n",
+    "    \"Tail_Number\",\n",
+    "    \"Flight_Number_Reporting_Airline\"\n",
+    "FROM (\n",
+    "    SELECT * from set1\n",
+    "    UNION ALL\n",
+    "    SELECT * from set2\n",
+    "    )\n",
+    "ORDER BY __time\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "25001794-e1a7-4325-adb3-2b8f26036261",
+   "metadata": {},
+   "source": [
+    "Notice the `ClientError` that Druid returns. Rather than simply concatenating two `TABLE`s of results, Druid switched to a table-level `UNION ALL`, and with it to a [`union`](https://druid.apache.org/docs/26.0.0/querying/datasource.html#union) datasource. These cannot be used in the same way as `TABLE`s; there are constraints on what you can do with them. An `EXPLAIN PLAN` for these types of queries shows this in detail:\n",
+    "\n",
+    "```json\n",
+    "\"type\": \"union\",\n",
+    "\"dataSources\": [\n",
+    "    {\n",
+    "    \"type\": \"table\",\n",
+    "    \"name\": 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    },\n",
+    "    {\n",
+    "    \"type\": \"table\",\n",
+    "    \"name\": 
\"On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11\"\n",
+    "    }\n",
+    "```\n",
+    "\n",
+    "In large deployments of Druid, data engineers may decide to break up a 
single data set into multiple `TABLE`s. Some reasons for this might be:\n",
+    "\n",
+    "* `TABLE`s separated by their `__time` granularity (i.e., the level of roll-up)\n",
+    "* `TABLE`s logically separated for security reasons by tenant\n",
+    "* `TABLE`s broken up by different filter conditions\n",
+    "* `TABLE`s separated by upstream data source (e.g. different Kafka topics for different cloud regions, different groups of source devices, and so on)\n",
+    "* `TABLE`s that break the entire data set up into different periods of 
time, perhaps with different retention periods\n",
+    "\n",
+    "The next two cells create two new tables, `wikipedia-en` and `wikipedia-fr`. One table contains only English-language channels, while the second contains only French-language channels. Imagine that a data engineer made this design decision so that the two data sets can be governed separately. But now we want a single view of both that people can query.\n",
+    "\n",
+    "Run these ingestion jobs, and monitor their progress in the Druid console."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "269c6aef-c3a5-46ad-8332-30b7bf30ddfb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "REPLACE INTO \"wikipedia-en\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\" 
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR, 
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\" 
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT, 
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\" 
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR, 
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR, 
\"regionName\" VARCHAR))\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+    "  \"isRobot\",\n",
+    "  \"channel\",\n",
+    "  \"flags\",\n",
+    "  \"isUnpatrolled\",\n",
+    "  \"page\",\n",
+    "  \"diffUrl\",\n",
+    "  \"added\",\n",
+    "  \"comment\",\n",
+    "  \"commentLength\",\n",
+    "  \"isNew\",\n",
+    "  \"isMinor\",\n",
+    "  \"delta\",\n",
+    "  \"isAnonymous\",\n",
+    "  \"user\",\n",
+    "  \"deltaBucket\",\n",
+    "  \"deleted\",\n",
+    "  \"namespace\",\n",
+    "  \"cityName\",\n",
+    "  \"countryName\",\n",
+    "  \"regionIsoCode\",\n",
+    "  \"metroCode\",\n",
+    "  \"countryIsoCode\",\n",
+    "  \"regionName\"\n",
+    "FROM \"ext\"\n",
+    "WHERE \"channel\" LIKE '#en%'\n",
+    "PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready('wikipedia-en')\n",
+    "display.table('wikipedia-en')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "61740d61-28fc-48e9-b026-d472bd04f390",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "REPLACE INTO \"wikipedia-fr\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\" 
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR, 
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\" 
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT, 
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\" 
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR, 
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR, 
\"regionName\" VARCHAR))\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+    "  \"isRobot\",\n",
+    "  \"channel\",\n",
+    "  \"flags\",\n",
+    "  \"isUnpatrolled\",\n",
+    "  \"page\",\n",
+    "  \"diffUrl\",\n",
+    "  \"added\",\n",
+    "  \"comment\",\n",
+    "  \"commentLength\",\n",
+    "  \"isNew\",\n",
+    "  \"isMinor\",\n",
+    "  \"delta\",\n",
+    "  \"isAnonymous\",\n",
+    "  \"user\",\n",
+    "  \"deltaBucket\",\n",
+    "  \"deleted\",\n",
+    "  \"namespace\",\n",
+    "  \"cityName\",\n",
+    "  \"countryName\",\n",
+    "  \"regionIsoCode\",\n",
+    "  \"metroCode\",\n",
+    "  \"countryIsoCode\",\n",
+    "  \"regionName\"\n",
+    "FROM \"ext\"\n",
+    "WHERE \"channel\" LIKE '#fr%'\n",
+    "PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready('wikipedia-fr')\n",
+    "display.table('wikipedia-fr')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8bbf2c6-681a-46f5-82f2-201cbbe8058d",
+   "metadata": {},
+   "source": [
+    "In the next cell, we declare our `UNION ALL` data source as the single, 
unified view of all the data. Then we use that in our outer `SELECT` to count 
the number of robot and non-robot edits by channel.\n",
+    "\n",
+    "Remember, the `SELECT` statements inside `unifiedSource` must be simple in order to meet the constraints of a table-level `UNION ALL`, so any filtering and so on can only be done in the outer `SELECT` statement."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "783fe77d-2e7b-476a-9748-67ea90c8bb91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "WITH unifiedSource AS (\n",
+    "    SELECT\n",
+    "        \"__time\",\n",
+    "        \"isRobot\",\n",
+    "        \"channel\",\n",
+    "        \"user\",\n",
+    "        \"countryName\"\n",
+    "    FROM \"wikipedia-en\"\n",
+    "    UNION ALL\n",
+    "    SELECT\n",
+    "        \"__time\",\n",
+    "        \"isRobot\",\n",
+    "        \"channel\",\n",
+    "        \"user\",\n",
+    "        \"countryName\"\n",
+    "    FROM \"wikipedia-fr\"\n",
+    "    )\n",
+    "\n",
+    "SELECT\n",
+    "    \"channel\",\n",
+    "    COUNT(*) FILTER (WHERE isRobot=true) AS \"Robot Edits\",\n",
+    "    COUNT (DISTINCT user) FILTER (WHERE isRobot=true) AS \"Robot 
Editors\",\n",
+    "    COUNT(*) FILTER (WHERE isRobot=false) AS \"Human Edits\",\n",
+    "    COUNT (DISTINCT user) FILTER (WHERE isRobot=false) AS \"Human 
Editors\"\n",
+    "FROM unifiedSource\n",
+    "GROUP BY 1\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)"
+   ]
+  },
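+  {
+   "cell_type": "markdown",
+   "id": "9a1f2b3c-4d5e-4f60-8a7b-1c2d3e4f5a6b",
+   "metadata": {},
+   "source": [
+    "Optionally, use the pandas and matplotlib imports from earlier to chart results like these. The cell below is a sketch only: the hard-coded rows stand in for the output of the query above, so substitute your own numbers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7b8c9d0e-1f2a-4b3c-9d4e-5f6a7b8c9d0e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative rows standing in for the UNION ALL query results above\n",
+    "rows = [\n",
+    "    {'channel': '#en.wikipedia', 'Robot Edits': 1000, 'Human Edits': 2000},\n",
+    "    {'channel': '#fr.wikipedia', 'Robot Edits': 300, 'Human Edits': 800}\n",
+    "]\n",
+    "\n",
+    "# Build a DataFrame indexed by channel and draw a grouped bar chart\n",
+    "df = pd.DataFrame(rows).set_index('channel')\n",
+    "df.plot.bar()\n",
+    "plt.show()"
+   ]
+  },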
+  {
+   "cell_type": "markdown",
+   "id": "f58a1846-5072-4495-b840-a620de3c0442",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "* There are two modes for `UNION ALL` in Druid: top level and table level.\n",
+    "* Top level is a simple concatenation, and operations must be done in the source queries themselves.\n",
+    "* Table level uses a `union` datasource, and operations must be done in the outer `SELECT`.\n",
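+    "\n",
+    "As a recap, here is the shape of each mode, using the tables from this tutorial (sketches only - neither query is run above):\n",
+    "\n",
+    "```sql\n",
+    "-- Top level: each branch is a complete query; Druid concatenates the results\n",
+    "SELECT \"channel\", COUNT(*) AS \"Edits\" FROM \"wikipedia-en\" GROUP BY 1\n",
+    "UNION ALL\n",
+    "SELECT \"channel\", COUNT(*) AS \"Edits\" FROM \"wikipedia-fr\" GROUP BY 1\n",
+    "```\n",
+    "\n",
+    "```sql\n",
+    "-- Table level: simple SELECTs inside the UNION ALL; operations go in the outer query\n",
+    "SELECT \"channel\", COUNT(*) AS \"Edits\"\n",
+    "FROM (\n",
+    "    SELECT \"channel\" FROM \"wikipedia-en\"\n",
+    "    UNION ALL\n",
+    "    SELECT \"channel\" FROM \"wikipedia-fr\"\n",
+    ")\n",
+    "GROUP BY 1\n",
+    "```\n",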

Review Comment:
   Thank you! I ended up removing the trap that generates the error condition, 
and pulling a lot back up under the main headings.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

