[druid] branch master updated: 202306-docs-notebook topn (#14478)

techdocsmith Wed, 16 Aug 2023 14:51:01 -0700

This is an automated email from the ASF dual-hosted git repository.

techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new f585f0a8ed 202306-docs-notebook topn (#14478)
f585f0a8ed is described below

commit f585f0a8ed96b03140bbbcff8ec01283bafaa2f5
Author: Peter Marshall <[email protected]>
AuthorDate: Wed Aug 16 22:50:49 2023 +0100

    202306-docs-notebook topn (#14478)
    
    Co-authored-by: Charles Smith <[email protected]>
---
 .../notebooks/03-query/02-approxRanking.ipynb      | 596 +++++++++++++++++++++
 1 file changed, 596 insertions(+)

diff --git 
a/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
 
b/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
new file mode 100644
index 0000000000..e49ac92e84
--- /dev/null
+++ 
b/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
@@ -0,0 +1,596 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+   "metadata": {},
+   "source": [
+    "# Using TopN approximation in Druid queries\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "Imagine you’re building a dynamic filter in your app: you want to 
populate it with, say, the top most popular (COUNT) dimension values in 
descending order (ORDER BY). Druid speeds up this type of query using TopN 
approximation by default. In this tutorial, work through some examples and see 
the effect of turning approximation off."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c94ff5c9-ada9-4f1d-8541-649e70cfc9a3",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 26.0.0 or later.\n",
+    "\n",
+    "#### Run using Docker\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter` 
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more 
information, see [Docker for Jupyter Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "   \n",
+    "#### Run without using Docker\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Apache Druid instance, with a `DRUID_HOST` local environment 
variable containing the server name of your Druid router.\n",
+    "* 
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
 a Python client for Apache Druid. Follow the instructions in the Install 
section of the README file.\n",
+    "* [matplotlib](https://matplotlib.org/), a library for creating 
visualizations in Python.\n",
+    "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation 
tool."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e6b56cfc-9951-4a4e-a3f4-828e2dd5b3b5",
+   "metadata": {},
+   "source": [
+    "### Initialize Python\n",
+    "\n",
+    "Run the next cell to set up the Druid Python client's connection to 
Apache Druid.\n",
+    "\n",
+    "If successful, the Druid version number will be shown in the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "685b872e-0d59-4100-a636-39ec93c627fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "if 'DRUID_HOST' not in os.environ.keys():\n",
+    "    druid_host=f\"http://localhost:8888\"\n";,
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "    \n",
+    "print(f\"Opening a connection to {druid_host}.\")\n",
+    "druid = druidapi.jupyter_client(druid_host)\n",
+    "\n",
+    "display = druid.display\n",
+    "sql_client = druid.sql\n",
+    "status_client = druid.status\n",
+    "\n",
+    "status_client.version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d6fe9c99-ee0d-4205-9ca5-a8810c977335",
+   "metadata": {},
+   "source": [
+    "### Load example data\n",
+    "\n",
+    "Once your Druid environment is up and running, ingest the sample data for 
this tutorial.\n",
+    "\n",
+    "Run the following cell to create a table called `example-flights-topn`.  
When completed, you'll see a description of the final table.\n",
+    "\n",
+    "Monitor the ingestion task process in the Druid console."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e89a3000-a65e-4c4a-a917-3c37cbe975b3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "REPLACE INTO \"example-flights-topn\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip\"]}',\n",
+    "    '{\"type\":\"csv\",\"findColumnsFromHeader\":true}'\n",
+    "  )\n",
+    ") EXTEND (\"depaturetime\" VARCHAR, \"arrivalime\" VARCHAR, \"Year\" 
BIGINT, \"Quarter\" BIGINT, \"Month\" BIGINT, \"DayofMonth\" BIGINT, 
\"DayOfWeek\" BIGINT, \"FlightDate\" VARCHAR, \"Reporting_Airline\" VARCHAR, 
\"DOT_ID_Reporting_Airline\" BIGINT, \"IATA_CODE_Reporting_Airline\" VARCHAR, 
\"Tail_Number\" VARCHAR, \"Flight_Number_Reporting_Airline\" BIGINT, 
\"OriginAirportID\" BIGINT, \"OriginAirportSeqID\" BIGINT, 
\"OriginCityMarketID\" BIGINT, \"Origin\" VARCHAR, \"OriginCityNam [...]
+    "SELECT\n",
+    "  TIME_PARSE(\"depaturetime\") AS \"__time\",\n",
+    "  \"Reporting_Airline\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Distance\",\n",
+    "  \"Origin\",\n",
+    "  \"Flight_Number_Reporting_Airline\"\n",
+    "FROM \"ext\"\n",
+    "PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready('example-flights-topn')\n",
+    "display.table('example-flights-topn')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00141575-29b4-440e-b23f-f7c6b237ef28",
+   "metadata": {},
+   "source": [
+    "When this is completed, run the following cell to load some Python 
libraries we need to explore what TopN does."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import matplotlib\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+   "metadata": {},
+   "source": [
+    "## Example TopN style queries\n",
+    "\n",
+    "Druid looks for patterns in `SELECT` statements to determine if they 
would benefit from using approximation. A ranking query, like the following, 
matches the rules for TopN approximation, so Druid enables it by default.\n",
+    "\n",
+    "For Druid to automatically optimize for TopN, you need an SQL statement 
that has:\n",
+    "* A GROUP BY on one dimension, and\n",
+    "* an ORDER BY on one aggregate.\n",
+    "\n",
+    "Run this query to see what the results are like:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT\n",
+    "    \"Reporting_Airline\",\n",
+    "    COUNT(*) AS Flights,\n",
+    "    SUM(\"Distance\") AS SumDistance\n",
+    "FROM\n",
+    "    \"example-flights-topn\"\n",
+    "GROUP BY 1\n",
+    "ORDER BY 2 DESC\n",
+    "LIMIT 10\n",
+    "'''\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5600fc48-c999-406f-800b-3f0f6a973aa0",
+   "metadata": {},
+   "source": [
+    "Run the following cell, which uses the `explain_sql` method to show the 
[`EXPLAIN 
PLAN`](https://druid.apache.org/docs/latest/querying/sql-translation#interpreting-explain-plan-output)
 for this query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7595eec0-a709-4cd6-985e-eec8a6e37b61",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']), 
indent=2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8658b26e-2f09-4a97-96e8-589168130559",
+   "metadata": {},
+   "source": [
+    "The plan `queryType` is `topN`, showing that TopN approximation was 
used.\n",
+    "\n",
+    "Druid applied a `LIMIT` operation on the results calculated by each data 
service involved in the query, improving processing efficiency by minimizing 
the amount of data transferred to the Broker.\n",
+    "\n",
+    "This 
[pushed-down](https://druid.apache.org/docs/latest/querying/groupbyquery#limit-pushdown-optimization)
 `LIMIT` is the `max` of the `threshold` in the plan (which came from the 
`LIMIT` in the SQL) and the 
[`minTopNThreshold`](https://druid.apache.org/docs/latest/querying/topnquery.html#aliasing)
 setting in your cluster - the default being 1,000.\n",
+    "\n",
+    "To see the implication of this `LIMIT` in action, the cardinality of the 
`GROUP BY` dimension therefore needs to exceed this cap.\n",
+    "\n",
+    "Run the following query to discover the cardinality of the `GROUP BY` on 
`Reporting_Airline`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65a968e5-d51e-47e9-af04-88181f3b865b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT COUNT (DISTINCT \"Reporting_Airline\") AS 
UniqueReportingAirlines\n",
+    "FROM \"example-flights-topn\"\n",
+    "'''\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13c8a101-dcba-49be-8d05-0a5dbd9731ca",
+   "metadata": {},
+   "source": [
+    "The number of unique values is below the `LIMIT` cap, meaning, there is 
no trimming and the results are not approximate; all the data servers will 
return all their results, without trimming, to be merged and passed back to 
us.\n",
+    "\n",
+    "What is the cardinality for the `Tail_Number` dimension?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "71c1816b-f090-4a3d-b476-8d40eb9c2dec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT\n",
+    "    COUNT (DISTINCT \"Tail_Number\") AS UniqueTailNumbers\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Tail_Number\" <> ''\n",
+    "'''\n",
+    "display.sql(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72a1a858-bda8-464e-988b-c4ed80b63f43",
+   "metadata": {},
+   "source": [
+    "With this many distinct values to `GROUP BY`, the services involved in 
the query will trim their results when using the\n",
+    "`topN` engine.\n",
+    "\n",
+    "Run the next query to visualize the distribution of unique `Tail_Number`s 
in the example dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "208f1463-34dd-4b0e-aa78-e582e2133a8f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT\n",
+    "    \"Tail_Number\",\n",
+    "    COUNT(*) AS RecordCount\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Tail_Number\" <> ''\n",
+    "GROUP BY 1\n",
+    "ORDER BY 2 DESC\n",
+    "LIMIT 500\n",
+    "'''\n",
+    "\n",
+    "df4 = pd.DataFrame(sql_client.sql(sql))\n",
+    "\n",
+    "df4.plot(x='Tail_Number', y='RecordCount', marker='o')\n",
+    "plt.xticks(rotation=45, ha='right')\n",
+    "plt.gca().get_legend().remove()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b16d9a4d-c9e7-447b-8cdb-7ad1c0f88d73",
+   "metadata": {},
+   "source": [
+    "The plot shows that we have a long tail distribution, meaning there is a 
high likelihood the same `Tail_Number` will be in rank position one across the 
data set, and therefore across all segments. The flatter the distribution, the 
less reliable this assertion is.\n",
+    "\n",
+    "Take a look at the following cell to see a query that counts the number 
of records and sums total distance for each `Tail_Number`.\n",
+    "\n",
+    "Run the cell to execute this query in both TopN and non-TopN modes. The 
first run puts the results into a Dataframe `df1` running `sql_client.sql(sql)` 
directly. The second uses a crafted `req` object that adds the 
`useApproximateTopN` query context parameter to turn off approximation, storing 
the results in `df2`.\n",
+    "\n",
+    "It then runs a `compare` of `df2` against `df1` using `df3` and prints 
the results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "71db4746-4e8a-447e-aa58-f4c4ce3d7ffc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT\n",
+    "    \"Tail_Number\",\n",
+    "    COUNT(*) AS \"count\",\n",
+    "    SUM(Distance) AS \"distance\"\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Tail_Number\" IS NOT NULL\n",
+    "GROUP BY 1\n",
+    "ORDER BY 3 DESC\n",
+    "LIMIT 500\n",
+    "'''\n",
+    "\n",
+    "# Load the results into a pandas DataFrame\n",
+    "\n",
+    "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+    "\n",
+    "# Set up a sql_request to turn off TopN approximation\n",
+    "\n",
+    "req = sql_client.sql_request(sql)\n",
+    "req.add_context(\"useApproximateTopN\", \"false\")\n",
+    "resp = sql_client.sql_query(req)\n",
+    "\n",
+    "# Load the non-TopN results into a second pandas DataFrame using that 
request\n",
+    "\n",
+    "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+    "\n",
+    "# Load the compare of df1 to df2 into a new dataframe and print\n",
+    "\n",
+    "df3 = df1.compare(df2, keep_equal=True)\n",
+    "df3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78836242-acc8-4403-9e96-2177b96110ed",
+   "metadata": {},
+   "source": [
+    "You can see:\n",
+    "\n",
+    "* The `self` (df1) and `other` (df2) rank position of each `Tail_Number` 
in each position\n",
+    "* The self / other values for the calculated `count` and `distance`\n",
+    "\n",
+    "You may notice some `Tail_Number`s are in different positions depending 
on what the calculated `distance` is: certain data servers returned different 
sets of results, depending entirely on local data distribution. And some 
`Tail_Number`s may not appear in the list at all as they drop below the fold 
the cut-off applied to that specific process.\n",
+    "\n",
+    "Let's try this with a different dimension, 
`Flight_Number_Reporting_Airline`. The example dataset has more unique values, 
but the distribution is much flatter than `Tail_Number`. Run the following cell 
to see the count and a distribution plot."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a96f924c-9fc1-4000-9a54-7a951db5d2bb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT COUNT(DISTINCT \"Flight_Number_Reporting_Airline\") AS 
UniqueReportingAirlines\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Flight_Number_Reporting_Airline\" <> ''\n",
+    "'''\n",
+    "\n",
+    "display.sql(sql)\n",
+    "\n",
+    "sql = '''\n",
+    "SELECT \"Flight_Number_Reporting_Airline\", COUNT(*) AS Flights\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Flight_Number_Reporting_Airline\" <> ''\n",
+    "GROUP BY 1\n",
+    "ORDER BY 2 DESC\n",
+    "LIMIT 500\n",
+    "'''\n",
+    "\n",
+    "# Load the results into a pandas DataFrame\n",
+    "\n",
+    "df5 = pd.DataFrame(sql_client.sql(sql))\n",
+    "\n",
+    "# Display a plot\n",
+    "\n",
+    "df5.plot(x='Flight_Number_Reporting_Airline', y='Flights', kind=\"bar\", 
xticks=[])\n",
+    "plt.gca().get_legend().remove()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4866091d-e689-4209-8f6e-4edd526646e9",
+   "metadata": {},
+   "source": [
+    "This dimension, unlike `Tail_Number`, has a flatter distribution. Each 
data process is likely to have a flatter distribution of data, too, meaning the 
top ranking results are less prominent. The \"voting\" across the servers as to 
what is in the top is less clear.\n",
+    "\n",
+    "Run the following cell to repeat the same test we did before, creating 
two sets of results, and comparing them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "405f7a9b-ac13-4c13-8e30-42058df4cbce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT\n",
+    "    \"Flight_Number_Reporting_Airline\",\n",
+    "    AVG(\"Distance\") AS AverageDistance\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Flight_Number_Reporting_Airline\" IS NOT NULL\n",
+    "GROUP BY 1\n",
+    "ORDER BY 2 DESC\n",
+    "LIMIT 10\n",
+    "'''\n",
+    "\n",
+    "# Set up a sql_request to turn off TopN approximation\n",
+    "\n",
+    "req = sql_client.sql_request(sql)\n",
+    "req.add_context(\"useApproximateTopN\", \"false\")\n",
+    "resp = sql_client.sql_query(req)\n",
+    "\n",
+    "# Load two pandas DataFrames - one with the TopN and one with the 
non-TopN results\n",
+    "\n",
+    "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+    "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+    "\n",
+    "# Load the compare of df1 to df2 into a new dataframe and print\n",
+    "\n",
+    "df3 = df1.compare(df2, keep_equal=True)\n",
+    "df3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "989a3e1c-cc8a-47c9-ad15-0b95fa00c7a6",
+   "metadata": {},
+   "source": [
+    "Here, the flatter distribution exaggerates ranking and calculation error. 
Further issues are caused by the calculation being non-additive.\n",
+    "\n",
+    "The following cell contains a query that is a good example of TopN being 
applied: it creates a list of `Tail_Number`s within a particular period of 
time. Imagine that you might use this list to provide an interactive filter on 
`Tail_Number` to the end user when they're looking at this specific time 
period.\n",
+    "\n",
+    "Run the following cell to show the cardinality of `Tail_Number`s in that 
period, and then to plot the distribution:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d039c393-96f4-4847-ac60-4414477ebc3b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT COUNT (DISTINCT \"Tail_Number\") AS UniqueTailNumbers\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Tail_Number\" <> ''\n",
+    "AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP 
'2005-11-14')\n",
+    "'''\n",
+    "display.sql(sql)\n",
+    "\n",
+    "sql = '''\n",
+    "SELECT\n",
+    "    \"Tail_Number\",\n",
+    "    COUNT(*) AS \"Flights\"\n",
+    "FROM \"example-flights-topn\"\n",
+    "WHERE \"Tail_Number\" <> ''\n",
+    "AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP 
'2005-11-14')\n",
+    "GROUP BY 1\n",
+    "ORDER BY 2 DESC\n",
+    "LIMIT 500\n",
+    "'''\n",
+    "\n",
+    "df4 = pd.DataFrame(sql_client.sql(sql))\n",
+    "\n",
+    "df4.plot(x='Tail_Number', y='Flights', marker='o')\n",
+    "plt.xticks(rotation=45, ha='right')\n",
+    "plt.gca().get_legend().remove()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eeed8fa8-d1ce-41b2-955b-88fb0834ab36",
+   "metadata": {},
+   "source": [
+    "This distribution pattern is good for TopN - the highest ranking values 
are very prominent.\n",
+    "\n",
+    "Run the following cell to compare the two styles of execution:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d47d2017-1d89-4622-a42c-d86f29a774e1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT \"Tail_Number\", COUNT(*) AS \"count\", SUM(Distance) AS 
\"distance\"\n",
+    "    FROM \"example-flights-topn\"\n",
+    "    WHERE \"Tail_Number\" IS NOT NULL\n",
+    "    AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP 
'2005-11-14')\n",
+    "    GROUP BY 1\n",
+    "    ORDER BY 3 DESC\n",
+    "    LIMIT 500\n",
+    "'''\n",
+    "\n",
+    "req = sql_client.sql_request(sql)\n",
+    "req.add_context(\"useApproximateTopN\", \"false\")\n",
+    "resp = sql_client.sql_query(req)\n",
+    "\n",
+    "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+    "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+    "\n",
+    "df3 = df1.compare(df2, keep_equal=True)\n",
+    "df3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f58a1846-5072-4495-b840-a620de3c0442",
+   "metadata": {},
+   "source": [
+    "The distribution, together with our filters, means that these results are 
useful for this kind of interactive UI element."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b43cd060-429e-4e84-b559-ad63624e7c14",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "The speed boost we receive through TopN, at the expense of some accuracy, 
makes it useful for interactive elements like filters or initial lists of 
results that people will then deep dive into.\n",
+    "\n",
+    "* TopN is the default execution model for `GROUP BY` queries with one 
dimension, an `ORDER BY` and a `LIMIT` clause\n",
+    "* You can turn TopN off with a query context parameter\n",
+    "* Accuracy is highly dependent on distribution of the data, after filters 
etc., across the database"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d08b8804-1051-4d38-88e7-2be1776934eb",
+   "metadata": {},
+   "source": [
+    "## Learn more\n",
+    "\n",
+    "Read the following documentation for more information:\n",
+    "\n",
+    "* [TopN queries](https://druid.apache.org/docs/latest/querying/topnquery)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[druid] branch master updated: 202306-docs-notebook topn (#14478)

Reply via email to