This is an automated email from the ASF dual-hosted git repository.
techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new f585f0a8ed 202306-docs-notebook topn (#14478)
f585f0a8ed is described below
commit f585f0a8ed96b03140bbbcff8ec01283bafaa2f5
Author: Peter Marshall <[email protected]>
AuthorDate: Wed Aug 16 22:50:49 2023 +0100
202306-docs-notebook topn (#14478)
Co-authored-by: Charles Smith <[email protected]>
---
.../notebooks/03-query/02-approxRanking.ipynb | 596 +++++++++++++++++++++
1 file changed, 596 insertions(+)
diff --git
a/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
b/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
new file mode 100644
index 0000000000..e49ac92e84
--- /dev/null
+++
b/examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb
@@ -0,0 +1,596 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using TopN approximation in Druid queries\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "Imagine you’re building a dynamic filter in your app: you want to
populate it with, say, the top most popular (COUNT) dimension values in
descending order (ORDER BY). Druid speeds up this type of query using TopN
approximation by default. In this tutorial, work through some examples and see
the effect of turning approximation off."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c94ff5c9-ada9-4f1d-8541-649e70cfc9a3",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run without using Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Apache Druid instance, with a `DRUID_HOST` local environment
variable containing the server name of your Druid router.\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file.\n",
+ "* [matplotlib](https://matplotlib.org/), a library for creating
visualizations in Python.\n",
+ "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation
tool."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6b56cfc-9951-4a4e-a3f4-828e2dd5b3b5",
+ "metadata": {},
+ "source": [
+ "### Initialize Python\n",
+ "\n",
+ "Run the next cell to set up the Druid Python client's connection to
Apache Druid.\n",
+ "\n",
+ "If successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "685b872e-0d59-4100-a636-39ec93c627fb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d6fe9c99-ee0d-4205-9ca5-a8810c977335",
+ "metadata": {},
+ "source": [
+ "### Load example data\n",
+ "\n",
+ "Once your Druid environment is up and running, ingest the sample data for
this tutorial.\n",
+ "\n",
+ "Run the following cell to create a table called `example-flights-topn`.
When completed, you'll see a description of the final table.\n",
+ "\n",
+ "Monitor the ingestion task process in the Druid console."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e89a3000-a65e-4c4a-a917-3c37cbe975b3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-flights-topn\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip\"]}',\n",
+ " '{\"type\":\"csv\",\"findColumnsFromHeader\":true}'\n",
+ " )\n",
+ ") EXTEND (\"depaturetime\" VARCHAR, \"arrivalime\" VARCHAR, \"Year\"
BIGINT, \"Quarter\" BIGINT, \"Month\" BIGINT, \"DayofMonth\" BIGINT,
\"DayOfWeek\" BIGINT, \"FlightDate\" VARCHAR, \"Reporting_Airline\" VARCHAR,
\"DOT_ID_Reporting_Airline\" BIGINT, \"IATA_CODE_Reporting_Airline\" VARCHAR,
\"Tail_Number\" VARCHAR, \"Flight_Number_Reporting_Airline\" BIGINT,
\"OriginAirportID\" BIGINT, \"OriginAirportSeqID\" BIGINT,
\"OriginCityMarketID\" BIGINT, \"Origin\" VARCHAR, \"OriginCityNam [...]
+ "SELECT\n",
+ " TIME_PARSE(\"depaturetime\") AS \"__time\",\n",
+ " \"Reporting_Airline\",\n",
+ " \"Tail_Number\",\n",
+ " \"Distance\",\n",
+ " \"Origin\",\n",
+ " \"Flight_Number_Reporting_Airline\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-flights-topn')\n",
+ "display.table('example-flights-topn')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "00141575-29b4-440e-b23f-f7c6b237ef28",
+ "metadata": {},
+ "source": [
+ "When this is completed, run the following cell to load some Python
libraries we need to explore what TopN does."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Example TopN style queries\n",
+ "\n",
+ "Druid looks for patterns in `SELECT` statements to determine if they
would benefit from using approximation. A ranking query, like the following,
matches the rules for TopN approximation, so Druid enables it by default.\n",
+ "\n",
+ "For Druid to automatically optimize for TopN, you need an SQL statement
that has:\n",
+ "* A GROUP BY on one dimension, and\n",
+ "* an ORDER BY on one aggregate.\n",
+ "\n",
+ "Run this query to see what the results are like:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Reporting_Airline\",\n",
+ " COUNT(*) AS Flights,\n",
+ " SUM(\"Distance\") AS SumDistance\n",
+ "FROM\n",
+ " \"example-flights-topn\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 10\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5600fc48-c999-406f-800b-3f0f6a973aa0",
+ "metadata": {},
+ "source": [
+ "Run the following cell, which uses the `explain_sql` method to show the
[`EXPLAIN
PLAN`](https://druid.apache.org/docs/latest/querying/sql-translation#interpreting-explain-plan-output)
for this query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7595eec0-a709-4cd6-985e-eec8a6e37b61",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']),
indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8658b26e-2f09-4a97-96e8-589168130559",
+ "metadata": {},
+ "source": [
+ "The plan `queryType` is `topN`, showing that TopN approximation was
used.\n",
+ "\n",
+ "Druid applied a `LIMIT` operation on the results calculated by each data
service involved in the query, improving processing efficiency by minimizing
the amount of data transferred to the Broker.\n",
+ "\n",
+ "This
[pushed-down](https://druid.apache.org/docs/latest/querying/groupbyquery#limit-pushdown-optimization)
`LIMIT` is the `max` of the `threshold` in the plan (which came from the
`LIMIT` in the SQL) and the
[`minTopNThreshold`](https://druid.apache.org/docs/latest/querying/topnquery.html#aliasing)
setting in your cluster - the default being 1,000.\n",
+ "\n",
+ "To see the implication of this `LIMIT` in action, the cardinality of the
`GROUP BY` dimension therefore needs to exceed this cap.\n",
+ "\n",
+ "Run the following query to discover the cardinality of the `GROUP BY` on
`Reporting_Airline`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "65a968e5-d51e-47e9-af04-88181f3b865b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT COUNT (DISTINCT \"Reporting_Airline\") AS
UniqueReportingAirlines\n",
+ "FROM \"example-flights-topn\"\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "13c8a101-dcba-49be-8d05-0a5dbd9731ca",
+ "metadata": {},
+ "source": [
+ "The number of unique values is below the `LIMIT` cap, meaning, there is
no trimming and the results are not approximate; all the data servers will
return all their results, without trimming, to be merged and passed back to
us.\n",
+ "\n",
+ "What is the cardinality for the `Tail_Number` dimension?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "71c1816b-f090-4a3d-b476-8d40eb9c2dec",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " COUNT (DISTINCT \"Tail_Number\") AS UniqueTailNumbers\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Tail_Number\" <> ''\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "72a1a858-bda8-464e-988b-c4ed80b63f43",
+ "metadata": {},
+ "source": [
+ "With this many distinct values to `GROUP BY`, the services involved in
the query will trim their results when using the\n",
+ "`topN` engine.\n",
+ "\n",
+ "Run the next query to visualize the distribution of unique `Tail_Number`s
in the example dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "208f1463-34dd-4b0e-aa78-e582e2133a8f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Tail_Number\",\n",
+ " COUNT(*) AS RecordCount\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Tail_Number\" <> ''\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 500\n",
+ "'''\n",
+ "\n",
+ "df4 = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "df4.plot(x='Tail_Number', y='RecordCount', marker='o')\n",
+ "plt.xticks(rotation=45, ha='right')\n",
+ "plt.gca().get_legend().remove()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b16d9a4d-c9e7-447b-8cdb-7ad1c0f88d73",
+ "metadata": {},
+ "source": [
+ "The plot shows that we have a long tail distribution, meaning there is a
high likelihood the same `Tail_Number` will be in rank position one across the
data set, and therefore across all segments. The flatter the distribution, the
less reliable this assertion is.\n",
+ "\n",
+ "Take a look at the following cell to see a query that counts the number
of records and sums total distance for each `Tail_Number`.\n",
+ "\n",
+ "Run the cell to execute this query in both TopN and non-TopN modes. The
first run puts the results into a Dataframe `df1` running `sql_client.sql(sql)`
directly. The second uses a crafted `req` object that adds the
`useApproximateTopN` query context parameter to turn off approximation, storing
the results in `df2`.\n",
+ "\n",
+ "It then runs a `compare` of `df2` against `df1` using `df3` and prints
the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "71db4746-4e8a-447e-aa58-f4c4ce3d7ffc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Tail_Number\",\n",
+ " COUNT(*) AS \"count\",\n",
+ " SUM(Distance) AS \"distance\"\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Tail_Number\" IS NOT NULL\n",
+ "GROUP BY 1\n",
+ "ORDER BY 3 DESC\n",
+ "LIMIT 500\n",
+ "'''\n",
+ "\n",
+ "# Load the results into a pandas DataFrame\n",
+ "\n",
+ "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "# Set up a sql_request to turn off TopN approximation\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateTopN\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "# Load the non-TopN results into a second pandas DataFrame using that
request\n",
+ "\n",
+ "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+ "\n",
+ "# Load the compare of df1 to df2 into a new dataframe and print\n",
+ "\n",
+ "df3 = df1.compare(df2, keep_equal=True)\n",
+ "df3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78836242-acc8-4403-9e96-2177b96110ed",
+ "metadata": {},
+ "source": [
+ "You can see:\n",
+ "\n",
+ "* The `self` (df1) and `other` (df2) rank position of each `Tail_Number`
in each position\n",
+ "* The self / other values for the calculated `count` and `distance`\n",
+ "\n",
+ "You may notice some `Tail_Number`s are in different positions depending
on what the calculated `distance` is: certain data servers returned different
sets of results, depending entirely on local data distribution. And some
`Tail_Number`s may not appear in the list at all as they drop below the fold
the cut-off applied to that specific process.\n",
+ "\n",
+ "Let's try this with a different dimension,
`Flight_Number_Reporting_Airline`. The example dataset has more unique values,
but the distribution is much flatter than `Tail_Number`. Run the following cell
to see the count and a distribution plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a96f924c-9fc1-4000-9a54-7a951db5d2bb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT COUNT(DISTINCT \"Flight_Number_Reporting_Airline\") AS
UniqueReportingAirlines\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Flight_Number_Reporting_Airline\" <> ''\n",
+ "'''\n",
+ "\n",
+ "display.sql(sql)\n",
+ "\n",
+ "sql = '''\n",
+ "SELECT \"Flight_Number_Reporting_Airline\", COUNT(*) AS Flights\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Flight_Number_Reporting_Airline\" <> ''\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 500\n",
+ "'''\n",
+ "\n",
+ "# Load the results into a pandas DataFrame\n",
+ "\n",
+ "df5 = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "# Display a plot\n",
+ "\n",
+ "df5.plot(x='Flight_Number_Reporting_Airline', y='Flights', kind=\"bar\",
xticks=[])\n",
+ "plt.gca().get_legend().remove()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4866091d-e689-4209-8f6e-4edd526646e9",
+ "metadata": {},
+ "source": [
+ "This dimension, unlike `Tail_Number`, has a flatter distribution. Each
data process is likely to have a flatter distribution of data, too, meaning the
top ranking results are less prominent. The \"voting\" across the servers as to
what is in the top is less clear.\n",
+ "\n",
+ "Run the following cell to repeat the same test we did before, creating
two sets of results, and comparing them."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "405f7a9b-ac13-4c13-8e30-42058df4cbce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Flight_Number_Reporting_Airline\",\n",
+ " AVG(\"Distance\") AS AverageDistance\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Flight_Number_Reporting_Airline\" IS NOT NULL\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 10\n",
+ "'''\n",
+ "\n",
+ "# Set up a sql_request to turn off TopN approximation\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateTopN\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "# Load two pandas DataFrames - one with the TopN and one with the
non-TopN results\n",
+ "\n",
+ "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+ "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+ "\n",
+ "# Load the compare of df1 to df2 into a new dataframe and print\n",
+ "\n",
+ "df3 = df1.compare(df2, keep_equal=True)\n",
+ "df3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "989a3e1c-cc8a-47c9-ad15-0b95fa00c7a6",
+ "metadata": {},
+ "source": [
+ "Here, the flatter distribution exaggerates ranking and calculation error.
Further issues are caused by the calculation being non-additive.\n",
+ "\n",
+ "The following cell contains a query that is a good example of TopN being
applied: it creates a list of `Tail_Number`s within a particular period of
time. Imagine that you might use this list to provide an interactive filter on
`Tail_Number` to the end user when they're looking at this specific time
period.\n",
+ "\n",
+ "Run the following cell to show the cardinality of `Tail_Number`s in that
period, and then to plot the distribution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d039c393-96f4-4847-ac60-4414477ebc3b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT COUNT (DISTINCT \"Tail_Number\") AS UniqueTailNumbers\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Tail_Number\" <> ''\n",
+ "AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP
'2005-11-14')\n",
+ "'''\n",
+ "display.sql(sql)\n",
+ "\n",
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Tail_Number\",\n",
+ " COUNT(*) AS \"Flights\"\n",
+ "FROM \"example-flights-topn\"\n",
+ "WHERE \"Tail_Number\" <> ''\n",
+ "AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP
'2005-11-14')\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 500\n",
+ "'''\n",
+ "\n",
+ "df4 = pd.DataFrame(sql_client.sql(sql))\n",
+ "\n",
+ "df4.plot(x='Tail_Number', y='Flights', marker='o')\n",
+ "plt.xticks(rotation=45, ha='right')\n",
+ "plt.gca().get_legend().remove()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "eeed8fa8-d1ce-41b2-955b-88fb0834ab36",
+ "metadata": {},
+ "source": [
+ "This distribution pattern is good for TopN - the highest ranking values
are very prominent.\n",
+ "\n",
+ "Run the following cell to compare the two styles of execution:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d47d2017-1d89-4622-a42c-d86f29a774e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT \"Tail_Number\", COUNT(*) AS \"count\", SUM(Distance) AS
\"distance\"\n",
+ " FROM \"example-flights-topn\"\n",
+ " WHERE \"Tail_Number\" IS NOT NULL\n",
+ " AND (TIMESTAMP '2005-11-01' <= \"__time\" AND \"__time\" <= TIMESTAMP
'2005-11-14')\n",
+ " GROUP BY 1\n",
+ " ORDER BY 3 DESC\n",
+ " LIMIT 500\n",
+ "'''\n",
+ "\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_context(\"useApproximateTopN\", \"false\")\n",
+ "resp = sql_client.sql_query(req)\n",
+ "\n",
+ "df1 = pd.DataFrame(sql_client.sql(sql))\n",
+ "df2 = pd.DataFrame(sql_client.sql_query(req).rows)\n",
+ "\n",
+ "df3 = df1.compare(df2, keep_equal=True)\n",
+ "df3"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f58a1846-5072-4495-b840-a620de3c0442",
+ "metadata": {},
+ "source": [
+ "The distribution, together with our filters, means that these results are
useful for this kind of interactive UI element."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b43cd060-429e-4e84-b559-ad63624e7c14",
+ "metadata": {},
+ "source": [
+ "## Summary\n",
+ "\n",
+ "The speed boost we receive through TopN, at the expense of some accuracy,
makes it useful for interactive elements like filters or initial lists of
results that people will then deep dive into.\n",
+ "\n",
+ "* TopN is the default execution model for `GROUP BY` queries with one
dimension, an `ORDER BY` and a `LIMIT` clause\n",
+ "* You can turn TopN off with a query context parameter\n",
+ "* Accuracy is highly dependent on distribution of the data, after filters
etc., across the database"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d08b8804-1051-4d38-88e7-2be1776934eb",
+ "metadata": {},
+ "source": [
+ "## Learn more\n",
+ "\n",
+ "Read the following documentation for more information:\n",
+ "\n",
+ "* [TopN queries](https://druid.apache.org/docs/latest/querying/topnquery)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]