petermarshallio commented on code in PR #14478:
URL: https://github.com/apache/druid/pull/14478#discussion_r1295505318
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/02-approxRanking.ipynb:
##########
@@ -0,0 +1,577 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "# Using TopN approximation in Druid queries\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "Imagine you’re building a dynamic filter in your app: you want to
populate it with, say, the top most popular (COUNT) dimension values in
descending order (ORDER BY). Druid speeds up this type of query using TopN
approximation by default. In this tutorial, work through some examples and see
the effect of turning approximation off."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c94ff5c9-ada9-4f1d-8541-649e70cfc9a3",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run without using Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Apache Druid instance, with a `DRUID_HOST` local environment
variable containing the server name of your Druid router.\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file.\n",
+ "* [matplotlib](https://matplotlib.org/), a library for creating
visualizations in Python.\n",
+ "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation
tool."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e6b56cfc-9951-4a4e-a3f4-828e2dd5b3b5",
+ "metadata": {},
+ "source": [
+ "### Initialize Python\n",
+ "\n",
+ "Run the next cell to set up the Druid Python client's connection to
Apache Druid.\n",
+ "\n",
+ "If successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "685b872e-0d59-4100-a636-39ec93c627fb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d6fe9c99-ee0d-4205-9ca5-a8810c977335",
+ "metadata": {},
+ "source": [
+ "### Load example data\n",
+ "\n",
+ "Once your Druid environment is up and running, ingest the sample data for
this tutorial.\n",
+ "\n",
+ "Run the following cell to create a table called `example-flights-topn`.
When completed, you'll see a description of the final table.\n",
+ "\n",
+ "Monitor the ingestion task process in the Druid console."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e89a3000-a65e-4c4a-a917-3c37cbe975b3",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-flights-topn\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip\"]}',\n",
+ " '{\"type\":\"csv\",\"findColumnsFromHeader\":true}'\n",
+ " )\n",
+ ") EXTEND (\"depaturetime\" VARCHAR, \"arrivalime\" VARCHAR, \"Year\"
BIGINT, \"Quarter\" BIGINT, \"Month\" BIGINT, \"DayofMonth\" BIGINT,
\"DayOfWeek\" BIGINT, \"FlightDate\" VARCHAR, \"Reporting_Airline\" VARCHAR,
\"DOT_ID_Reporting_Airline\" BIGINT, \"IATA_CODE_Reporting_Airline\" VARCHAR,
\"Tail_Number\" VARCHAR, \"Flight_Number_Reporting_Airline\" BIGINT,
\"OriginAirportID\" BIGINT, \"OriginAirportSeqID\" BIGINT,
\"OriginCityMarketID\" BIGINT, \"Origin\" VARCHAR, \"OriginCityName\" VARCHAR,
\"OriginState\" VARCHAR, \"OriginStateFips\" BIGINT, \"OriginStateName\"
VARCHAR, \"OriginWac\" BIGINT, \"DestAirportID\" BIGINT, \"DestAirportSeqID\"
BIGINT, \"DestCityMarketID\" BIGINT, \"Dest\" VARCHAR, \"DestCityName\"
VARCHAR, \"DestState\" VARCHAR, \"DestStateFips\" BIGINT, \"DestStateName\"
VARCHAR, \"DestWac\" BIGINT, \"CRSDepTime\" BIGINT, \"DepTime\" BIGINT,
\"DepDelay\" BIGINT, \"DepDelayMinutes\" BIGINT, \"DepDel15\" BIGINT,
\"DepartureDelayGroups\" BIGINT, \"DepTimeBlk\" VAR
CHAR, \"TaxiOut\" BIGINT, \"WheelsOff\" BIGINT, \"WheelsOn\" BIGINT,
\"TaxiIn\" BIGINT, \"CRSArrTime\" BIGINT, \"ArrTime\" BIGINT, \"ArrDelay\"
BIGINT, \"ArrDelayMinutes\" BIGINT, \"ArrDel15\" BIGINT, \"ArrivalDelayGroups\"
BIGINT, \"ArrTimeBlk\" VARCHAR, \"Cancelled\" BIGINT, \"CancellationCode\"
VARCHAR, \"Diverted\" BIGINT, \"CRSElapsedTime\" BIGINT, \"ActualElapsedTime\"
BIGINT, \"AirTime\" BIGINT, \"Flights\" BIGINT, \"Distance\" BIGINT,
\"DistanceGroup\" BIGINT, \"CarrierDelay\" BIGINT, \"WeatherDelay\" BIGINT,
\"NASDelay\" BIGINT, \"SecurityDelay\" BIGINT, \"LateAircraftDelay\" BIGINT,
\"FirstDepTime\" VARCHAR, \"TotalAddGTime\" VARCHAR, \"LongestAddGTime\"
VARCHAR, \"DivAirportLandings\" VARCHAR, \"DivReachedDest\" VARCHAR,
\"DivActualElapsedTime\" VARCHAR, \"DivArrDelay\" VARCHAR, \"DivDistance\"
VARCHAR, \"Div1Airport\" VARCHAR, \"Div1AirportID\" VARCHAR,
\"Div1AirportSeqID\" VARCHAR, \"Div1WheelsOn\" VARCHAR, \"Div1TotalGTime\"
VARCHAR, \"Div1LongestGTime\" VARCHAR, \"Div
1WheelsOff\" VARCHAR, \"Div1TailNum\" VARCHAR, \"Div2Airport\" VARCHAR,
\"Div2AirportID\" VARCHAR, \"Div2AirportSeqID\" VARCHAR, \"Div2WheelsOn\"
VARCHAR, \"Div2TotalGTime\" VARCHAR, \"Div2LongestGTime\" VARCHAR,
\"Div2WheelsOff\" VARCHAR, \"Div2TailNum\" VARCHAR, \"Div3Airport\" VARCHAR,
\"Div3AirportID\" VARCHAR, \"Div3AirportSeqID\" VARCHAR, \"Div3WheelsOn\"
VARCHAR, \"Div3TotalGTime\" VARCHAR, \"Div3LongestGTime\" VARCHAR,
\"Div3WheelsOff\" VARCHAR, \"Div3TailNum\" VARCHAR, \"Div4Airport\" VARCHAR,
\"Div4AirportID\" VARCHAR, \"Div4AirportSeqID\" VARCHAR, \"Div4WheelsOn\"
VARCHAR, \"Div4TotalGTime\" VARCHAR, \"Div4LongestGTime\" VARCHAR,
\"Div4WheelsOff\" VARCHAR, \"Div4TailNum\" VARCHAR, \"Div5Airport\" VARCHAR,
\"Div5AirportID\" VARCHAR, \"Div5AirportSeqID\" VARCHAR, \"Div5WheelsOn\"
VARCHAR, \"Div5TotalGTime\" VARCHAR, \"Div5LongestGTime\" VARCHAR,
\"Div5WheelsOff\" VARCHAR, \"Div5TailNum\" VARCHAR, \"Unnamed: 109\"
VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"depaturetime\") AS \"__time\",\n",
+ " \"Reporting_Airline\",\n",
+ " \"Tail_Number\",\n",
+ " \"Distance\",\n",
+ " \"Origin\",\n",
+ " \"Flight_Number_Reporting_Airline\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-flights-topn')\n",
+ "display.table('example-flights-topn')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "00141575-29b4-440e-b23f-f7c6b237ef28",
+ "metadata": {},
+ "source": [
+ "When this is completed, run the following cell for the final part of the
initialization. This will provide us some methods to call as we explore what
TopN does."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f388633f-195b-4381-98cc-7a2f80f48690",
+ "metadata": {},
+ "source": [
+ "## Example TopN style queries\n",
+ "\n",
+ "Druid looks for patterns in `SELECT` statements to determine if they
would benefit from using approximation. A ranking query, like the one below,
matches the rules for TopN approximation, so Druid enables it by default.\n",
+ "\n",
+ "To see this happen, we need an SQL statement that has:\n",
+ "* A GROUP BY on one dimension, and\n",
+ "* an ORDER BY on one aggregate.\n",
+ "\n",
+ "Run this query to see what the results are like:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b76e5184-9fe4-4f21-a471-4e15d16515c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT\n",
+ " \"Reporting_Airline\",\n",
+ " COUNT(*) AS Flights,\n",
+ " SUM(\"Distance\") AS SumDistance\n",
+ "FROM\n",
+ " \"example-flights-topn\"\n",
+ "GROUP BY 1\n",
+ "ORDER BY 2 DESC\n",
+ "LIMIT 10\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5600fc48-c999-406f-800b-3f0f6a973aa0",
+ "metadata": {},
+ "source": [
+ "You can use `EXPLAIN PLAN FOR` or the `explain_sql` method to see whether
Druid used TopN approximation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7595eec0-a709-4cd6-985e-eec8a6e37b61",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(json.dumps(json.loads(sql_client.explain_sql(sql)['PLAN']),
indent=2))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8658b26e-2f09-4a97-96e8-589168130559",
+ "metadata": {},
+ "source": [
+ "You know approximation is used when the `queryType` is `topN`.\n",
+ "\n",
+ "For these queries, Druid applies a `LIMIT` operation on the results
calculated by each server that’s been called upon to answer the query, not just
on the final result set. This sees less data bubbling up from each process to
be merged overall, and therefore greater efficiency in execution.\n",
Review Comment:
Committing this change - will update in a future commit.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]