petermarshallio commented on code in PR #14849:
URL: https://github.com/apache/druid/pull/14849#discussion_r1309767006
##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-dataDistribution.ipynb:
##########
@@ -0,0 +1,606 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "4188e2c5-4ab0-45e3-9371-52d510a97413",
+ "metadata": {},
+ "source": [
+ "# Analysing data distributions\n",
+ "\n",
+ "The Druid database provides functions to analyse large distributions
thanks to the Apache
[Datasketch](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html)
project's
[Quantiles](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html)
sketch, as well as
[t-digest](https://druid.apache.org/docs/latest/development/extensions-contrib/tdigestsketch-quantiles/),
[momentSketch](https://druid.apache.org/docs/latest/development/extensions-contrib/momentsketch-quantiles/),
and
[approxHistogram](https://druid.apache.org/docs/latest/development/extensions-core/approximate-histograms/).\n",
+ "\n",
+ "In this tutorial, work through some examples of the functions available
in the Apache Datasketches ezxtension, a core extension that enables estimation
of quantiles, ranks, and histograms."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "#### Run using Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ " \n",
+ "#### Run without using Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Apache Druid instance, with a `DRUID_HOST` local environment
variable containing the server name of your Druid router.\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file.\n",
+ "* [matplotlib](https://matplotlib.org/), a library for creating
visualizations in Python.\n",
+ "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation
tool."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d3317c51-627c-4a44-ad73-0860a5f4c937",
+ "metadata": {},
+ "source": [
+ "### Initialize Python\n",
+ "\n",
+ "Run the next cell to set up the Druid Python client's connection to
Apache Druid.\n",
+ "\n",
+ "If successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fee22cea-6176-4119-b4ea-24ed76cfa8fe",
+ "metadata": {},
+ "source": [
+ "### Load example data\n",
+ "\n",
+ "Once your Druid environment is up and running, ingest the sample data for
this tutorial.\n",
+ "\n",
+ "Run the following cell to create a table called
`example-flights-quantiles`. When completed, you'll see a description of the
final table.\n",
+ "\n",
+ "Monitor the ingestion task process in the Druid console."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "926a6366-61e5-4aaf-acab-ad90fbc5a994",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql='''\n",
+ "REPLACE INTO \"example-flights-quantiles\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip\"]}',\n",
+ " '{\"type\":\"csv\",\"findColumnsFromHeader\":true}'\n",
+ " )\n",
+ ") EXTEND (\"depaturetime\" VARCHAR, \"arrivalime\" VARCHAR, \"Year\"
BIGINT, \"Quarter\" BIGINT, \"Month\" BIGINT, \"DayofMonth\" BIGINT,
\"DayOfWeek\" BIGINT, \"FlightDate\" VARCHAR, \"Reporting_Airline\" VARCHAR,
\"DOT_ID_Reporting_Airline\" BIGINT, \"IATA_CODE_Reporting_Airline\" VARCHAR,
\"Tail_Number\" VARCHAR, \"Flight_Number_Reporting_Airline\" BIGINT,
\"OriginAirportID\" BIGINT, \"OriginAirportSeqID\" BIGINT,
\"OriginCityMarketID\" BIGINT, \"Origin\" VARCHAR, \"OriginCityName\" VARCHAR,
\"OriginState\" VARCHAR, \"OriginStateFips\" BIGINT, \"OriginStateName\"
VARCHAR, \"OriginWac\" BIGINT, \"DestAirportID\" BIGINT, \"DestAirportSeqID\"
BIGINT, \"DestCityMarketID\" BIGINT, \"Dest\" VARCHAR, \"DestCityName\"
VARCHAR, \"DestState\" VARCHAR, \"DestStateFips\" BIGINT, \"DestStateName\"
VARCHAR, \"DestWac\" BIGINT, \"CRSDepTime\" BIGINT, \"DepTime\" BIGINT,
\"DepDelay\" BIGINT, \"DepDelayMinutes\" BIGINT, \"DepDel15\" BIGINT,
\"DepartureDelayGroups\" BIGINT, \"DepTimeBlk\" VAR
CHAR, \"TaxiOut\" BIGINT, \"WheelsOff\" BIGINT, \"WheelsOn\" BIGINT,
\"TaxiIn\" BIGINT, \"CRSArrTime\" BIGINT, \"ArrTime\" BIGINT, \"ArrDelay\"
BIGINT, \"ArrDelayMinutes\" BIGINT, \"ArrDel15\" BIGINT, \"ArrivalDelayGroups\"
BIGINT, \"ArrTimeBlk\" VARCHAR, \"Cancelled\" BIGINT, \"CancellationCode\"
VARCHAR, \"Diverted\" BIGINT, \"CRSElapsedTime\" BIGINT, \"ActualElapsedTime\"
BIGINT, \"AirTime\" BIGINT, \"Flights\" BIGINT, \"Distance\" BIGINT,
\"DistanceGroup\" BIGINT, \"CarrierDelay\" BIGINT, \"WeatherDelay\" BIGINT,
\"NASDelay\" BIGINT, \"SecurityDelay\" BIGINT, \"LateAircraftDelay\" BIGINT,
\"FirstDepTime\" VARCHAR, \"TotalAddGTime\" VARCHAR, \"LongestAddGTime\"
VARCHAR, \"DivAirportLandings\" VARCHAR, \"DivReachedDest\" VARCHAR,
\"DivActualElapsedTime\" VARCHAR, \"DivArrDelay\" VARCHAR, \"DivDistance\"
VARCHAR, \"Div1Airport\" VARCHAR, \"Div1AirportID\" VARCHAR,
\"Div1AirportSeqID\" VARCHAR, \"Div1WheelsOn\" VARCHAR, \"Div1TotalGTime\"
VARCHAR, \"Div1LongestGTime\" VARCHAR, \"Div
1WheelsOff\" VARCHAR, \"Div1TailNum\" VARCHAR, \"Div2Airport\" VARCHAR,
\"Div2AirportID\" VARCHAR, \"Div2AirportSeqID\" VARCHAR, \"Div2WheelsOn\"
VARCHAR, \"Div2TotalGTime\" VARCHAR, \"Div2LongestGTime\" VARCHAR,
\"Div2WheelsOff\" VARCHAR, \"Div2TailNum\" VARCHAR, \"Div3Airport\" VARCHAR,
\"Div3AirportID\" VARCHAR, \"Div3AirportSeqID\" VARCHAR, \"Div3WheelsOn\"
VARCHAR, \"Div3TotalGTime\" VARCHAR, \"Div3LongestGTime\" VARCHAR,
\"Div3WheelsOff\" VARCHAR, \"Div3TailNum\" VARCHAR, \"Div4Airport\" VARCHAR,
\"Div4AirportID\" VARCHAR, \"Div4AirportSeqID\" VARCHAR, \"Div4WheelsOn\"
VARCHAR, \"Div4TotalGTime\" VARCHAR, \"Div4LongestGTime\" VARCHAR,
\"Div4WheelsOff\" VARCHAR, \"Div4TailNum\" VARCHAR, \"Div5Airport\" VARCHAR,
\"Div5AirportID\" VARCHAR, \"Div5AirportSeqID\" VARCHAR, \"Div5WheelsOn\"
VARCHAR, \"Div5TotalGTime\" VARCHAR, \"Div5LongestGTime\" VARCHAR,
\"Div5WheelsOff\" VARCHAR, \"Div5TailNum\" VARCHAR, \"Unnamed: 109\"
VARCHAR))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"depaturetime\") AS \"__time\",\n",
+ " \"Reporting_Airline\",\n",
+ " \"Tail_Number\",\n",
+ " \"Distance\",\n",
+ " \"Origin\",\n",
+ " DepDelayMinutes + ArrDelayMinutes AS \"Delay\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready('example-flights-quantiles')\n",
+ "display.table('example-flights-quantiles')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6f0d1f7-2f34-44fe-9912-4017ed10893c",
+ "metadata": {},
+ "source": [
+ "Finally, run the following cell to import additional Python modules that
you will use."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4cfd50d-5ef9-4fd1-8077-bf03758c35df",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "import matplotlib\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f687b0a-517a-4df3-a7cb-b7095a11bb4b",
+ "metadata": {},
+ "source": [
+ "## Understanding the Quantiles sketch\n",
+ "\n",
+ "A structure known as a \"sketch\" holds a representation of the
distribution of values that we will use throughout this notebook. These are
created by using the
[`DS_QUANTILES_SKETCH`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_quantiles_sketch)
function, providing it with the data set that we want to create the
representation of.\n",
+ "\n",
+ "To see how quantiles sketches look, run the following cell. It uses
`DS_QUANTILES_SKETCH` to create a Quantiles sketch for both `Distance` and
`Delay`."
Review Comment:
Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]