petermarshallio commented on code in PR #14849:
URL: https://github.com/apache/druid/pull/14849#discussion_r1309767006


##########
examples/quickstart/jupyter-notebooks/notebooks/03-query/03-dataDistribution.ipynb:
##########
@@ -0,0 +1,606 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "4188e2c5-4ab0-45e3-9371-52d510a97413",
+   "metadata": {},
+   "source": [
+    "# Analysing data distributions\n",
+    "\n",
+    "The Druid database provides functions to analyse large distributions 
thanks to the Apache 
[Datasketch](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html)
 project's 
[Quantiles](https://datasketches.apache.org/docs/Quantiles/QuantilesOverview.html)
 sketch, as well as 
[t-digest](https://druid.apache.org/docs/latest/development/extensions-contrib/tdigestsketch-quantiles/),
 
[momentSketch](https://druid.apache.org/docs/latest/development/extensions-contrib/momentsketch-quantiles/),
 and 
[approxHistogram](https://druid.apache.org/docs/latest/development/extensions-core/approximate-histograms/).\n",
+    "\n",
+    "In this tutorial, work through some examples of the functions available 
in the Apache Datasketches ezxtension, a core extension that enables estimation 
of quantiles, ranks, and histograms."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "557e06e8-9b35-4b34-8322-8a8ede6de709",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 26.0.0 or later.\n",
+    "\n",
+    "#### Run using Docker\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter` 
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more 
information, see [Docker for Jupyter Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "   \n",
+    "#### Run without using Docker\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Apache Druid instance, with a `DRUID_HOST` local environment 
variable containing the server name of your Druid router.\n",
+    "* 
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
 a Python client for Apache Druid. Follow the instructions in the Install 
section of the README file.\n",
+    "* [matplotlib](https://matplotlib.org/), a library for creating 
visualizations in Python.\n",
+    "* [pandas](https://pandas.pydata.org/), a data analysis and manipulation 
tool."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3317c51-627c-4a44-ad73-0860a5f4c937",
+   "metadata": {},
+   "source": [
+    "### Initialize Python\n",
+    "\n",
+    "Run the next cell to set up the Druid Python client's connection to 
Apache Druid.\n",
+    "\n",
+    "If successful, the Druid version number will be shown in the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2a19226-6abc-436d-ac3c-9c04d6026707",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "if 'DRUID_HOST' not in os.environ.keys():\n",
+    "    druid_host=f\"http://localhost:8888\"\n";,
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "    \n",
+    "print(f\"Opening a connection to {druid_host}.\")\n",
+    "druid = druidapi.jupyter_client(druid_host)\n",
+    "\n",
+    "display = druid.display\n",
+    "sql_client = druid.sql\n",
+    "status_client = druid.status\n",
+    "\n",
+    "status_client.version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fee22cea-6176-4119-b4ea-24ed76cfa8fe",
+   "metadata": {},
+   "source": [
+    "### Load example data\n",
+    "\n",
+    "Once your Druid environment is up and running, ingest the sample data for 
this tutorial.\n",
+    "\n",
+    "Run the following cell to create a table called 
`example-flights-quantiles`.  When completed, you'll see a description of the 
final table.\n",
+    "\n",
+    "Monitor the ingestion task process in the Druid console."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "926a6366-61e5-4aaf-acab-ad90fbc5a994",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql='''\n",
+    "REPLACE INTO \"example-flights-quantiles\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/flight_on_time/flights/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2005_11.csv.zip\"]}',\n",
+    "    '{\"type\":\"csv\",\"findColumnsFromHeader\":true}'\n",
+    "  )\n",
+    ") EXTEND (\"depaturetime\" VARCHAR, \"arrivalime\" VARCHAR, \"Year\" 
BIGINT, \"Quarter\" BIGINT, \"Month\" BIGINT, \"DayofMonth\" BIGINT, 
\"DayOfWeek\" BIGINT, \"FlightDate\" VARCHAR, \"Reporting_Airline\" VARCHAR, 
\"DOT_ID_Reporting_Airline\" BIGINT, \"IATA_CODE_Reporting_Airline\" VARCHAR, 
\"Tail_Number\" VARCHAR, \"Flight_Number_Reporting_Airline\" BIGINT, 
\"OriginAirportID\" BIGINT, \"OriginAirportSeqID\" BIGINT, 
\"OriginCityMarketID\" BIGINT, \"Origin\" VARCHAR, \"OriginCityName\" VARCHAR, 
\"OriginState\" VARCHAR, \"OriginStateFips\" BIGINT, \"OriginStateName\" 
VARCHAR, \"OriginWac\" BIGINT, \"DestAirportID\" BIGINT, \"DestAirportSeqID\" 
BIGINT, \"DestCityMarketID\" BIGINT, \"Dest\" VARCHAR, \"DestCityName\" 
VARCHAR, \"DestState\" VARCHAR, \"DestStateFips\" BIGINT, \"DestStateName\" 
VARCHAR, \"DestWac\" BIGINT, \"CRSDepTime\" BIGINT, \"DepTime\" BIGINT, 
\"DepDelay\" BIGINT, \"DepDelayMinutes\" BIGINT, \"DepDel15\" BIGINT, 
\"DepartureDelayGroups\" BIGINT, \"DepTimeBlk\" VAR
 CHAR, \"TaxiOut\" BIGINT, \"WheelsOff\" BIGINT, \"WheelsOn\" BIGINT, 
\"TaxiIn\" BIGINT, \"CRSArrTime\" BIGINT, \"ArrTime\" BIGINT, \"ArrDelay\" 
BIGINT, \"ArrDelayMinutes\" BIGINT, \"ArrDel15\" BIGINT, \"ArrivalDelayGroups\" 
BIGINT, \"ArrTimeBlk\" VARCHAR, \"Cancelled\" BIGINT, \"CancellationCode\" 
VARCHAR, \"Diverted\" BIGINT, \"CRSElapsedTime\" BIGINT, \"ActualElapsedTime\" 
BIGINT, \"AirTime\" BIGINT, \"Flights\" BIGINT, \"Distance\" BIGINT, 
\"DistanceGroup\" BIGINT, \"CarrierDelay\" BIGINT, \"WeatherDelay\" BIGINT, 
\"NASDelay\" BIGINT, \"SecurityDelay\" BIGINT, \"LateAircraftDelay\" BIGINT, 
\"FirstDepTime\" VARCHAR, \"TotalAddGTime\" VARCHAR, \"LongestAddGTime\" 
VARCHAR, \"DivAirportLandings\" VARCHAR, \"DivReachedDest\" VARCHAR, 
\"DivActualElapsedTime\" VARCHAR, \"DivArrDelay\" VARCHAR, \"DivDistance\" 
VARCHAR, \"Div1Airport\" VARCHAR, \"Div1AirportID\" VARCHAR, 
\"Div1AirportSeqID\" VARCHAR, \"Div1WheelsOn\" VARCHAR, \"Div1TotalGTime\" 
VARCHAR, \"Div1LongestGTime\" VARCHAR, \"Div
 1WheelsOff\" VARCHAR, \"Div1TailNum\" VARCHAR, \"Div2Airport\" VARCHAR, 
\"Div2AirportID\" VARCHAR, \"Div2AirportSeqID\" VARCHAR, \"Div2WheelsOn\" 
VARCHAR, \"Div2TotalGTime\" VARCHAR, \"Div2LongestGTime\" VARCHAR, 
\"Div2WheelsOff\" VARCHAR, \"Div2TailNum\" VARCHAR, \"Div3Airport\" VARCHAR, 
\"Div3AirportID\" VARCHAR, \"Div3AirportSeqID\" VARCHAR, \"Div3WheelsOn\" 
VARCHAR, \"Div3TotalGTime\" VARCHAR, \"Div3LongestGTime\" VARCHAR, 
\"Div3WheelsOff\" VARCHAR, \"Div3TailNum\" VARCHAR, \"Div4Airport\" VARCHAR, 
\"Div4AirportID\" VARCHAR, \"Div4AirportSeqID\" VARCHAR, \"Div4WheelsOn\" 
VARCHAR, \"Div4TotalGTime\" VARCHAR, \"Div4LongestGTime\" VARCHAR, 
\"Div4WheelsOff\" VARCHAR, \"Div4TailNum\" VARCHAR, \"Div5Airport\" VARCHAR, 
\"Div5AirportID\" VARCHAR, \"Div5AirportSeqID\" VARCHAR, \"Div5WheelsOn\" 
VARCHAR, \"Div5TotalGTime\" VARCHAR, \"Div5LongestGTime\" VARCHAR, 
\"Div5WheelsOff\" VARCHAR, \"Div5TailNum\" VARCHAR, \"Unnamed: 109\" 
VARCHAR))\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"depaturetime\") AS \"__time\",\n",
+    "  \"Reporting_Airline\",\n",
+    "  \"Tail_Number\",\n",
+    "  \"Distance\",\n",
+    "  \"Origin\",\n",
+    "  DepDelayMinutes + ArrDelayMinutes AS \"Delay\"\n",
+    "FROM \"ext\"\n",
+    "PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready('example-flights-quantiles')\n",
+    "display.table('example-flights-quantiles')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6f0d1f7-2f34-44fe-9912-4017ed10893c",
+   "metadata": {},
+   "source": [
+    "Finally, run the following cell to import additional Python modules that 
you will use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e4cfd50d-5ef9-4fd1-8077-bf03758c35df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import matplotlib\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f687b0a-517a-4df3-a7cb-b7095a11bb4b",
+   "metadata": {},
+   "source": [
+    "## Understanding the Quantiles sketch\n",
+    "\n",
+    "A structure known as a \"sketch\" holds a representation of the 
distribution of values that we will use throughout this notebook. These are 
created by using the 
[`DS_QUANTILES_SKETCH`](https://druid.apache.org/docs/latest/querying/sql-functions#ds_quantiles_sketch)
 function, providing it with the data set that we want to create the 
representation of.\n",
+    "\n",
+    "To see how quantiles sketches look, run the following cell. It uses 
`DS_QUANTILES_SKETCH` to create a Quantiles sketch for both `Distance` and 
`Delay`."

Review Comment:
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to