[GitHub] [druid] techdocsmith commented on a diff in pull request #13485: adds a Jupyter visualization tutorial

via GitHub Wed, 08 Feb 2023 20:34:46 -0800


techdocsmith commented on code in PR #13485:
URL: https://github.com/apache/druid/pull/13485#discussion_r1088424991



##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,620 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Basic visualizations using the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces basic visualization options you can use with the 
Druid API. It focuses on two Python modules: 
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This 
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+    "\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+    "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+    "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "If you haven't already, you'll need install the Requests library for 
Python before you start. For example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Additionally, install the pandas and Bokeh libraries. For example:\n",
+    "```bash\n",
+    "pip3 install pandas\n",
+    "pip3 install bokeh\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster with data. This tutorial uses the 
`wikipedia_api` datasource from [Learn the basics of the Druid 
API](api-tutorial.ipynb).\n",
+    "\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook. 
Both the quickstart Druid cluster and Jupyter notebook are deployed at 
`localhost:8888` by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it 
again with the `port` parameter included. For example, you can use the 
following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages 
you'll need and defines `druid_host` for the host for the Druid Router 
service.The quickstart deployment configures the to listen on port `8888` by 
default, so you'll be making API calls against `http://localhost:8888`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests, json\n",
+    "import pandas as pd\n",
+    "from bokeh.palettes import Spectral10\n",
+    "from bokeh.io import output_notebook, show\n",
+    "from bokeh.plotting import figure\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment.\n",
+    "# In a distributed environment, use the Router service as the 
`druid_host`.\n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "dataSourceName = \"wikipedia_api\"\n",
+    "print(f'Druid host: {druid_host}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and 
`payload` variables are updated in code cells to call a different Druid 
endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Display data in a DataFrame\n",
+    "\n",
+    "By default, when you query Druid using the API, Druid returns the results 
as JSON. The JSON results set is great for programmatic operations, but it is 
not easy to scan the data visually. They Python pandas module makes it simple 
to transform JSON query results to tabular format. In this section, you'll 
learn how to use pandas to display query results in a tabular format.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "The following cell selects the top 10 channels by the number of additions 
and outputs the data as JSON."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+    "\n",
+    "http_method = \"POST\"\n",
+    "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api_v2 
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\" : query\n",
+    "})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(f'Query: {query}')\n",
+    "print('Query results:')\n",
+    "print(json.dumps(response.json(), indent=4))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "The following cell takes the JSON results from the Druid API and uses the 
pandas 
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
 method to convert the results to a pandas object. The pandas ojbect lets us 
display the results in a tabular format. The `orient` parameter sets pandas to 
accept data as a list of records in the format: `[{column: value, ...}, 
{column: value, ...}]`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "additions_pd = pd.read_json(json.dumps(response.json()), 
orient='records')\n",
+    "\n",
+    "additions_pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+   "metadata": {},
+   "source": [
+    "You can also load your results into a pandas DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+    "\n",
+    "additions_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+   "metadata": {},
+   "source": [
+    "One benefit of using a DataFrame is that you can access all the pandas 
`DataFrame` object methods. For example, `DataFrame.max()`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "additions_df.max()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+   "metadata": {},
+   "source": [
+    "Check out the [pandas 
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
 for more ideas about how to use pandas and DataFrames \n",
+    "with your Druid data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+   "metadata": {},
+   "source": [
+    "## Display data with a bar graph\n",
+    "\n",
+    "Tabular format is OK for scanning data, however you can use visualization 
and plotting tools that work with Jupyter notebooks to visualize data as plots 
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some 
basic plots using Druid data.\n",
+    "\n",
+    "In this section, you create a simple bar chart showing the channels with 
the most additions during the time range."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+   "metadata": {},
+   "source": [
+    "First, call `output_notebook()` to set Bokeh to output plots inline 
within the notebook. This lets you view your plot inline. You can also use 
Bokeh's `output_file` function to write your plot to an HTML file. If you want 
to experiment with `output_file` add it to the list of imports."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "output_notebook()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+   "metadata": {},
+   "source": [
+    "There are several ways to use Bokeh with a DataFrame. In this case, make 
a list of channels to serve as the x axis of our plot. For the y axis,  divide 
the total additions by 100000 for ease of display."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "channels = additions_df.channel.to_list()\n",
+    "total_additions = [x / 100000 for x in additions_df.additions.to_list()]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "channels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total_additions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+   "metadata": {},
+   "source": [
+    "Next, initialize your Bokeh plot (figure) with some basic configurations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " # Create a new plot with a title. Set the size in pixels\n",
+    "bar_plot = figure(height=500, width=750, x_range=channels,  
title=\"100000 x Additions\",\n",
+    "           toolbar_location=None)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+   "metadata": {},
+   "source": [
+    "Now, configure the rederer for the vertical bars on the plot:\n",
+    "- Set the x axis to `channels`, the list of channels.\n",
+    "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+    "- For a splash of color, set the `color` to `Spectral10`. \n",
+    "\n",
+    "Note that palettes in Bokeh are lists of colors. Bokeh expects the list 
length to equal the list length of the data dimensions -- in this case 10 
colors.\n",
+    "\n",
+    "See the Bokeh docs for more information on [vertical 
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
 and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."

Review Comment:
   This is normal behavior, but I noted expected output.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] techdocsmith commented on a diff in pull request #13485: adds a Jupyter visualization tutorial

Reply via email to