317brian commented on code in PR #13485:
URL: https://github.com/apache/druid/pull/13485#discussion_r1088374102
##########
examples/quickstart/jupyter-notebooks/Untitled.ipynb:
##########
@@ -0,0 +1,159 @@
+{
Review Comment:
There's this untitled file and the checkpoints folder in this PR
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
+ "\n",
+ "In this section, you create a simple bar chart showing the channels with
the most additions during the time range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+ "metadata": {},
+ "source": [
+ "First, call `output_notebook()` to set Bokeh to output plots inline
within the notebook. This lets you view your plot inline. You can also use
Bokeh's `output_file` function to write your plot to an HTML file. If you want
to experiment with `output_file` add it to the list of imports."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+ "metadata": {},
+ "source": [
+ "There are several ways to use Bokeh with a DataFrame. In this case, make
a list of channels to serve as the x axis of our plot. For the y axis, divide
the total additions by 100000 for ease of display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = additions_pd.channel.to_list()\n",
+ "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_additions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+ "metadata": {},
+ "source": [
+ "Next, initialize the Bokeh plot (figure) with some basic configurations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " # Create a new plot with a title. Set the size in pixels\n",
+ "bar_plot = figure(height=500, width=750, x_range=channels,
title=\"100000 x Additions\",\n",
+ " toolbar_location=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+ "metadata": {},
+ "source": [
+ "Now, configure the rederer for the vertical bars on the plot:\n",
+ "- Set the x-axis to `channels`, the list of channels.\n",
+ "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+ "- For a splash of color, set the `color` to `Spectral10`. \n",
+ "\n",
+ "Note that palettes in Bokeh are lists of colors. Bokeh expects the list
length to equal the list length of the data dimensions -- in this case 10
colors.\n",
+ "\n",
+ "See the Bokeh docs for more information on [vertical
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.vbar(x=channels, top=total_additions, width=0.5,
color=Spectral10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09",
+ "metadata": {},
+ "source": [
+ "Leave off the x-axis grid lines for this plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca348337-abdd-4e34-a81a-7872736f14ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.xgrid.grid_line_color = None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0",
+ "metadata": {},
+ "source": [
+ "Now, configure the y-axis:\n",
+ " - Set the minimum value to `0`.\n",
+ " - Set the visible range to `(0,40)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.y_range.start = 0\n",
+ "y_range=(0, 40)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2",
+ "metadata": {},
+ "source": [
+ "Finally, display your plot with the `show()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44a604ec-208c-4d7a-9d25-97b308328407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(bar_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a",
+ "metadata": {},
+ "source": [
+ "## Display data with a line graph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b",
+ "metadata": {},
+ "source": [
+ "In this section, you'll create a line graph that compares the following,
per channel:\n",
+ "\n",
+ "* Total additions\n",
+ "* Number of unique editors\n",
+ "* Number of bot edits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f915e03-103c-4151-93c4-ad779741f2a5",
+ "metadata": {},
+ "source": [
+ "First, change the query to include the editors and robots and load the
results into a new pandas object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "910920f5-db6a-4c3a-95fc-d25763d45de4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"\"\"\n",
+ " SELECT channel, SUM(added) AS additions,\n",
+ " COUNT (DISTINCT user) as editors,\n",
+ " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n",
+ " FROM wikipedia_api\n",
+ " GROUP BY channel ORDER BY additions DESC LIMIT 10\n",
+ " \"\"\"\n",
+ "\n",
+ "# Update the payload.\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "\n",
+ "# Submit the request.\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "# Add the results to a pandas DataFrame.\n",
+ "editors_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "editors_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8",
+ "metadata": {},
+ "source": [
+ "Set up values for your plots:\n",
+ "- Channels remains the basis for the x-axis.\n",
+ "- This time, change the resolution for additions to 1000 so it will fit
nicely on our plot with the editors and the robots values.\n",
Review Comment:
```suggestion
"- This time, change the resolution for additions to 1000, so it will
fit nicely on our plot with the editors and the robots values.\n",
```
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
+ "\n",
+ "In this section, you create a simple bar chart showing the channels with
the most additions during the time range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+ "metadata": {},
+ "source": [
+ "First, call `output_notebook()` to set Bokeh to output plots inline
within the notebook. This lets you view your plot inline. You can also use
Bokeh's `output_file` function to write your plot to an HTML file. If you want
to experiment with `output_file` add it to the list of imports."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+ "metadata": {},
+ "source": [
+ "There are several ways to use Bokeh with a DataFrame. In this case, make
a list of channels to serve as the x axis of our plot. For the y axis, divide
the total additions by 100000 for ease of display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = additions_pd.channel.to_list()\n",
+ "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_additions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+ "metadata": {},
+ "source": [
+ "Next, initialize the Bokeh plot (figure) with some basic configurations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " # Create a new plot with a title. Set the size in pixels\n",
+ "bar_plot = figure(height=500, width=750, x_range=channels,
title=\"100000 x Additions\",\n",
+ " toolbar_location=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+ "metadata": {},
+ "source": [
+ "Now, configure the rederer for the vertical bars on the plot:\n",
+ "- Set the x-axis to `channels`, the list of channels.\n",
+ "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+ "- For a splash of color, set the `color` to `Spectral10`. \n",
+ "\n",
+ "Note that palettes in Bokeh are lists of colors. Bokeh expects the list
length to equal the list length of the data dimensions -- in this case 10
colors.\n",
+ "\n",
+ "See the Bokeh docs for more information on [vertical
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.vbar(x=channels, top=total_additions, width=0.5,
color=Spectral10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09",
+ "metadata": {},
+ "source": [
+ "Leave off the x-axis grid lines for this plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca348337-abdd-4e34-a81a-7872736f14ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.xgrid.grid_line_color = None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0",
+ "metadata": {},
+ "source": [
+ "Now, configure the y-axis:\n",
+ " - Set the minimum value to `0`.\n",
+ " - Set the visible range to `(0,40)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.y_range.start = 0\n",
+ "y_range=(0, 40)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2",
+ "metadata": {},
+ "source": [
+ "Finally, display your plot with the `show()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44a604ec-208c-4d7a-9d25-97b308328407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(bar_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a",
+ "metadata": {},
+ "source": [
+ "## Display data with a line graph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b",
+ "metadata": {},
+ "source": [
+ "In this section, you'll create a line graph that compares the following,
per channel:\n",
+ "\n",
+ "* Total additions\n",
+ "* Number of unique editors\n",
+ "* Number of bot edits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f915e03-103c-4151-93c4-ad779741f2a5",
+ "metadata": {},
+ "source": [
+ "First, change the query to include the editors and robots and load the
results into a new pandas object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "910920f5-db6a-4c3a-95fc-d25763d45de4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"\"\"\n",
+ " SELECT channel, SUM(added) AS additions,\n",
+ " COUNT (DISTINCT user) as editors,\n",
+ " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n",
+ " FROM wikipedia_api\n",
+ " GROUP BY channel ORDER BY additions DESC LIMIT 10\n",
+ " \"\"\"\n",
+ "\n",
+ "# Update the payload.\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "\n",
+ "# Submit the request.\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "# Add the results to a pandas DataFrame.\n",
+ "editors_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "editors_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8",
+ "metadata": {},
+ "source": [
+ "Set up values for your plots:\n",
+ "- Channels remains the basis for the x-axis.\n",
Review Comment:
This list isn't parallel. The 3rd item starts w/ a verb
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
Review Comment:
```suggestion
"Finally, you'll need either JupyterLab (recommended) or Jupyter
Notebook. Both the quickstart Druid cluster and Jupyter are deployed at
`localhost:8888` by default, so you'll \n",
```
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
+ "\n",
+ "In this section, you create a simple bar chart showing the channels with
the most additions during the time range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+ "metadata": {},
+ "source": [
+ "First, call `output_notebook()` to set Bokeh to output plots inline
within the notebook. This lets you view your plot inline. You can also use
Bokeh's `output_file` function to write your plot to an HTML file. If you want
to experiment with `output_file` add it to the list of imports."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+ "metadata": {},
+ "source": [
+ "There are several ways to use Bokeh with a DataFrame. In this case, make
a list of channels to serve as the x axis of our plot. For the y axis, divide
the total additions by 100000 for ease of display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = additions_pd.channel.to_list()\n",
+ "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_additions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+ "metadata": {},
+ "source": [
+ "Next, initialize the Bokeh plot (figure) with some basic configurations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " # Create a new plot with a title. Set the size in pixels\n",
+ "bar_plot = figure(height=500, width=750, x_range=channels,
title=\"100000 x Additions\",\n",
+ " toolbar_location=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+ "metadata": {},
+ "source": [
+ "Now, configure the rederer for the vertical bars on the plot:\n",
+ "- Set the x-axis to `channels`, the list of channels.\n",
+ "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+ "- For a splash of color, set the `color` to `Spectral10`. \n",
+ "\n",
+ "Note that palettes in Bokeh are lists of colors. Bokeh expects the list
length to equal the list length of the data dimensions -- in this case 10
colors.\n",
+ "\n",
+ "See the Bokeh docs for more information on [vertical
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.vbar(x=channels, top=total_additions, width=0.5,
color=Spectral10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09",
+ "metadata": {},
+ "source": [
+ "Leave off the x-axis grid lines for this plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca348337-abdd-4e34-a81a-7872736f14ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.xgrid.grid_line_color = None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0",
+ "metadata": {},
+ "source": [
+ "Now, configure the y-axis:\n",
+ " - Set the minimum value to `0`.\n",
+ " - Set the visible range to `(0,40)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.y_range.start = 0\n",
+ "y_range=(0, 40)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2",
+ "metadata": {},
+ "source": [
+ "Finally, display your plot with the `show()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44a604ec-208c-4d7a-9d25-97b308328407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(bar_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a",
+ "metadata": {},
+ "source": [
+ "## Display data with a line graph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b",
+ "metadata": {},
+ "source": [
+ "In this section, you'll create a line graph that compares the following,
per channel:\n",
+ "\n",
+ "* Total additions\n",
+ "* Number of unique editors\n",
+ "* Number of bot edits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f915e03-103c-4151-93c4-ad779741f2a5",
+ "metadata": {},
+ "source": [
+ "First, change the query to include the editors and robots and load the
results into a new pandas object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "910920f5-db6a-4c3a-95fc-d25763d45de4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"\"\"\n",
+ " SELECT channel, SUM(added) AS additions,\n",
+ " COUNT (DISTINCT user) as editors,\n",
+ " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n",
+ " FROM wikipedia_api\n",
+ " GROUP BY channel ORDER BY additions DESC LIMIT 10\n",
+ " \"\"\"\n",
+ "\n",
+ "# Update the payload.\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "\n",
+ "# Submit the request.\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "# Add the results to a pandas DataFrame.\n",
+ "editors_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "editors_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8",
+ "metadata": {},
+ "source": [
+ "Set up values for your plots:\n",
+ "- Channels remains the basis for the x-axis.\n",
+ "- This time, change the resolution for additions to 1000 so it will fit
nicely on our plot with the editors and the robots values.\n",
+ "- Create lists of values for editors and robots."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "842b96f9-b22c-47ef-b050-fbc11b234296",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = editors_pd.channel.to_list()\n",
+ "total_additions = [x / 1000 for x in editors_pd.additions.to_list()]\n",
+ "editors = editors_pd.editors.to_list()\n",
+ "robots = editors_pd.robots.to_list()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ee812f8-fc6e-48b6-abf3-1f8cf87f07eb",
+ "metadata": {},
+ "source": [
+ "Next, create a plot the same as before, but this time leave in the Bokeh
tools so you can try those out at the end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f59bb2ee-d513-492c-b021-58ce9e9e922b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a plot.\n",
+ "line_plot = figure(x_range=channels, height=500, width=750,
title=\"Editors vs robots\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50cfed5d-efcc-4429-9929-3235682cd869",
+ "metadata": {},
+ "source": [
+ "Change the scale to accommodate the current data set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20a4f394-898f-467c-b77e-38e95c45c380",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "line_plot.y_range.start = 0\n",
+ "y_range=(0, 4000)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e4224f5-41a8-42b1-a3c9-cfc0bd96d3a5",
+ "metadata": {},
+ "source": [
+ "Next, add data lines onto the plot:\n",
+ "- For the line color, we only need one color per line. To keep with the
Spectral palette, set the index (0-9) for the color you want.\n",
Review Comment:
```suggestion
"- For the line color, you only need one color per line. To keep with
the Spectral palette, set the index (0-9) for the color you want.\n",
```
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
Review Comment:
```suggestion
"Tabular format is OK for scanning data; however, you can use
visualization and plotting tools that work with Jupyter Notebooks to visualize
data as plots or graphs. This section uses [Bokeh](https://bokeh.org/) to
illustrate some basic plots using Druid data.\n",
```
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
+ "\n",
+ "In this section, you create a simple bar chart showing the channels with
the most additions during the time range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+ "metadata": {},
+ "source": [
+ "First, call `output_notebook()` to set Bokeh to output plots inline
within the notebook. This lets you view your plot inline. You can also use
Bokeh's `output_file` function to write your plot to an HTML file. If you want
to experiment with `output_file` add it to the list of imports."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+ "metadata": {},
+ "source": [
+ "There are several ways to use Bokeh with a DataFrame. In this case, make
a list of channels to serve as the x axis of our plot. For the y axis, divide
the total additions by 100000 for ease of display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = additions_pd.channel.to_list()\n",
+ "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_additions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+ "metadata": {},
+ "source": [
+ "Next, initialize the Bokeh plot (figure) with some basic configurations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " # Create a new plot with a title. Set the size in pixels\n",
+ "bar_plot = figure(height=500, width=750, x_range=channels,
title=\"100000 x Additions\",\n",
+ " toolbar_location=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+ "metadata": {},
+ "source": [
+ "Now, configure the rederer for the vertical bars on the plot:\n",
+ "- Set the x-axis to `channels`, the list of channels.\n",
+ "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+ "- For a splash of color, set the `color` to `Spectral10`. \n",
+ "\n",
+ "Note that palettes in Bokeh are lists of colors. Bokeh expects the list
length to equal the list length of the data dimensions -- in this case 10
colors.\n",
+ "\n",
+ "See the Bokeh docs for more information on [vertical
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.vbar(x=channels, top=total_additions, width=0.5,
color=Spectral10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09",
+ "metadata": {},
+ "source": [
+ "Leave off the x-axis grid lines for this plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca348337-abdd-4e34-a81a-7872736f14ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.xgrid.grid_line_color = None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0",
+ "metadata": {},
+ "source": [
+ "Now, configure the y-axis:\n",
+ " - Set the minimum value to `0`.\n",
+ " - Set the visible range to `(0,40)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.y_range.start = 0\n",
+ "y_range=(0, 40)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2",
+ "metadata": {},
+ "source": [
+ "Finally, display your plot with the `show()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44a604ec-208c-4d7a-9d25-97b308328407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(bar_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a",
+ "metadata": {},
+ "source": [
+ "## Display data with a line graph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b",
+ "metadata": {},
+ "source": [
+ "In this section, you'll create a line graph that compares the following,
per channel:\n",
+ "\n",
+ "* Total additions\n",
+ "* Number of unique editors\n",
+ "* Number of bot edits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f915e03-103c-4151-93c4-ad779741f2a5",
+ "metadata": {},
+ "source": [
+ "First, change the query to include the editors and robots and load the
results into a new pandas object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "910920f5-db6a-4c3a-95fc-d25763d45de4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"\"\"\n",
+ " SELECT channel, SUM(added) AS additions,\n",
+ " COUNT (DISTINCT user) as editors,\n",
+ " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n",
+ " FROM wikipedia_api\n",
+ " GROUP BY channel ORDER BY additions DESC LIMIT 10\n",
+ " \"\"\"\n",
+ "\n",
+ "# Update the payload.\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "\n",
+ "# Submit the request.\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "# Add the results to a pandas DataFrame.\n",
+ "editors_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "editors_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8",
+ "metadata": {},
+ "source": [
+ "Set up values for your plots:\n",
+ "- Channels remains the basis for the x-axis.\n",
+ "- This time, change the resolution for additions to 1000 so it will fit
nicely on our plot with the editors and the robots values.\n",
+ "- Create lists of values for editors and robots."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "842b96f9-b22c-47ef-b050-fbc11b234296",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = editors_pd.channel.to_list()\n",
+ "total_additions = [x / 1000 for x in editors_pd.additions.to_list()]\n",
+ "editors = editors_pd.editors.to_list()\n",
+ "robots = editors_pd.robots.to_list()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ee812f8-fc6e-48b6-abf3-1f8cf87f07eb",
+ "metadata": {},
+ "source": [
+ "Next, create a plot the same as before, but this time leave in the Bokeh
tools so you can try those out at the end."
Review Comment:
```suggestion
"Next, create a plot the same as before, but this time, leave in the
Bokeh tools so you can try those out at the end."
```
##########
examples/quickstart/jupyter-notebooks/druid-visualization.ipynb:
##########
@@ -0,0 +1,623 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Basic visualizations using the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces basic visualization options you can use with the
Druid API. It focuses on two Python modules:
[pandas](https://pandas.pydata.org/) and [Bokeh](https://bokeh.org/). This
tutorial builds on [Learn the basics of the Druid API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Display data in a DataFrame](#Display-data-in-a-DataFrame)\n",
+ "- [Display data with a bar graph](#Display-data-with-a-bar-graph)\n",
+ "- [Display data with a line graph](#Display-data-with-a-line-graph)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "If you haven't already, you'll need install the Requests library for
Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Additionally, install the pandas and Bokeh libraries. For example:\n",
+ "```bash\n",
+ "pip3 install pandas\n",
+ "pip3 install bokeh\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster with data. This tutorial uses the
`wikipedia_api` datasource from [Learn the basics of the Druid
API](api-tutorial.ipynb).\n",
+ "\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter notebook.
Both the quickstart Druid cluster and Jupyter notebook are deployed at
`localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines the Druid host where the Druid Router service listens.
The quickstart deployment configures the to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests, json\n",
+ "import pandas as pd\n",
+ "from bokeh.palettes import Spectral10\n",
+ "from bokeh.io import output_notebook, show\n",
+ "from bokeh.plotting import figure\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment.\n",
+ "# In a distributed environment, use the Router service as the
`druid_host`.\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f'Druid host: {druid_host}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Display data with Pandas\n",
+ "\n",
+ "By default, when you query Druid using the API, Druid returns the results
as JSON. The JSON output is great for programmatic operations, but it is not
easy to scan the data visually. This section shows you how to use the Python
pandas module to transform JSON query results to tabular format. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell selects the top 10 channels by the number of additions
and outputs the data as JSON."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7dd08f7-cba4-4f14-acd3-b5384dae7d88",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f'Query endpoint: {druid_host}{endpoint}')\n",
+ "\n",
+ "http_method = \"POST\"\n",
+ "query = \"SELECT channel, SUM(added) AS additions FROM wikipedia_api
GROUP BY channel ORDER BY additions DESC LIMIT 10\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(f'Query: {query}')\n",
+ "print('Query results:')\n",
+ "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "The following cell takes the JSON results from the Druid API and uses the
pandas
[`read_json`](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html)
method to convert the results to a pandas object.\n",
+ "You can display the pandas ojbect in a tabular format.\n",
+ "The `orient` parameter sets pandas to accept data as a list of records in
the format: `[{column: value, ...}, {column: value, ...}]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "additions_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8c8f6e17-aa3e-4871-94d8-e4bed74ed516",
+ "metadata": {},
+ "source": [
+ "You can also load your results into a pandas DataFrame."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "67e6e09f-97ed-4b7c-b007-d239010f069c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df = pd.DataFrame.from_records(additions_pd)\n",
+ "\n",
+ "additions_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9488cbee-63bf-414a-aff3-207adf39e115",
+ "metadata": {},
+ "source": [
+ "One benefit of using a DataFrame is that you can access all the pandas
`DataFrame` object methods. For example, `DataFrame.max()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4d0a4ada-c9cf-4eee-a55f-b524d2dfad64",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "additions_df.max()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a1720a88-cf1b-4022-981a-69441fac54fd",
+ "metadata": {},
+ "source": [
+ "Check out the [pandas
docs](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
for more ideas about how to use pandas and DataFrames \n",
+ "with your Druid data."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07a16047-dba1-4f45-9fcb-6130c37eca61",
+ "metadata": {},
+ "source": [
+ "## Display data with a bar graph\n",
+ "\n",
+ "Tabular format is OK for scanning data, however you can use visualization
and plotting tools that work with Jupyter notebooks to visualize data as plots
or graphs. This section uses [Bokeh](https://bokeh.org/) to illustrate some
basic plots using Druid data.\n",
+ "\n",
+ "In this section, you create a simple bar chart showing the channels with
the most additions during the time range."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8420df27-0e80-48d1-a806-43f6bf81dd16",
+ "metadata": {},
+ "source": [
+ "First, call `output_notebook()` to set Bokeh to output plots inline
within the notebook. This lets you view your plot inline. You can also use
Bokeh's `output_file` function to write your plot to an HTML file. If you want
to experiment with `output_file` add it to the list of imports."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d824a82f-9e71-43c4-b289-78d4970c5cfc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_notebook()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5859d743-a5a7-445c-a06f-89b8c30856ed",
+ "metadata": {},
+ "source": [
+ "There are several ways to use Bokeh with a DataFrame. In this case, make
a list of channels to serve as the x axis of our plot. For the y axis, divide
the total additions by 100000 for ease of display."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70b40bfc-106d-462c-a5a1-51d0da4c4498",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = additions_pd.channel.to_list()\n",
+ "total_additions = [x / 100000 for x in additions_pd.additions.to_list()]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "efc5f142-8b68-4430-a666-2ed961d732b9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9f619fd3-6a01-4dc6-b8eb-81e4b4ae1a31",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_additions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b430b4c-df81-419d-a81e-63bb0150031b",
+ "metadata": {},
+ "source": [
+ "Next, initialize the Bokeh plot (figure) with some basic configurations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "07644ccd-1f0c-4a50-8929-82559f9bf803",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ " # Create a new plot with a title. Set the size in pixels\n",
+ "bar_plot = figure(height=500, width=750, x_range=channels,
title=\"100000 x Additions\",\n",
+ " toolbar_location=None)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e0a01f4c-6f35-4eb7-9263-4f6ed6c68223",
+ "metadata": {},
+ "source": [
+ "Now, configure the rederer for the vertical bars on the plot:\n",
+ "- Set the x-axis to `channels`, the list of channels.\n",
+ "- Set the `top` coordinate that determines the bar height to `totals`.\n",
+ "- For a splash of color, set the `color` to `Spectral10`. \n",
+ "\n",
+ "Note that palettes in Bokeh are lists of colors. Bokeh expects the list
length to equal the list length of the data dimensions -- in this case 10
colors.\n",
+ "\n",
+ "See the Bokeh docs for more information on [vertical
bars](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.vbar)
and [palettes](https://docs.bokeh.org/en/latest/docs/reference/palettes.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3e40279-6cf1-4e84-820a-2465e5e9d84a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.vbar(x=channels, top=total_additions, width=0.5,
color=Spectral10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04a6581b-fa11-40ec-ba39-04d08e80dd09",
+ "metadata": {},
+ "source": [
+ "Leave off the x-axis grid lines for this plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ca348337-abdd-4e34-a81a-7872736f14ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.xgrid.grid_line_color = None"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b5364774-b8e7-46a0-82d3-f1a9f09a6cc0",
+ "metadata": {},
+ "source": [
+ "Now, configure the y-axis:\n",
+ " - Set the minimum value to `0`.\n",
+ " - Set the visible range to `(0,40)`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5f83f5c-f546-4a71-be36-30f659a9b94b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "bar_plot.y_range.start = 0\n",
+ "y_range=(0, 40)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b54c25a-07af-4ade-96d5-15ca04178fe2",
+ "metadata": {},
+ "source": [
+ "Finally, display your plot with the `show()` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "44a604ec-208c-4d7a-9d25-97b308328407",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(bar_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "772a20f2-2ea9-4e67-96fa-0a1eab32bb5a",
+ "metadata": {},
+ "source": [
+ "## Display data with a line graph"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fd43bc4-d048-4319-8dd0-2b0ca94fd33b",
+ "metadata": {},
+ "source": [
+ "In this section, you'll create a line graph that compares the following,
per channel:\n",
+ "\n",
+ "* Total additions\n",
+ "* Number of unique editors\n",
+ "* Number of bot edits"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f915e03-103c-4151-93c4-ad779741f2a5",
+ "metadata": {},
+ "source": [
+ "First, change the query to include the editors and robots and load the
results into a new pandas object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "910920f5-db6a-4c3a-95fc-d25763d45de4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query = \"\"\"\n",
+ " SELECT channel, SUM(added) AS additions,\n",
+ " COUNT (DISTINCT user) as editors,\n",
+ " SUM(CASE WHEN isRobot='true' THEN 1 ELSE 0 END) AS robots\n",
+ " FROM wikipedia_api\n",
+ " GROUP BY channel ORDER BY additions DESC LIMIT 10\n",
+ " \"\"\"\n",
+ "\n",
+ "# Update the payload.\n",
+ "payload = json.dumps({\n",
+ " \"query\" : query\n",
+ "})\n",
+ "\n",
+ "# Submit the request.\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "# Add the results to a pandas DataFrame.\n",
+ "editors_pd = pd.read_json(json.dumps(response.json()),
orient='records')\n",
+ "\n",
+ "editors_pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f761a0b-efe1-4e4a-bfa5-a80aca3e2cc8",
+ "metadata": {},
+ "source": [
+ "Set up values for your plots:\n",
+ "- Channels remains the basis for the x-axis.\n",
+ "- This time, change the resolution for additions to 1000 so it will fit
nicely on our plot with the editors and the robots values.\n",
+ "- Create lists of values for editors and robots."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "842b96f9-b22c-47ef-b050-fbc11b234296",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "channels = editors_pd.channel.to_list()\n",
+ "total_additions = [x / 1000 for x in editors_pd.additions.to_list()]\n",
+ "editors = editors_pd.editors.to_list()\n",
+ "robots = editors_pd.robots.to_list()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7ee812f8-fc6e-48b6-abf3-1f8cf87f07eb",
+ "metadata": {},
+ "source": [
+ "Next, create a plot the same as before, but this time leave in the Bokeh
tools so you can try those out at the end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f59bb2ee-d513-492c-b021-58ce9e9e922b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Create a plot.\n",
+ "line_plot = figure(x_range=channels, height=500, width=750,
title=\"Editors vs robots\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50cfed5d-efcc-4429-9929-3235682cd869",
+ "metadata": {},
+ "source": [
+ "Change the scale to accommodate the current data set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20a4f394-898f-467c-b77e-38e95c45c380",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "line_plot.y_range.start = 0\n",
+ "y_range=(0, 4000)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e4224f5-41a8-42b1-a3c9-cfc0bd96d3a5",
+ "metadata": {},
+ "source": [
+ "Next, add data lines onto the plot:\n",
+ "- For the line color, we only need one color per line. To keep with the
Spectral palette, set the index (0-9) for the color you want.\n",
+ "- Bokeh adds a legend for the lines. The legend label identifies each
line in the legend.\n",
+ "- The `line_width` and `line_dash` properties control the line
appearance.\n",
+ "\n",
+ "See the [Bokeh
docs](https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html#bokeh.plotting.figure.line)
for more information about the line figures."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1a2f7688-a65b-4489-8f0f-13b5359d8d7c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Add renderers for the lines\n",
+ "addition_line = line_plot.line(channels, total_additions,
line_color=Spectral10[0], legend_label=\"Additions x 1000\",line_width=3,
line_dash='solid')\n",
+ "deletion_line = line_plot.line(channels, editors,
line_color=Spectral10[3], legend_label=\"Editors\", line_width=3,
line_dash='dashed')\n",
+ "robots_line = line_plot.line(channels, robots, line_color=Spectral10[7],
legend_label=\"Robots\", line_width=3, line_dash='dotted')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5c8e398e-b9ca-48e8-87ca-6191b3143245",
+ "metadata": {},
+ "source": [
+ "Finally, display the plot."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cf8fec2b-e71a-4c54-81f5-b3c874d9ede0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "show(line_plot)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "963d9bdf-a9f2-455c-8c32-997f2df06827",
+ "metadata": {},
+ "source": [
+ "The graph shows some interesting data: channels with more bots editing
topics tend to have fewer users editing the topcs. Those are some hard-working
bots!\n",
Review Comment:
```suggestion
"The graph shows some interesting data: channels with more bots editing
topics tend to have fewer users editing the topics. Those are some hard-working
bots!\n",
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]