This is an automated email from the ASF dual-hosted git repository.
kfaraz pushed a commit to branch 25.0.0
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/25.0.0 by this push:
new 7866edc404 [Backport] docs: Jupyter tutorial (#13589)
7866edc404 is described below
commit 7866edc404b10f424763cf9749ead2a9fcba0acc
Author: Victoria Lim <[email protected]>
AuthorDate: Fri Dec 16 18:56:05 2022 -0800
[Backport] docs: Jupyter tutorial (#13589)
* docs: add index page and related stuff for jupyter tutorials (#13342)
* docs: notebook only for API tutorial (#13345)
* docs: notebook for API tutorial
* Apply suggestions from code review
Co-authored-by: Charles Smith <[email protected]>
* address the other comments
* typo
* add commentary to outputs
* address feedback from will
* delete unnecessary comment
Co-authored-by: Charles Smith <[email protected]>
Co-authored-by: 317brian <[email protected]>
Co-authored-by: Charles Smith <[email protected]>
---
.gitignore | 1 +
docs/tutorials/tutorial-jupyter-index.md | 71 +++
examples/quickstart/jupyter-notebooks/README.md | 89 ++++
.../jupyter-notebooks/api-tutorial.ipynb | 494 +++++++++++++++++++++
website/.spelling | 1 +
website/sidebars.json | 3 +-
6 files changed, 658 insertions(+), 1 deletion(-)
diff --git a/.gitignore b/.gitignore
index f906e24267..d6ecf2b795 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,4 @@ README
integration-tests/gen-scripts/
/bin/
*.hprof
+**/.ipynb_checkpoints/
diff --git a/docs/tutorials/tutorial-jupyter-index.md
b/docs/tutorials/tutorial-jupyter-index.md
new file mode 100644
index 0000000000..233b9fda50
--- /dev/null
+++ b/docs/tutorials/tutorial-jupyter-index.md
@@ -0,0 +1,71 @@
+---
+id: tutorial-jupyter-index
+title: "Jupyter Notebook tutorials"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+<!-- tutorial-jupyter-index.md and examples/quickstart/jupyter-notebooks/README.md share a lot of the same content. If you make a change in one place, update the other too. -->
+
+You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These tutorials provide snippets of Python code that you can use to run calls
against the Druid API to complete the tutorial.
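+As a taste of the snippet style the notebooks use, here's a minimal sketch that builds (but doesn't yet send) a health-check call. The host and endpoint assume the quickstart defaults; nothing here is specific to the notebooks themselves:

```python
import requests

# Quickstart defaults: the Druid Router listens on localhost:8888.
druid_host = "http://localhost:8888"
endpoint = "/status/health"

# requests.Request lets you inspect the prepared call before sending it.
prepared = requests.Request("GET", druid_host + endpoint).prepare()
print(prepared.method, prepared.url)

# Against a running cluster, send it with:
# response = requests.Session().send(prepared)
```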
+
+## Prerequisites
+
+Make sure you meet the following requirements before starting the
Jupyter-based tutorials:
+
+- Python 3
+
+- The `requests` package for Python. For example, you can install it with the
following command:
+
+ ```bash
+ pip3 install requests
+ ```
+
+- JupyterLab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888`, so start Jupyter on a different port.
+
+ - Install JupyterLab or Notebook:
+
+ ```bash
+ # Install JupyterLab
+ pip3 install jupyterlab
+ # Install Jupyter Notebook
+ pip3 install notebook
+ ```
+  - Start Jupyter:
+ - JupyterLab
+ ```bash
+ # Start JupyterLab on port 3001
+ jupyter lab --port 3001
+ ```
+ - Jupyter Notebook
+ ```bash
+ # Start Jupyter Notebook on port 3001
+ jupyter notebook --port 3001
+ ```
+
+- An available Druid instance. You can use the `micro-quickstart`
configuration described in [Quickstart (local)](./index.md). The tutorials
assume that you are using the quickstart, so no authentication or authorization
is expected unless explicitly mentioned.
+
+## Tutorials
+
+The notebooks are located in the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
You can either clone the repo or download the notebooks you want individually.
+
+The links that follow are the raw GitHub URLs, so you can use them to download
the notebook directly, such as with `wget`, or manually through your web
browser. Note that if you save the file from your web browser, make sure to
remove the `.txt` extension.
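+If your browser tacks a `.txt` onto the download, a quick sketch for stripping it (run in the folder where the notebook landed; the filename pattern is illustrative) is:

```python
from pathlib import Path

# Strip a browser-added .txt suffix,
# e.g. api-tutorial.ipynb.txt -> api-tutorial.ipynb
for p in Path(".").glob("*.ipynb.txt"):
    p.rename(p.with_suffix(""))  # drops only the final .txt
```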
+
+- [Introduction to the Druid
API](https://raw.githubusercontent.com/apache/druid/master/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb)
walks you through some of the basics related to the Druid API and several
endpoints.
\ No newline at end of file
diff --git a/examples/quickstart/jupyter-notebooks/README.md
b/examples/quickstart/jupyter-notebooks/README.md
new file mode 100644
index 0000000000..7e5fa2beca
--- /dev/null
+++ b/examples/quickstart/jupyter-notebooks/README.md
@@ -0,0 +1,89 @@
+# Jupyter Notebook tutorials for Druid
+
+<!-- This README and the tutorial-jupyter-index.md file in docs/tutorials
share a lot of the same content. If you make a change in one place, update the
other too. -->
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These tutorials provide snippets of Python code that you can use to run calls
against the Druid API to complete the tutorial.
+
+## Prerequisites
+
+Make sure you meet the following requirements before starting the
Jupyter-based tutorials:
+
+- Python 3
+
+- The `requests` package for Python. For example, you can install it with the
following command:
+
+ ```bash
+ pip3 install requests
+    ```
+
+- JupyterLab (recommended) or Jupyter Notebook running on a non-default port. By default, Druid and Jupyter both try to use port `8888`, so start Jupyter on a different port.
+
+ - Install JupyterLab or Notebook:
+
+ ```bash
+ # Install JupyterLab
+ pip3 install jupyterlab
+ # Install Jupyter Notebook
+ pip3 install notebook
+ ```
+ - Start Jupyter:
+ - JupyterLab
+ ```bash
+ # Start JupyterLab on port 3001
+ jupyter lab --port 3001
+ ```
+ - Jupyter Notebook
+ ```bash
+ # Start Jupyter Notebook on port 3001
+ jupyter notebook --port 3001
+ ```
+
+- An available Druid instance. You can use the `micro-quickstart`
configuration described in [Quickstart
(local)](../../../docs/tutorials/index.md). The tutorials assume that you are
using the quickstart, so no authentication or authorization is expected unless
explicitly mentioned.
+
+## Tutorials
+
+The notebooks are located in the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
You can either clone the repo or download the notebooks you want individually.
+
+If you download a notebook manually through your web browser, use the raw GitHub URL for the file, such as with `wget`. Note that if you save the file from your web browser, make sure to remove the `.txt` extension.
+
+- [Introduction to the Druid API](api-tutorial.ipynb) walks you through some
of the basics related to the Druid API and several endpoints.
+
+## Contributing
+
+If you build a Jupyter tutorial, you need to do a few things to add it to the
docs in addition to saving the notebook in this directory. The process requires
two PRs to the repo.
+
+For the first PR, do the following:
+
+1. Clear the outputs from your notebook before you make the PR. You can use
the following command:
+
+ ```bash
+    jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace ./path/to/notebook/notebookName.ipynb
+ ```
+
+2. Create the PR as you normally would. Make sure to note that this PR is the
one that contains only the Jupyter notebook and that there will be a subsequent
PR that updates related pages.
+
+3. After this first PR is merged, grab the "raw" URL for the file from GitHub.
For example, navigate to the file in the GitHub web UI and select **Raw**. Use
the URL for this in the second PR as the download link.
+
+For the second PR, do the following:
+
+1. Update the list of [Tutorials](#tutorials) on this page and in the [Jupyter tutorial index page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the `docs/tutorials` directory.
+2. Update `tutorial-jupyter-index.md` and provide the URL to the raw version
of the file that becomes available after the first PR is merged.
diff --git a/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
new file mode 100644
index 0000000000..b795babaef
--- /dev/null
+++ b/examples/quickstart/jupyter-notebooks/api-tutorial.ipynb
@@ -0,0 +1,494 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Learn the basics of the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces you to the basics of the Druid API and some of
the endpoints you might use frequently to perform tasks, such as the
following:\n",
+ "\n",
+ "- Checking if your cluster is up\n",
+ "- Ingesting data\n",
+ "- Querying data\n",
+ "\n",
+    "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, the Overlord service on the Master server provides the status of a task. You'll also interact with the Broker service on the Query server to see which datasources are available, and you'll use the Broker to run queries. The Router service on the Query server routes API calls.\n",
+ "\n",
+ "For more information, see the [API
reference](https://druid.apache.org/docs/latest/operations/api-reference.html),
which is organized by server type.\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Query your data](#Query-your-data)\n",
+ "- [Learn more](#Learn-more)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+    "You'll need to install the Requests library for Python before you start. For example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster. This tutorial uses the
`micro-quickstart` config described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So
download that and start it if you haven't already. In the root of the Druid
folder, run the following command to start Druid:\n",
+ "\n",
+ "```bash\n",
+ "./bin/start-micro-quickstart\n",
+ "```\n",
+ "\n",
+ "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook.
Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888`
by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter Notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, you can point to other Druid services. In
this tutorial, you'll use the Router service as the `druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "29c24856",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Get basic cluster information\n",
+ "\n",
+ "In this cell, you'll use the `GET /status` endpoint to return basic
information about your cluster, such as the Druid version, loaded extensions,
and resource consumption.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/status` and updates the HTTP
method to `GET`. When you run the cell, you should get a response that starts
with the version number of your Druid deployment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "baa140b8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = \"/status\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+    "print(\"\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cbeb5a63",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Get cluster health\n",
+ "\n",
+ "The `/status/health` endpoint returns `true` if your cluster is up and
running. It's useful if you want to do something like programmatically check if
your cluster is available. When you run the following cell, you should get
`true` if your Druid cluster has finished starting up and is running."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5e51170e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/status/health\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(\"\\033[1mResponse\\033[0m: \" + response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Now that you've confirmed that your cluster is up and running, you can
start ingesting data. There are different ways to ingest data based on what
your needs are. For more information, see [Ingestion
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+ "\n",
+    "This tutorial uses the multi-stage query (MSQ) task engine and its `/sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields.\n",
+ "\n",
+ "To learn more about SQL-based ingestion, see [SQL-based
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html).
For information about the endpoint specifically, see [SQL-based ingestion and
multi-stage query task
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+ "\n",
+ "\n",
+ "The next cell does the following:\n",
+ "\n",
+    "- Includes a payload that inserts data from an external source into a table named `wikipedia_api`. The payload is in JSON format and included directly in the code. You can also store it in a file and provide the file.\n",
+ "- Saves the response to a unique variable that you can reference later to
identify this ingestion task\n",
+ "\n",
+ "The example uses INSERT, but you could also use REPLACE INTO. In fact, if
you have an existing datasource with the name `wikipedia_api`, you need to use
REPLACE INTO instead. \n",
+ "\n",
+ "The MSQ task engine uses a task to ingest data. The response for the API
includes a `taskId` and `state` for your ingestion. You can use this `taskId`
to reference this task later on to get more information about it.\n",
+ "\n",
+ "Before you ingest the data, take a look at the query. Pay attention to
two parts of it, `__time` and `PARTITIONED BY`, which relate to how Druid
partitions data:\n",
+ "\n",
+ "- **`__time`**\n",
+ "\n",
+ " The `__time` column is a key concept for Druid. It's the default
partition for Druid and is treated as the primary timestamp. Use it to help you
write faster and more efficient queries. Big datasets, such as those for event
data, typically have a time component. This means that instead of writing a
query using only `COUNT`, you can combine that with `WHERE __time` to return
results much more quickly.\n",
+ "\n",
+ "- **`PARTITIONED BY DAY`**\n",
+ "\n",
+    "    If you partition by day, Druid creates segment files within the partition based on the day. You can only replace, delete, and update data at the partition level. So when you're deciding how to partition data, make the partition large enough (at least 500,000 rows) for good performance but not so big that those operations become impractical to run.\n",
+ "\n",
+ "To learn more, see
[Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
+ "\n",
+ "Now, run the next cell to start the ingestion."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "362b6a87",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "\n",
+ "# The query uses INSERT INTO. If you have an existing datasource with the
name wikipedia_api, use REPLACE INTO instead.\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO wikipedia_api SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
+ " AS __time, * FROM TABLE \\\n",
+ " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": [...]
+ " PARTITIONED BY DAY\",\n",
+ " \"context\": {\n",
+ " \"maxNumTasks\": 3\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestion_taskId_response = response\n",
+ "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+ "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
+ "print(\"\\nThe response includes the task ID and the status: \" +
response.text + \".\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+    "Extract the `taskId` value from the `ingestion_taskId_response` variable so that you can reference it later:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f578b9b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ingestion_taskId =
json.loads(ingestion_taskId_response.text)['taskId']\n",
+ "print(f\"This is the task ID: {ingestion_taskId}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Get the status of your task\n",
+ "\n",
+ "The following cell shows you how to get the status of your ingestion
task. The example continues to run API calls against the endpoint to fetch the
status until the ingestion task completes. When it's done, you'll see the JSON
response.\n",
+ "\n",
+ "You can see basic information about your query, such as when it started
and whether or not it's finished.\n",
+ "\n",
+    "In addition to the status, you can retrieve a full report about the task using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need that information for this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdbab6ae",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestion_status = json.loads(response.text)['status']['status']\n",
+ "# If you only want to fetch the status once and print it, \n",
+ "# uncomment the print statement and comment out the if and while loops\n",
+ "# print(json.dumps(response.json(), indent=4))\n",
+ "\n",
+ "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "    print(\"The ingestion is running...\")\n",
+    "\n",
+    "# Poll until the task leaves the RUNNING state so that a failed task doesn't loop forever.\n",
+    "while ingestion_status not in (\"SUCCESS\", \"FAILED\"):\n",
+    "    time.sleep(15)\n",
+    "    response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "    ingestion_status = json.loads(response.text)['status']['status']\n",
+    "\n",
+    "if ingestion_status == \"SUCCESS\":\n",
+    "    print(\"The ingestion is complete:\")\n",
+    "else:\n",
+    "    print(\"The ingestion failed:\")\n",
+    "print(json.dumps(response.json(), indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1336ddd5-8a42-41af-8913-533316221c52",
+ "metadata": {},
+ "source": [
+ "Wait until your ingestion completes before proceeding. Depending on what
else is happening in your Druid cluster and the resources available, ingestion
can take some time."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Query your data\n",
+ "\n",
+ "When you ingest data into Druid, Druid stores the data in a datasource,
and this datasource is what you run queries against.\n",
+ "\n",
+ "### List your datasources\n",
+ "\n",
+ "You can get a list of datasources from the
`/druid/coordinator/v1/datasources` endpoint. Since you're just getting
started, there should only be a single datasource, the `wikipedia_api` table
you created earlier when you ingested external data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "959e3c9b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/coordinator/v1/datasources\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(\"\\nThe response is the list of datasources available in your
Druid deployment: \" + response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### SELECT data\n",
+ "\n",
+ "Now, you can query the data. Because this tutorial is running in Jupyter,
make sure to limit the size of your query results using `LIMIT`. For example,
the following cell selects all columns but limits the results to 3 rows for
display purposes because each row is a JSON object. In actual use cases, you'll
want to only select the rows that you need. For more information about the
kinds of things you can do, see [Druid
SQL](https://druid.apache.org/docs/latest/querying/sql.html).\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "694900d0-891f-41bd-9b45-5ae957385244",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+ "print(f\"\\nEach JSON object in the response represents a row in the
{dataSourceName} datasource.\") \n",
+ "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(),
indent=4))\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+    "In addition to the query, there are a few additional things you can define within the payload. For a full list, see the [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "This tutorial uses a context parameter and a dynamic parameter.\n",
+ "\n",
+    "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. A custom ID lets you reference the query more easily, for example, if you need to cancel it.\n",
+ "\n",
+ "\n",
+    "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
+ "\n",
+ "\n",
+ "The following cell selects rows where the `__time` column contains a
value greater than the value defined dynamically in `parameters` and sets a
custom `sqlQueryId`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 1\",\n",
+ " \"context\": {\n",
+ " \"sqlQueryId\" : \"important-query\" \n",
+ " },\n",
+ " \"parameters\": [\n",
+ " { \"type\": \"TIMESTAMP\", \"value\": \"2016-06-27\"}\n",
+ " ]\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+ "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(),
indent=4))\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Learn more\n",
+ "\n",
+    "This tutorial covers some of the basics related to the Druid API. To learn more about the kinds of things you can do, see the API documentation:\n",
+ "\n",
+ "- [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
+ "- [API
reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n",
+ "\n",
+ "You can also try out the
[druid-client](https://github.com/paul-rogers/druid-client), a Python library
for Druid created by Paul Rogers, a Druid contributor.\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.6"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/website/.spelling b/website/.spelling
index a279efe2a9..36a104a930 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -142,6 +142,7 @@ JVMs
Joda
JsonProperty
Jupyter
+JupyterLab
KMS
Kerberized
Kerberos
diff --git a/website/sidebars.json b/website/sidebars.json
index 1ebc214027..a442cfaf34 100644
--- a/website/sidebars.json
+++ b/website/sidebars.json
@@ -22,7 +22,8 @@
"tutorials/tutorial-transform-spec",
"tutorials/docker",
"tutorials/tutorial-kerberos-hadoop",
- "tutorials/tutorial-msq-convert-spec"
+ "tutorials/tutorial-msq-convert-spec",
+ "tutorials/tutorial-jupyter-index"
],
"Design": [
"design/architecture",
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]