techdocsmith commented on code in PR #13345:
URL: https://github.com/apache/druid/pull/13345#discussion_r1029866141
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,442 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Tutorial: Learn the basics of the Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ " \n",
+ "This tutorial introduces you to the basics of the Druid API and some of
the endpoints you might use frequently, including the following tasks:\n",
+ "\n",
+ "- Checking if your cluster is up\n",
+ "- Ingesting data\n",
+ "- Querying data\n",
+ "- Deleting data\n",
+ "\n",
+ "In a Druid deployment, you have [Mastery, Query, and Data
servers](https://druid.apache.org/docs/latest/design/processes.html#server-types)
that all fulfill different purposes. The endpoint you use for a certain action
is determined, partially, by which server governs that part of Druid and the
processes that run on that server type. That's why the [API
reference](https://druid.apache.org/docs/latest/operations/api-reference.html#historical)
is organized by server type and process.\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Before you start](#Before-you-start)\n",
+ "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Query your data](#Query-your-data)\n",
+ "- [Manage your data](#Manage-your-data)\n",
+ "- [Next steps](#Next-steps)\n",
+ "\n",
+ "For the best experience, use Jupyter Lab so that you can always access
the table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8d6bbbcb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Requirements\n",
+ "\n",
+ "You'll need install the Requests library for Python before you start. For
example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Next, you'll need a Druid cluster. This tutorial uses the
`micro-quickstart` config described in the [Druid
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So
download that and start it if you haven't already. In the root of the Druid
folder, run the following command to start Druid:\n",
+ "\n",
+ "```bash\n",
+ "./bin/start-micro-quickstart\n",
+ "```\n",
+ "\n",
+ "Finally, you'll need either Jupyter lab (recommended) or Jupyter
notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed
at `localhost:8888` by default, so you'll \n",
+ "need to change the port for Jupyter. To do so, stop Jupyter and start it
again with the `port` parameter included. For example, you can use the
following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using Jupyter lab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host the tutorial uses.
The quickstart deployment configures Druid to listen on port `8888` by default,
so you'll be making API calls against `http://localhost:8888`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSourceName = \"wikipedia_api\"\n",
+ "print(druid_host)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint`, `http_method`, and
`payload` variables are updated in code cells to call a different Druid
endpoint to accomplish a task."
+ ]
+ },
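The `endpoint`/`http_method`/`payload` pattern repeated in the cells below can be factored into a small helper. This is a hypothetical sketch, not part of the notebook itself; `build_request` is an assumed name, and the tutorial keeps the steps inline for clarity:

```python
import json

def build_request(druid_host, endpoint, payload=None):
    # Assemble the (url, headers, data) triple used throughout this
    # notebook. JSON payloads get a Content-Type header; requests with
    # no payload send empty headers, matching the cells below.
    url = druid_host + endpoint
    if payload is None:
        return url, {}, None
    return url, {"Content-Type": "application/json"}, json.dumps(payload)
```

For example, `url, headers, data = build_request(druid_host, "/status")` followed by `requests.request("GET", url, headers=headers, data=data)` reproduces the first cell below.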
+ {
+ "cell_type": "markdown",
+ "id": "29c24856",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Get basic cluster information\n",
+ "\n",
+ "In this cell, you'll use the `GET /status` endpoint to return basic
information about your cluster, such as the Druid version, loaded extensions,
and resource consumption.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/status` and updates the HTTP
method to `GET`. When you run the cell, you should get a response that starts
with the version number of your Druid deployment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "baa140b8",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = \"/status\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(json.dumps(json.loads(response.text), indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cbeb5a63",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Get cluster health\n",
+ "\n",
+ "The `/status/health` endpoint returns `true` if your cluster is up and
running. It's useful if you want to do something like programmatically check if
your cluster is available. When you run the following cell, you should get
`true` if your Druid cluster has finished starting up and is running."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5e51170e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# GET \n",
+ "endpoint = \"/status/health\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
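Because `/status/health` lends itself to programmatic checks, a polling helper can wait for a cluster that is still starting up. A minimal sketch, assuming a `wait_until_healthy` helper name (not part of the notebook); the `check` argument is injectable so the loop can be exercised without a live cluster:

```python
import time

def wait_until_healthy(druid_host, timeout=60, interval=2, check=None):
    # check() should return True once the cluster reports healthy; by
    # default it issues GET /status/health using the Requests library.
    if check is None:
        import requests  # installed earlier in this tutorial
        def check():
            try:
                return requests.get(druid_host + "/status/health",
                                    timeout=5).text.strip() == "true"
            except requests.exceptions.RequestException:
                return False  # Druid may still be starting up
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

For example, `wait_until_healthy(druid_host)` blocks for up to a minute and returns `True` as soon as the quickstart cluster is ready.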
+ {
+ "cell_type": "markdown",
+ "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Now that you've confirmed that your cluster is up and running, you can
start ingesting data. There are different ways to ingest data based on what
your needs are. For more information, see [Ingestion
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+ "\n",
+ "This tutorial uses the multi-stage query (MSQ) task engine and its
`sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint
accepts [SQL requests in the JSON-over-HTTP
format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body)
using the query, context, and parameters fields\n",
+ "\n",
+ "To learn more about SQL-based ingestion, see [SQL-based
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html).
For information about the endpoint specifically, see [SQL-based ingestion and
multi-stage query task
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+ "\n",
+ "\n",
+ "The next cell does the following:\n",
+ "\n",
+ "- Includes a payload that inserts data from an external source into a
table named wikipedia_api. The payload is in JSON format and included in the
code directly. You can also store it in a file and provide the file. \n",
+ "- Saves the response to a unique variable that you can reference later to
identify this ingestion task\n",
+ "\n",
+ "The example uses INSERT, but you could also use REPLACE. \n",
+ "\n",
+ "For the MSQ task engine, ingesting data is done through a task, so the
response includes a `taskId` and `state` for your ingestion. You can use this
`taskId` to reference this task later on to get more information about it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "362b6a87",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO wikipedia_api SELECT
TIME_PARSE(\\\"timestamp\\\") AS __time, * FROM TABLE( EXTERN(
'{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}',
'{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"flag
s\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]'
) ) PARTITIONED BY DAY\",\n",
+ " \"context\": {\n",
+ " \"maxNumTasks\": 3\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestiion_taskId_response = response\n",
+ "print(response.text + f\"\\nInserting data into the table named
{dataSourceName}.\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "Extract the `taskId` value from the `taskId_response` variable so that
you can reference it later:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f578b9b2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ingestion_taskId =
json.loads(ingestiion_taskId_response.text)['taskId']\n",
+ "print(ingestion_taskId)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Get the status of your task\n",
+ "\n",
+ "The following cell shows you how to get the status of your ingestion.
You can see basic information about your query, such as when it started and
whether or not it's finished.\n",
+ "\n",
+ "In addition to the status, you can retrieve a full report about it if you
want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need
that information for this tutorial."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fdbab6ae",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(json.dumps(json.loads(response.text), indent=4))"
+ ]
+ },
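Ingestion runs asynchronously, so before querying you may want to wait until the task reaches a terminal state. A sketch under stated assumptions: `wait_for_task` is a hypothetical helper, and `get_state` is injectable (in the notebook it would call the status endpoint above and read the task state out of the JSON response):

```python
import time

TERMINAL_STATES = {"SUCCESS", "FAILED"}

def wait_for_task(task_id, get_state, timeout=300, interval=5):
    # get_state(task_id) returns the task's state string; in the
    # notebook it would issue GET /druid/indexer/v1/task/{task_id}/status
    # and extract the state from the parsed response.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_state(task_id)
        if state in TERMINAL_STATES:
            return state
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} not finished after {timeout}s")
```

Polling with a bounded timeout is preferable to a fixed `sleep`, since ingestion time varies with data size and hardware.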
+ {
+ "cell_type": "markdown",
+ "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Query your data\n",
+ "\n",
+ "When you ingest data into Druid, Druid stores the data in a datasource,
and this datasource is what you run queries against.\n",
+ "\n",
+ "### List your datasources\n",
+ "\n",
+ "You can get a list of datasources from the
`/druid/coordinator/v1/datasources` endpoint. Since you're just getting
started, there should only be a single datasource, the `wikipedia_api` table
you created earlier when you ingested external data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "959e3c9b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/coordinator/v1/datasources\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "### Query your data\n",
+ "\n",
+ "Now, you can query the data. Because this tutorial is running in Jupyter,
make sure to limit the size of your query results using `LIMIT`. For example,
the following cell selects all columns but limits the results to 3 rows for
display purposes.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "694900d0-891f-41bd-9b45-5ae957385244",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(json.dumps(json.loads(response.text), indent=4))\n",
+ "\n"
+ ]
+ },
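By default, `/druid/v2/sql` returns a JSON array with one object per result row, so `response.json()` yields a plain Python list of dicts. A small hypothetical helper (`pluck` is an assumed name, and the sample rows are made up for illustration) for pulling out a single column:

```python
def pluck(rows, column):
    # rows: the parsed JSON array from /druid/v2/sql, one dict per row.
    return [row[column] for row in rows]

# Made-up sample shaped like two result rows from the query above:
sample = [{"channel": "#en.wikipedia", "added": 36},
          {"channel": "#de.wikipedia", "added": 17}]
```

For example, `pluck(response.json(), "channel")` would list the channel of each returned row.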
+ {
+ "cell_type": "markdown",
+ "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "In addition to the query, there are a few additional things you can
define within the payload. For a full list, see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
+ "\n",
+ "This tutorial uses a context parameter and a dynamic parameter.\n",
+ "\n",
+ "Context parameters can control certain characteristics related to a
query, such as configuring a custom timeout. For information, see [Context
parameters](https://druid.apache.org/docs/latest/querying/query-context.html).
In the example query that follows, the context block assigns a custom
`sqlQueryID` to the query. Typically, the `sqlQueryId` is autogenerated. With a
custom ID, you can use it to reference the query more easily like when you need
to cancel a query.\n",
+ "\n",
+ "\n",
+ "Druid supports dynamic parameters, so you can either define certain
parameters within the query explicitly or insert a `?` as a placeholder and
define it in a parameters block. In the following cell, the `?` gets bound to
the timestmap value of `2016-06-27` at execution time. For more information,
see [Dynamic
parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
+ "\n",
+ "\n",
+ "The following cell selects rows where the `__time` column contains a
value greater than the value defined dynamically in `parameters` and sets a
custom `sqlQueryId`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c3860d64-fba6-43bc-80e2-404f5b3b9baa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT * FROM wikipedia_api WHERE __time > ? LIMIT 1\",\n",
+ " \"context\": {\n",
+ " \"sqlQueryId\" : \"important-query\" \n",
+ " },\n",
+ " \"parameters\": [\n",
+ " { \"type\": \"TIMESTAMP\", \"value\": \"2016-06-27\"}\n",
+ " ]\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "print(json.dumps(json.loads(response.text), indent=4))"
+ ]
+ },
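The custom `sqlQueryId` set above is what you would hand to the SQL query cancellation endpoint, `DELETE /druid/v2/sql/{sqlQueryId}`, described in the Druid SQL API docs. A minimal sketch; the helper name is an assumption, not part of the notebook:

```python
def cancel_query_url(druid_host, sql_query_id):
    # Builds the cancellation URL for a running SQL query; send it with
    # the DELETE method, e.g.:
    #   requests.request("DELETE",
    #                    cancel_query_url(druid_host, "important-query"))
    return f"{druid_host}/druid/v2/sql/{sql_query_id}"
```

Cancellation only succeeds while the query is still running, which is why a predictable, custom ID is handy for long-running queries.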
+ {
+ "cell_type": "markdown",
+ "id": "8fbfa1fa-2cde-46d5-8107-60bd436fb64e",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Next steps\n",
+ "\n",
+ "This tutorial covers the some of the basics related to the Druid API. To
learn more about the kinds of things you can do, see the API documentation:\n",
+ "\n",
+ "- [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html)\n",
+ "- [API
reference](https://druid.apache.org/docs/latest/operations/api-reference.html)\n",
+ "\n",
+ "You can also try out the
[druid-client](https://github.com/paul-rogers/druid-client), a Python library
for Druid created by a Druid contributor.\n",
Review Comment:
I think mention "Druid contributor" Paul Rogers explicitly here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]