2bethere commented on code in PR #13345:
URL: https://github.com/apache/druid/pull/13345#discussion_r1035418177


##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,736 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces you to the basics of the Druid API and some of 
the endpoints you might use frequently to perform tasks, such as the 
following:\n",
+    "\n",
+    "- Checking if your cluster is up\n",
+    "- Ingesting data\n",
+    "- Querying data\n",
+    "\n",
+    "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, you make API calls to the Overlord service on the Master server to get the status of a task. You'll also interact with the Broker service on the Query server to see what datasources are available. And to run queries, you'll interact with the Router on the Query server.\n",
+    "\n",
+    "For more information, see the [API 
reference](https://druid.apache.org/docs/latest/operations/api-reference.html), 
which is organized by server type.\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Query your data](#Query-your-data)\n",
+    "- [Manage your data](#Manage-your-data)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use Jupyter Lab so that you can always access 
the table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need to install the Requests library for Python before you start. For example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-micro-quickstart\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it again with the `port` parameter included. For example, you can use the following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host the tutorial uses. The quickstart deployment configures Druid's Router service to listen on port `8888` by default, so you'll be making API calls against `http://localhost:8888`. This is the port for the Router, which directs your API calls to the appropriate service for most things."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "http://localhost:8888\n"
+     ]
+    }
+   ],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",

Review Comment:
   ```suggestion
       "# druid_host is the hostname and port for your Druid deployment. \n By 
default, Druid runs on localhost port 8888. You can modify this to point to 
other Druid service locations.",
   ```



##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,480 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces you to the basics of the Druid API and some of 
the endpoints you might use frequently to perform tasks, such as the 
following:\n",
+    "\n",
+    "- Checking if your cluster is up\n",
+    "- Ingesting data\n",
+    "- Querying data\n",
+    "\n",
+    "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, the Overlord service on the Master server provides the status of a task. You'll also interact with the Broker service on the Query server to see what datasources are available. And to run queries, you'll interact with the Broker. The Router service on the Query server routes API calls.\n",
+    "\n",
+    "For more information, see the [API 
reference](https://druid.apache.org/docs/latest/operations/api-reference.html), 
which is organized by server type.\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Query your data](#Query-your-data)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need to install the Requests library for Python before you start. For example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-micro-quickstart\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. 
Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` 
by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it 
again with the `port` parameter included. For example, you can use the 
following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service as the `druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSourceName = \"wikipedia_api\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and 
`payload` variables are updated in code cells to call a different Druid 
endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29c24856",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Get basic cluster information\n",
+    "\n",
+    "In this cell, you'll use the `GET /status` endpoint to return basic 
information about your cluster, such as the Druid version, loaded extensions, 
and resource consumption.\n",
+    "\n",
+    "The following cell sets `endpoint` to `/status` and updates the HTTP 
method to `GET`. When you run the cell, you should get a response that starts 
with the version number of your Druid deployment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "baa140b8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbeb5a63",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get cluster health\n",
+    "\n",
+    "The `/status/health` endpoint returns `true` if your cluster is up and 
running. It's useful if you want to do something like programmatically check if 
your cluster is available. When you run the following cell, you should get 
`true` if your Druid cluster has finished starting up and is running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e51170e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# GET \n",
+    "endpoint = \"/status/health\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mResponse\\033[0m: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Now that you've confirmed that your cluster is up and running, you can 
start ingesting data. There are different ways to ingest data based on what 
your needs are. For more information, see [Ingestion 
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+    "\n",
+    "This tutorial uses the multi-stage query (MSQ) task engine and its `/sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint accepts [SQL requests in the JSON-over-HTTP format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body) using the `query`, `context`, and `parameters` fields.\n",
+    "\n",
+    "To learn more about SQL-based ingestion, see [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). 
For information about the endpoint specifically, see [SQL-based ingestion and 
multi-stage query task 
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+    "\n",
+    "\n",
+    "The next cell does the following:\n",
+    "\n",
+    "- Includes a payload that inserts data from an external source into a 
table named wikipedia_api. The payload is in JSON format and included in the 
code directly. You can also store it in a file and provide the file. \n",
+    "- Saves the response to a unique variable that you can reference later to 
identify this ingestion task\n",
+    "\n",
+    "The example uses INSERT, but you could also use REPLACE. \n",
+    "\n",
+    "The MSQ task engine uses a task to ingest data. The response for the API 
includes a `taskId` and `state` for your ingestion. You can use this `taskId` 
to reference this task later on to get more information about it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "362b6a87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO wikipedia_api SELECT 
TIME_PARSE(\\\"timestamp\\\") \\\n",
+    "          AS __time, * FROM TABLE \\\n",
+    "          (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": 
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": 
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, 
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": 
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", 
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
 "name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
+    "          PARTITIONED BY DAY\",\n",
+    "  \"context\": {\n",
+    "    \"maxNumTasks\": 3\n",
+    "  }\n",
+    "})\n",
+    "\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + 
response.text + \".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Extract the `taskId` value from the `ingestion_taskId_response` variable so that you can reference it later:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f578b9b2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ingestion_taskId = 
json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "print(f\"This is the task ID: {ingestion_taskId}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get the status of your task\n",
+    "\n",
+    "The following cell shows you how to get the status of your ingestion 
task. The example continues to run API calls against the endpoint to fetch the 
status until the ingestion task completes. When it's done, you'll see the JSON 
response.\n",
+    "\n",
+    "You can see basic information about your query, such as when it started 
and whether or not it's finished.\n",
+    "\n",
+    "In addition to the status, you can retrieve a full report about the task using `GET /druid/indexer/v1/task/TASK_ID/reports`, but you won't need that information for this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fdbab6ae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_status = json.loads(response.text)['status']['status']\n",
+    "# If you only want to fetch the status once and print it, \n",
+    "# uncomment the print statement and comment out the if and while loops\n",
+    "# print(json.dumps(response.json(), indent=4))\n",
+    "\n",
+    "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "  print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status == \"RUNNING\":\n",
+    "  response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "  ingestion_status = json.loads(response.text)['status']['status']\n",
+    "  time.sleep(15)  \n",
+    "  \n",
+    "if ingestion_status == \"SUCCESS\": \n",
+    "  print(\"The ingestion is complete:\")\n",
+    "else:\n",
+    "  print(f\"The ingestion did not succeed. Status: {ingestion_status}\")\n",
+    "print(json.dumps(response.json(), indent=4))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1336ddd5-8a42-41af-8913-533316221c52",
+   "metadata": {},
+   "source": [
+    "Wait until your ingestion completes before proceeding. Depending on what 
else is happening in your Druid cluster and the resources available, ingestion 
can take some time."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Query your data\n",
+    "\n",
+    "When you ingest data into Druid, Druid stores the data in a datasource, 
and this datasource is what you run queries against.\n",
+    "\n",
+    "### List your datasources\n",
+    "\n",
+    "You can get a list of datasources from the 
`/druid/coordinator/v1/datasources` endpoint. Since you're just getting 
started, there should only be a single datasource, the `wikipedia_api` table 
you created earlier when you ingested external data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "959e3c9b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/coordinator/v1/datasources\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\nThe response is the list of datasources available in your 
Druid deployment: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Query your data\n",
+    "\n",
+    "Now, you can query the data. Because this tutorial is running in Jupyter, 
make sure to limit the size of your query results using `LIMIT`. For example, 
the following cell selects all columns but limits the results to 3 rows for 
display purposes because each row is a JSON object. In actual use cases, you'll 
want to only select the rows that you need. For more information about the 
kinds of things you can do, see [Druid 
SQL](https://druid.apache.org/docs/latest/querying/sql.html).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "694900d0-891f-41bd-9b45-5ae957385244",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\": \"SELECT  * FROM wikipedia_api LIMIT 3\"\n",
+    "})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nEach JSON object in the response represents a row in the 
{dataSourceName} datasource.\") \n",
+    "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), 
indent=4))\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "950b2cc4-9935-497d-a3f5-e89afcc85965",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "In addition to the query, there are a few additional things you can define within the payload. For a full list, see the [Druid SQL API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+    "\n",
+    "This tutorial uses a context parameter and a dynamic parameter.\n",
+    "\n",
+    "Context parameters can control certain characteristics related to a query, such as configuring a custom timeout. For more information, see [Context parameters](https://druid.apache.org/docs/latest/querying/query-context.html). In the example query that follows, the context block assigns a custom `sqlQueryId` to the query. Typically, the `sqlQueryId` is autogenerated. With a custom ID, you can reference the query more easily, for example, if you need to cancel it.\n",
+    "\n",
+    "\n",
+    "Druid supports dynamic parameters, so you can either define certain parameters within the query explicitly or insert a `?` as a placeholder and define it in a parameters block. In the following cell, the `?` gets bound to the timestamp value of `2016-06-27` at execution time. For more information, see [Dynamic parameters](https://druid.apache.org/docs/latest/querying/sql.html#dynamic-parameters).\n",
+    "\n",
+    "\n",
+    "The following cell selects rows where the `__time` column contains a 
value greater than the value defined dynamically in `parameters` and sets a 
custom `sqlQueryId`."

Review Comment:
   I think we need a section to talk about what the heck __time is and why it matters to people. 
   
   This is a super unique Druid concept that most other databases don't have.
   
   I think the important message here is:
   1. Big datasets almost always have time associated with them, for example event data
   2. Druid partitions your data by __time by default. This speeds up queries if you use a WHERE __time.... filter. When you are designing applications, you should think about this.
   3. Traditional queries like count(*) can be slow, because that's by design. Instead, do count(*) WHERE __time.... and you'll have a speedier application.
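   As a sketch of that point (not part of the PR, and assuming the notebook's quickstart host and `wikipedia_api` table), the two query shapes could look like this. The payloads only get built here; the actual POST against a live cluster is left commented out:

   ```python
   import json

   # Host and endpoint match the notebook's examples (assumed quickstart defaults).
   druid_host = "http://localhost:8888"
   endpoint = "/druid/v2/sql"

   # Scans every segment of the datasource -- slow by design on large tables.
   slow_count = json.dumps({
       "query": "SELECT COUNT(*) FROM wikipedia_api"
   })

   # Druid partitions by __time, so a __time filter lets it prune segments
   # outside the interval before scanning anything.
   fast_count = json.dumps({
       "query": (
           "SELECT COUNT(*) FROM wikipedia_api "
           "WHERE __time >= TIMESTAMP '2016-06-27' "
           "AND __time < TIMESTAMP '2016-06-28'"
       )
   })

   # To run against a live quickstart cluster:
   # import requests
   # response = requests.post(druid_host + endpoint,
   #                          headers={"Content-Type": "application/json"},
   #                          data=fast_count)
   # print(response.json())

   print(json.loads(fast_count)["query"])
   ```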



##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,480 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces you to the basics of the Druid API and some of 
the endpoints you might use frequently to perform tasks, such as the 
following:\n",
+    "\n",
+    "- Checking if your cluster is up\n",
+    "- Ingesting data\n",
+    "- Querying data\n",
+    "\n",
+    "Different [Druid server types](https://druid.apache.org/docs/latest/design/processes.html#server-types) are responsible for handling different APIs for the Druid services. For example, the Overlord service on the Master server provides the status of a task. You'll also interact with the Broker service on the Query server to see what datasources are available. And to run queries, you'll interact with the Broker. The Router service on the Query server routes API calls.\n",
+    "\n",
+    "For more information, see the [API 
reference](https://druid.apache.org/docs/latest/operations/api-reference.html), 
which is organized by server type.\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Query your data](#Query-your-data)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need to install the Requests library for Python before you start. For example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-micro-quickstart\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. 
Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` 
by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it 
again with the `port` parameter included. For example, you can use the 
following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service as the `druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSourceName = \"wikipedia_api\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and 
`payload` variables are updated in code cells to call a different Druid 
endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29c24856",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Get basic cluster information\n",
+    "\n",
+    "In this cell, you'll use the `GET /status` endpoint to return basic 
information about your cluster, such as the Druid version, loaded extensions, 
and resource consumption.\n",
+    "\n",
+    "The following cell sets `endpoint` to `/status` and updates the HTTP 
method to `GET`. When you run the cell, you should get a response that starts 
with the version number of your Druid deployment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "baa140b8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbeb5a63",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get cluster health\n",
+    "\n",
+    "The `/status/health` endpoint returns `true` if your cluster is up and 
running. It's useful if you want to do something like programmatically check if 
your cluster is available. When you run the following cell, you should get 
`true` if your Druid cluster has finished starting up and is running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e51170e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# GET \n",
+    "endpoint = \"/status/health\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mResponse\\033[0m: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Now that you've confirmed that your cluster is up and running, you can 
start ingesting data. There are different ways to ingest data based on what 
your needs are. For more information, see [Ingestion 
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+    "\n",
+    "This tutorial uses the multi-stage query (MSQ) task engine and its 
`sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint 
accepts [SQL requests in the JSON-over-HTTP 
format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body)
 using the `query`, `context`, and `parameters` fields\n",
+    "\n",
+    "To learn more about SQL-based ingestion, see [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). 
For information about the endpoint specifically, see [SQL-based ingestion and 
multi-stage query task 
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+    "\n",
+    "\n",
+    "The next cell does the following:\n",
+    "\n",
+    "- Includes a payload that inserts data from an external source into a 
table named wikipedia_api. The payload is in JSON format and included in the 
code directly. You can also store it in a file and provide the file. \n",
+    "- Saves the response to a unique variable that you can reference later to 
identify this ingestion task\n",
+    "\n",
+    "The example uses INSERT, but you could also use REPLACE. \n",
+    "\n",
+    "The MSQ task engine uses a task to ingest data. The response for the API 
includes a `taskId` and `state` for your ingestion. You can use this `taskId` 
to reference this task later on to get more information about it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "362b6a87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO wikipedia_api SELECT 
TIME_PARSE(\\\"timestamp\\\") \\\n",
+    "          AS __time, * FROM TABLE \\\n",
+    "          (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": 
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": 
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, 
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": 
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", 
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
 "name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
+    "          PARTITIONED BY DAY\",\n",

Review Comment:
   I think we should explain what does "PARTITIONED BY DAY" means. Or 
specifically, if you partition by day, Druid creates segment files within the 
partition, and you can only replace, delete and update data at a partition 
level. You'll want to keep the partition large enough (min  500,000 rows) for 
good performance, but not so big that those operations becomes impossible.



##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,480 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces you to the basics of the Druid API and some of 
the endpoints you might use frequently to perform tasks, such as the 
following:\n",
+    "\n",
+    "- Checking if your cluster is up\n",
+    "- Ingesting data\n",
+    "- Querying data\n",
+    "\n",
+    "Different [Druid server 
types](https://druid.apache.org/docs/latest/design/processes.html#server-types) 
 are responsible for handling different APIs for the Druid services. For 
example, the Overlord service on the Master server provides the status of a 
task. You'll also interact the Broker service on the Query Server to see what 
datasources are available. And to run queries, you'll interact with the Broker. 
The Router service on the Query servers routes API calls.\n",
+    "\n",
+    "For more information, see the [API 
reference](https://druid.apache.org/docs/latest/operations/api-reference.html), 
which is organized by server type.\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Query your data](#Query-your-data)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need install the Requests library for Python before you start. For 
example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-micro-quickstart\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. 
Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` 
by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it 
again with the `port` parameter included. For example, you can use the 
following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages 
you'll need and defines a variable for the the Druid host, where the Router 
service listens. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service  as the 
`druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "dataSourceName = \"wikipedia_api\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and 
`payload` variables are updated in code cells to call a different Druid 
endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29c24856",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Get basic cluster information\n",
+    "\n",
+    "In this cell, you'll use the `GET /status` endpoint to return basic 
information about your cluster, such as the Druid version, loaded extensions, 
and resource consumption.\n",
+    "\n",
+    "The following cell sets `endpoint` to `/status` and updates the HTTP 
method to `GET`. When you run the cell, you should get a response that starts 
with the version number of your Druid deployment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "baa140b8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\033[1mResponse\\033[0m: : \\n\" + json.dumps(response.json(), 
indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbeb5a63",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get cluster health\n",
+    "\n",
+    "The `/status/health` endpoint returns `true` if your cluster is up and 
running. It's useful if you want to do something like programmatically check if 
your cluster is available. When you run the following cell, you should get 
`true` if your Druid cluster has finished starting up and is running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e51170e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# GET \n",
+    "endpoint = \"/status/health\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mResponse\\033[0m: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Now that you've confirmed that your cluster is up and running, you can 
start ingesting data. There are different ways to ingest data based on what 
your needs are. For more information, see [Ingestion 
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+    "\n",
+    "This tutorial uses the multi-stage query (MSQ) task engine and its 
`sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint 
accepts [SQL requests in the JSON-over-HTTP 
format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body)
 using the `query`, `context`, and `parameters` fields\n",
+    "\n",
+    "To learn more about SQL-based ingestion, see [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). 
For information about the endpoint specifically, see [SQL-based ingestion and 
multi-stage query task 
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+    "\n",
+    "\n",
+    "The next cell does the following:\n",
+    "\n",
+    "- Includes a payload that inserts data from an external source into a 
table named wikipedia_api. The payload is in JSON format and included in the 
code directly. You can also store it in a file and provide the file. \n",
+    "- Saves the response to a unique variable that you can reference later to 
identify this ingestion task\n",
+    "\n",
+    "The example uses INSERT, but you could also use REPLACE. \n",
+    "\n",
+    "The MSQ task engine uses a task to ingest data. The response for the API 
includes a `taskId` and `state` for your ingestion. You can use this `taskId` 
to reference this task later on to get more information about it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "362b6a87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO wikipedia_api SELECT 
TIME_PARSE(\\\"timestamp\\\") \\\n",
+    "          AS __time, * FROM TABLE \\\n",
+    "          (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": 
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": 
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, 
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": 
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", 
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
 "name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", 
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": 
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, 
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": 
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
+    "          PARTITIONED BY DAY\",\n",
+    "  \"context\": {\n",
+    "    \"maxNumTasks\": 3\n",
+    "  }\n",
+    "})\n",
+    "\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + 
response.text + \".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "Extract the `taskId` value from the `taskId_response` variable so that 
you can reference it later:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f578b9b2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ingestion_taskId = 
json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "print(f\"This is the task ID: {ingestion_taskId}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f17892d9-a8c1-43d6-890c-7d68cd792c72",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get the status of your task\n",
+    "\n",
+    "The following cell shows you how to get the status of your ingestion 
task. The example continues to run API calls against the endpoint to fetch the 
status until the ingestion task completes. When it's done, you'll see the JSON 
response.\n",
+    "\n",
+    "You can see basic information about your query, such as when it started 
and whether or not it's finished.\n",
+    "\n",
+    "In addition to the status, you can retrieve a full report about it if you 
want using `GET /druid/indexer/v1/task/TASK_ID/reports`. But you won't need 
that information for this tutorial."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fdbab6ae",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_status = json.loads(response.text)['status']['status']\n",
+    "# If you only want to fetch the status once and print it, \n",
+    "# uncomment the print statement and comment out the if and while loops\n",
+    "# print(json.dumps(response.json(), indent=4))\n",
+    "\n",
+    "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "  print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status != \"SUCCESS\":\n",
+    "  response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "  ingestion_status = json.loads(response.text)['status']['status']\n",
+    "  time.sleep(15)  \n",
+    "  \n",
+    "if ingestion_status == \"SUCCESS\": \n",
+    "  print(\"The ingestion is complete:\")\n",
+    "  print(json.dumps(response.json(), indent=4))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1336ddd5-8a42-41af-8913-533316221c52",
+   "metadata": {},
+   "source": [
+    "Wait until your ingestion completes before proceeding. Depending on what 
else is happening in your Druid cluster and the resources available, ingestion 
can take some time."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b55af57-9c79-4e45-a22c-438c1b94112e",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Query your data\n",
+    "\n",
+    "When you ingest data into Druid, Druid stores the data in a datasource, 
and this datasource is what you run queries against.\n",
+    "\n",
+    "### List your datasources\n",
+    "\n",
+    "You can get a list of datasources from the 
`/druid/coordinator/v1/datasources` endpoint. Since you're just getting 
started, there should only be a single datasource, the `wikipedia_api` table 
you created earlier when you ingested external data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "959e3c9b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/coordinator/v1/datasources\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\nThe response is the list of datasources available in your 
Druid deployment: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "622f2158-75c9-4b12-bd8a-c92d30994c1f",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Query your data\n",

Review Comment:
   This seems to be repeat of the title "Query your data" above?



##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -0,0 +1,480 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of the Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "This tutorial introduces you to the basics of the Druid API and some of 
the endpoints you might use frequently to perform tasks, such as the 
following:\n",
+    "\n",
+    "- Checking if your cluster is up\n",
+    "- Ingesting data\n",
+    "- Querying data\n",
+    "\n",
+    "Different [Druid server 
types](https://druid.apache.org/docs/latest/design/processes.html#server-types) 
 are responsible for handling different APIs for the Druid services. For 
example, the Overlord service on the Master server provides the status of a 
task. You'll also interact the Broker service on the Query Server to see what 
datasources are available. And to run queries, you'll interact with the Broker. 
The Router service on the Query servers routes API calls.\n",
+    "\n",
+    "For more information, see the [API 
reference](https://druid.apache.org/docs/latest/operations/api-reference.html), 
which is organized by server type.\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Get basic cluster information](#Get-basic-cluster-information)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Query your data](#Query-your-data)\n",
+    "- [Next steps](#Next-steps)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need install the Requests library for Python before you start. For 
example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-micro-quickstart\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either JupyterLab (recommended) or Jupyter Notebook. 
Both the quickstart Druid cluster and Jupyter are deployed at `localhost:8888` 
by default, so you'll \n",
+    "need to change the port for Jupyter. To do so, stop Jupyter and start it 
again with the `port` parameter included. For example, you can use the 
following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages 
you'll need and defines a variable for the the Druid host, where the Router 
service listens. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service  as the 
`druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "dataSourceName = \"wikipedia_api\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2093ecf0-fb4b-405b-a216-094583580e0a",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint`, `http_method`, and 
`payload` variables are updated in code cells to call a different Druid 
endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29c24856",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Get basic cluster information\n",
+    "\n",
+    "In this cell, you'll use the `GET /status` endpoint to return basic 
information about your cluster, such as the Druid version, loaded extensions, 
and resource consumption.\n",
+    "\n",
+    "The following cell sets `endpoint` to `/status` and updates the HTTP 
method to `GET`. When you run the cell, you should get a response that starts 
with the version number of your Druid deployment."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "baa140b8",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "print(\"\\033[1mResponse\\033[0m: : \\n\" + json.dumps(response.json(), 
indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbeb5a63",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "### Get cluster health\n",
+    "\n",
+    "The `/status/health` endpoint returns `true` if your cluster is up and 
running. It's useful if you want to do something like programmatically check if 
your cluster is available. When you run the following cell, you should get 
`true` if your Druid cluster has finished starting up and is running."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e51170e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# GET \n",
+    "endpoint = \"/status/health\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mResponse\\033[0m: \" + response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1de51db8-4c51-4b7e-bb3b-734ff15c8ab3",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Now that you've confirmed that your cluster is up and running, you can 
start ingesting data. There are different ways to ingest data based on what 
your needs are. For more information, see [Ingestion 
methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods).\n",
+    "\n",
+    "This tutorial uses the multi-stage query (MSQ) task engine and its 
`sql/task` endpoint to perform SQL-based ingestion. The `/sql/task` endpoint 
accepts [SQL requests in the JSON-over-HTTP 
format](https://druid.apache.org/docs/latest/querying/sql-api.html#request-body)
 using the `query`, `context`, and `parameters` fields\n",
+    "\n",
+    "To learn more about SQL-based ingestion, see [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). 
For information about the endpoint specifically, see [SQL-based ingestion and 
multi-stage query task 
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html).\n",
+    "\n",
+    "\n",
+    "The next cell does the following:\n",
+    "\n",
+    "- Includes a payload that inserts data from an external source into a 
table named wikipedia_api. The payload is in JSON format and included in the 
code directly. You can also store it in a file and provide the file. \n",
+    "- Saves the response to a unique variable that you can reference later to 
identify this ingestion task\n",
+    "\n",
+    "The example uses INSERT, but you could also use REPLACE. \n",
+    "\n",
+    "The MSQ task engine uses a task to ingest data. The response for the API 
includes a `taskId` and `state` for your ingestion. You can use this `taskId` 
to reference this task later on to get more information about it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "362b6a87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO wikipedia_api SELECT 
TIME_PARSE(\\\"timestamp\\\") \\\n",

Review Comment:
   One of the scenario we need to account for is a failed ingest or users are 
continuing this tutorial. Adding a comment maybe highlight that you see the 
datasource already exist, use "REPLACE INTO" instead.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to