writer-jill commented on code in PR #13787:
URL: https://github.com/apache/druid/pull/13787#discussion_r1114651673
##########
docs/tutorials/tutorial-jupyter-index.md:
##########
@@ -22,50 +22,85 @@ title: "Jupyter Notebook tutorials"
~ under the License.
-->
-<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md share a lot of the same
content. If you make a change in one place, update the other too. -->
+<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md
+ share a lot of the same content. If you make a change in one place, update
the other
+ too. -->
-You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These tutorials provide snippets of Python code that you can use to run calls
against the Druid API to complete the tutorial.
+You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These
+tutorials provide snippets of Python code that you can use to run calls against
+the Druid API to complete the tutorial.
## Prerequisites
Make sure you meet the following requirements before starting the
Jupyter-based tutorials:
-- Python 3
+- Python 3
+
+- The `requests` package for Python. For example, you can install it with the
following command:
-- The `requests` package for Python. For example, you can install it with the
following command:
-
```bash
pip3 install requests
```
-- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on
a different port.
+- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid
+ and Jupyter both try to use port `8888`, so start Jupyter on a different
port.
- Install JupyterLab or Notebook:
-
- ```bash
- # Install JupyterLab
- pip3 install jupyterlab
- # Install Jupyter Notebook
- pip3 install notebook
- ```
- - Start Jupyter
- - JupyterLab
+
+ ```bash
+ # Install JupyterLab
+ pip3 install jupyterlab
+ # Install Jupyter Notebook
+ pip3 install notebook
+ ```
+ - Start Jupyter:
+ - JupyterLab
```bash
# Start JupyterLab on port 3001
jupyter lab --port 3001
```
- Jupyter Notebook
- ```bash
- # Start Jupyter Notebook on port 3001
- jupyter notebook --port 3001
- ```
+ ```bash
+ # Start Jupyter Notebook on port 3001
+ jupyter notebook --port 3001
+ ```
+
+- An available Druid instance. You can use the [Quickstart
(local)](./index.md) instance. The tutorials
+ assume that you are using the quickstart, so no authentication or
authorization
+ is expected unless explicitly mentioned.
+
+ Druid developers can use a cluster launched for an integration test:
+
+ ```bash
+ cd $DRUID_DEV
+ ./it.sh build
+ ./it.sh image
+ ./it.sh up <category>
+ ```
+
+ Where `DRUID_DEV` points to your Druid source code repo, and `<catagory>` is
one
Review Comment:
```suggestion
Replace `DRUID_DEV` with your Druid source code repo, and `<category>`
with one
```
##########
examples/quickstart/jupyter-notebooks/README.md:
##########
@@ -57,33 +65,25 @@ Make sure you meet the following requirements before
starting the Jupyter-based
jupyter notebook --port 3001
```
-- An available Druid instance. You can use the `micro-quickstart`
configuration described in [Quickstart
(local)](../../../docs/tutorials/index.md). The tutorials assume that you are
using the quickstart, so no authentication or authorization is expected unless
explicitly mentioned.
-
-## Tutorials
-
-The notebooks are located in the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
You can either clone the repo or download the notebooks you want individually.
-
-The links that follow are the raw GitHub URLs, so you can use them to download
the notebook directly, such as with `wget`, or manually through your web
browser. Note that if you save the file from your web browser, make sure to
remove the `.txt` extension.
-
-- [Introduction to the Druid API](api-tutorial.ipynb) walks you through some
of the basics related to the Druid API and several endpoints.
-
-## Contributing
-
-If you build a Jupyter tutorial, you need to do a few things to add it to the
docs in addition to saving the notebook in this directory. The process requires
two PRs to the repo.
-
-For the first PR, do the following:
-
-1. Clear the outputs from your notebook before you make the PR. You can use
the following command:
+- An available Druid instance. You can use the `micro-quickstart` configuration
+ described in
[Quickstart](https://druid.apache.org/docs/latest/tutorials/index.html).
+ The tutorials assume that you are using the quickstart, so no authentication
or authorization
+ is expected unless explicitly mentioned.
- ```bash
- jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
./path/to/notebook/notebookName.ipynb
- ```
+ Druid developers can use a cluster launched for an integration test:
-2. Create the PR as you normally would. Make sure to note that this PR is the
one that contains only the Jupyter notebook and that there will be a subsequent
PR that updates related pages.
+ ```bash
+ cd $DRUID_DEV
+ ./it.sh build
+ ./it.sh image
+ ./it.sh up <category>
+ ```
-3. After this first PR is merged, grab the "raw" URL for the file from GitHub.
For example, navigate to the file in the GitHub web UI and select **Raw**. Use
the URL for this in the second PR as the download link.
+ Where `DRUID_DEV` points to your Druid source code repo, and `<catagory>` is
one
Review Comment:
```suggestion
Replace `DRUID_DEV` with your Druid source code repo, and `<category>`
with one
```
##########
docs/tutorials/tutorial-jupyter-index.md:
##########
@@ -22,50 +22,85 @@ title: "Jupyter Notebook tutorials"
~ under the License.
-->
-<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md share a lot of the same
content. If you make a change in one place, update the other too. -->
+<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md
+ share a lot of the same content. If you make a change in one place, update
the other
+ too. -->
-You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These tutorials provide snippets of Python code that you can use to run calls
against the Druid API to complete the tutorial.
+You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These
+tutorials provide snippets of Python code that you can use to run calls against
+the Druid API to complete the tutorial.
## Prerequisites
Make sure you meet the following requirements before starting the
Jupyter-based tutorials:
-- Python 3
+- Python 3
+
+- The `requests` package for Python. For example, you can install it with the
following command:
-- The `requests` package for Python. For example, you can install it with the
following command:
-
```bash
pip3 install requests
```
-- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on
a different port.
+- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid
+ and Jupyter both try to use port `8888`, so start Jupyter on a different
port.
- Install JupyterLab or Notebook:
-
- ```bash
- # Install JupyterLab
- pip3 install jupyterlab
- # Install Jupyter Notebook
- pip3 install notebook
- ```
- - Start Jupyter
- - JupyterLab
+
+ ```bash
+ # Install JupyterLab
+ pip3 install jupyterlab
+ # Install Jupyter Notebook
+ pip3 install notebook
+ ```
+ - Start Jupyter:
+ - JupyterLab
```bash
# Start JupyterLab on port 3001
jupyter lab --port 3001
```
- Jupyter Notebook
- ```bash
- # Start Jupyter Notebook on port 3001
- jupyter notebook --port 3001
- ```
+ ```bash
+ # Start Jupyter Notebook on port 3001
+ jupyter notebook --port 3001
+ ```
+
+- An available Druid instance. You can use the [Quickstart
(local)](./index.md) instance. The tutorials
+ assume that you are using the quickstart, so no authentication or
authorization
+ is expected unless explicitly mentioned.
+
+ Druid developers can use a cluster launched for an integration test:
Review Comment:
```suggestion
If you're a Druid developer, you can use a cluster launched for an
integration test:
```
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -151,7 +209,18 @@
"source": [
"### Get cluster health\n",
"\n",
- "The `/status/health` endpoint returns `true` if your cluster is up and
running. It's useful if you want to do something like programmatically check if
your cluster is available. When you run the following cell, you should get
`true` if your Druid cluster has finished starting up and is running."
+ "The `/status/health` endpoint returns JSON `true` if your cluster is up
and running. It's useful if you want to do something like programmatically
check if your cluster is available. When you run the following cell, you should
get the `True` Python value if your Druid cluster has finished starting up and
is running."
Review Comment:
```suggestion
"The `/status/health` endpoint returns JSON `true` if your cluster is up
and running. It's useful if you want to do something like programmatically
check if your cluster is available. When you run the following cell, you should
receive the `True` Python value if your Druid cluster has finished starting up
and is running."
```
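The distinction the suggestion draws between JSON `true` and the Python `True` value comes from the deserialization that Requests performs on the response body. A minimal stdlib sketch of the same mapping:

```python
import json

# json.loads maps the JSON literal `true` to Python's bool True,
# which is what response.json() yields for a healthy /status/health call.
health = json.loads("true")
print(health)  # → True
```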
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -376,25 +546,54 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "694900d0-891f-41bd-9b45-5ae957385244",
+ "id": "44868ff9",
"metadata": {},
"outputs": [],
"source": [
- "endpoint = \"/druid/v2/sql\"\n",
- "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
- "http_method = \"POST\"\n",
- "\n",
- "payload = json.dumps({\n",
- " \"query\": \"SELECT * FROM wikipedia_api LIMIT 3\"\n",
- "})\n",
- "headers = {'Content-Type': 'application/json'}\n",
- "\n",
- "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
- "\n",
- "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
- "print(f\"\\nEach JSON object in the response represents a row in the
{dataSourceName} datasource.\") \n",
- "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(),
indent=4))\n",
- "\n"
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b2b366ad",
+ "metadata": {},
+ "source": [
+ "As we did for ingestion, define a query, then create a `SQLRequest`
object as a Python `dict`."
Review Comment:
```suggestion
"Define a query, then create a `SQLRequest` object as a Python `dict`."
```
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -210,60 +286,141 @@
"\n",
"To learn more, see
[Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
"\n",
- "Now, run the next cell to start the ingestion."
+ "The query uses `INSERT INTO`. If you have an existing datasource with the
name `wikipedia_api`, use `REPLACE INTO` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "362b6a87",
+ "id": "90c34908",
"metadata": {},
"outputs": [],
"source": [
- "endpoint = \"/druid/v2/sql/task\"\n",
- "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
- "http_method = \"POST\"\n",
- "\n",
+ "sql = '''\n",
+ "INSERT INTO wikipedia_api \n",
+ "SELECT \n",
+ " TIME_PARSE(\"timestamp\") AS __time,\n",
+ " * \n",
+ "FROM TABLE(EXTERN(\n",
+ " '{\"type\": \"http\", \"uris\":
[\"https://druid.apache.org/data/wikipedia.json.gz\"]}', \n",
+ " '{\"type\": \"json\"}', \n",
+ " '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\",
\"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"},
{\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\",
\"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"},
{\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\",
\"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\":
\"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\":
\"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\":
\"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\":
\"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\":
\"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\":
\"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\":
\"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\":
\"string\"}, {\"name\": \"regionIsoCode\", \
"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"},
{\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\":
\"string\"}]'\n",
+ " ))\n",
+ "PARTITIONED BY DAY\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7dcddd7",
+ "metadata": {},
+ "source": [
+ "The query is included inline here. You can also store it in a file and
provide the file.\n",
"\n",
- "# The query uses INSERT INTO. If you have an existing datasource with the
name wikipedia_api, use REPLACE INTO instead.\n",
- "payload = json.dumps({\n",
- "\"query\": \"INSERT INTO wikipedia_api SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
- " AS __time, * FROM TABLE \\\n",
- " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
- " PARTITIONED BY DAY\",\n",
- " \"context\": {\n",
- " \"maxNumTasks\": 3\n",
+ "Just as Requests can convert the response from JSON to Python, it can
also convert the request from Python to JSON. The next cell builds up a Python
map that represents the Druid `SqlRequest` object. In this case, we need the
query and a context variable to set the task count to 3."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b6e82c0a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " 'query': sql,\n",
+ " 'context': {\n",
+ " 'maxNumTasks': 3\n",
" }\n",
- "})\n",
- "\n",
- "headers = {'Content-Type': 'application/json'}\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6aa7f230",
+ "metadata": {},
+ "source": [
+ "With the SQL request ready, we now use the the `json` parameter to the
`Session` `post` method to send a `POST` request with our object as the
payload. The result is a Requests `Response` which we save in a variable.\n",
"\n",
- "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
- "ingestion_taskId_response = response\n",
- "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
- "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
- "print(\"\\nThe response includes the task ID and the status: \" +
response.text + \".\")"
+ "Now, run the next cell to start the ingestion."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e2939a07",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.post(endpoint, json=sql_request)"
]
},
{
"cell_type": "markdown",
- "id": "c1235e99-be72-40b0-b7f9-9e860e4932d7",
+ "id": "9ba1821f",
"metadata": {
"tags": []
},
"source": [
- "Extract the `taskId` value from the `taskId_response` variable so that
you can reference it later:"
+ "The MSQ task engine uses a task to ingest data. The response for the API
includes a `taskId` and `state` for your ingestion. You can use this `taskId`
to reference this task later on to get more information about it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f9cc2e45",
+ "metadata": {},
+ "source": [
+ "It is good to check that the response suceeded by checking the return
status. It should be 20x. (202 means \"accepted\".) If the response is
something else (4xx, say) then display `response.text` to see the error
message."
Review Comment:
```suggestion
"It is good practice to ensure that the response succeeded by checking
the return status. The status should be 20x. (202 means \"accepted\".) If the
response is something else, such as 4xx, display `response.text` to see the
error message."
```
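The status check this suggestion describes can be sketched as a small helper. The `is_success` name is hypothetical, not part of the notebook:

```python
def is_success(status_code: int) -> bool:
    # Any 2xx status indicates success; 202 specifically means "accepted".
    return 200 <= status_code < 300

print(is_success(202))  # → True
print(is_success(404))  # → False
```

In the notebook itself, the equivalent check would inspect `response.status_code` and fall back to printing `response.text` on failure.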
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -119,7 +144,28 @@
"\n",
"In this cell, you'll use the `GET /status` endpoint to return basic
information about your cluster, such as the Druid version, loaded extensions,
and resource consumption.\n",
"\n",
- "The following cell sets `endpoint` to `/status` and updates the HTTP
method to `GET`. When you run the cell, you should get a response that starts
with the version number of your Druid deployment."
+ "The following cell sets `endpoint` to `/status` and updates the HTTP
method to `GET`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8a1b453e",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1e853795",
+ "metadata": {},
+ "source": [
+ "The Requests `Session` has a `get()` method that posts an HTTP `GET`
request. The method takes multiple arguments. Here we only need the URL. The
method returns a Requests `Response` object, which can convert the returned
result to JSON. When you run the cell, you should get a response that starts
with the version number of your Druid deployment."
Review Comment:
```suggestion
"The Requests `Session` has a `get()` method that posts an HTTP `GET`
request. The method takes multiple arguments. Here you only need the URL. The
method returns a Requests `Response` object, which can convert the returned
result to JSON. When you run the cell, you should get a response that starts
with the version number of your Druid deployment."
```
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -91,21 +102,35 @@
"outputs": [],
"source": [
"import requests\n",
- "import json\n",
"\n",
- "# druid_host is the hostname and port for your Druid deployment. \n",
- "# In a distributed environment, you can point to other Druid services. In
this tutorial, you'll use the Router service as the `druid_host`. \n",
"druid_host = \"http://localhost:8888\"\n",
- "dataSourceName = \"wikipedia_api\"\n",
- "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a22c69c8",
+ "metadata": {},
+ "source": [
+ "If your cluster is secure, you'll need to provide authorization
information on each request. That can be automated by using the Requests
`session` feature. Although we assume no authorization here, we'll still use a
session to show how it is done."
Review Comment:
```suggestion
"If your cluster is secure, you'll need to provide authorization
information on each request. You can automate it by using the Requests
`session` feature. Although this tutorial assumes no authorization, the
configuration below defines a session as an example."
```
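The session-based authorization the suggestion mentions can be sketched as follows, assuming the `requests` package and placeholder credentials (the quickstart itself needs none):

```python
import requests

# A Session stores connection settings once and applies them to every request.
session = requests.Session()

# Placeholder credentials for illustration only; a secured cluster would
# supply real ones, and subsequent calls such as session.get(url) would
# then send HTTP Basic auth automatically.
session.auth = ('admin', 'password1')
```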
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -131,15 +177,27 @@
},
"outputs": [],
"source": [
- "endpoint = \"/status\"\n",
- "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
- "http_method = \"GET\"\n",
- "\n",
- "payload = {}\n",
- "headers = {}\n",
- "\n",
- "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
- "print(\"\\033[1mResponse\\033[0m: : \\n\" + json.dumps(response.json(),
indent=4))"
+ "response = session.get(endpoint)\n",
+ "json = response.json()\n",
+ "json"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "de82029e",
+ "metadata": {},
+ "source": [
+ "Since Druid's responses are JSON, and Requests converted the JSON to a
set of Python dicts (maps) and arrays, we can easily pull out the information
we want. For example, to seee just the version:"
Review Comment:
"you" instead of "we"
##########
examples/quickstart/jupyter-notebooks/README.md:
##########
@@ -57,33 +65,25 @@ Make sure you meet the following requirements before
starting the Jupyter-based
jupyter notebook --port 3001
```
-- An available Druid instance. You can use the `micro-quickstart`
configuration described in [Quickstart
(local)](../../../docs/tutorials/index.md). The tutorials assume that you are
using the quickstart, so no authentication or authorization is expected unless
explicitly mentioned.
-
-## Tutorials
-
-The notebooks are located in the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
You can either clone the repo or download the notebooks you want individually.
-
-The links that follow are the raw GitHub URLs, so you can use them to download
the notebook directly, such as with `wget`, or manually through your web
browser. Note that if you save the file from your web browser, make sure to
remove the `.txt` extension.
-
-- [Introduction to the Druid API](api-tutorial.ipynb) walks you through some
of the basics related to the Druid API and several endpoints.
-
-## Contributing
-
-If you build a Jupyter tutorial, you need to do a few things to add it to the
docs in addition to saving the notebook in this directory. The process requires
two PRs to the repo.
-
-For the first PR, do the following:
-
-1. Clear the outputs from your notebook before you make the PR. You can use
the following command:
+- An available Druid instance. You can use the `micro-quickstart` configuration
+ described in
[Quickstart](https://druid.apache.org/docs/latest/tutorials/index.html).
+ The tutorials assume that you are using the quickstart, so no authentication
or authorization
+ is expected unless explicitly mentioned.
- ```bash
- jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
./path/to/notebook/notebookName.ipynb
- ```
+ Druid developers can use a cluster launched for an integration test:
Review Comment:
```suggestion
If you're a Druid developer, you can use a cluster launched for an
integration test:
```
##########
examples/quickstart/jupyter-notebooks/README.md:
##########
@@ -57,33 +65,25 @@ Make sure you meet the following requirements before
starting the Jupyter-based
jupyter notebook --port 3001
```
-- An available Druid instance. You can use the `micro-quickstart`
configuration described in [Quickstart
(local)](../../../docs/tutorials/index.md). The tutorials assume that you are
using the quickstart, so no authentication or authorization is expected unless
explicitly mentioned.
-
-## Tutorials
-
-The notebooks are located in the [apache/druid
repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).
You can either clone the repo or download the notebooks you want individually.
-
-The links that follow are the raw GitHub URLs, so you can use them to download
the notebook directly, such as with `wget`, or manually through your web
browser. Note that if you save the file from your web browser, make sure to
remove the `.txt` extension.
-
-- [Introduction to the Druid API](api-tutorial.ipynb) walks you through some
of the basics related to the Druid API and several endpoints.
-
-## Contributing
-
-If you build a Jupyter tutorial, you need to do a few things to add it to the
docs in addition to saving the notebook in this directory. The process requires
two PRs to the repo.
-
-For the first PR, do the following:
-
-1. Clear the outputs from your notebook before you make the PR. You can use
the following command:
+- An available Druid instance. You can use the `micro-quickstart` configuration
+ described in
[Quickstart](https://druid.apache.org/docs/latest/tutorials/index.html).
+ The tutorials assume that you are using the quickstart, so no authentication
or authorization
+ is expected unless explicitly mentioned.
- ```bash
- jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
./path/to/notebook/notebookName.ipynb
- ```
+ Druid developers can use a cluster launched for an integration test:
-2. Create the PR as you normally would. Make sure to note that this PR is the
one that contains only the Jupyter notebook and that there will be a subsequent
PR that updates related pages.
+ ```bash
+ cd $DRUID_DEV
+ ./it.sh build
+ ./it.sh image
+ ./it.sh up <category>
+ ```
-3. After this first PR is merged, grab the "raw" URL for the file from GitHub.
For example, navigate to the file in the GitHub web UI and select **Raw**. Use
the URL for this in the second PR as the download link.
+ Where `DRUID_DEV` points to your Druid source code repo, and `<catagory>` is
one
+ of the available integration test categories. See the integration test
`README.md`
+ for details.
-For the second PR, do the following:
+## Continue in Jupyter
-1. Update the list of [Tutorials](#tutorials) on this page and in the [
Jupyter tutorial index
page](../../../docs/tutorials/tutorial-jupyter-index.md#tutorials) in the
`docs/tutorials` directory.
-2. Update `tutorial-jupyter-index.md` and provide the URL to the raw version
of the file that becomes available after the first PR is merged.
+Fire up Jupyter (see above) and navigate to the "- START HERE -" page for more
Review Comment:
```suggestion
Start Jupyter (see above) and navigate to the "- START HERE -" page for more
```
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -210,60 +286,141 @@
"\n",
"To learn more, see
[Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
"\n",
- "Now, run the next cell to start the ingestion."
+ "The query uses `INSERT INTO`. If you have an existing datasource with the
name `wikipedia_api`, use `REPLACE INTO` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "362b6a87",
+ "id": "90c34908",
"metadata": {},
"outputs": [],
"source": [
- "endpoint = \"/druid/v2/sql/task\"\n",
- "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
- "http_method = \"POST\"\n",
- "\n",
+ "sql = '''\n",
+ "INSERT INTO wikipedia_api \n",
+ "SELECT \n",
+ " TIME_PARSE(\"timestamp\") AS __time,\n",
+ " * \n",
+ "FROM TABLE(EXTERN(\n",
+ " '{\"type\": \"http\", \"uris\":
[\"https://druid.apache.org/data/wikipedia.json.gz\"]}', \n",
+ " '{\"type\": \"json\"}', \n",
+ " '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\",
\"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"},
{\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\",
\"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"},
{\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\",
\"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\":
\"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\":
\"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\":
\"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\":
\"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\":
\"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\":
\"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\":
\"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\":
\"string\"}, {\"name\": \"regionIsoCode\", \
"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"},
{\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\":
\"string\"}]'\n",
+ " ))\n",
+ "PARTITIONED BY DAY\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7dcddd7",
+ "metadata": {},
+ "source": [
+ "The query is included inline here. You can also store it in a file and
provide the file.\n",
"\n",
- "# The query uses INSERT INTO. If you have an existing datasource with the
name wikipedia_api, use REPLACE INTO instead.\n",
- "payload = json.dumps({\n",
- "\"query\": \"INSERT INTO wikipedia_api SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
- " AS __time, * FROM TABLE \\\n",
- " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
- " PARTITIONED BY DAY\",\n",
- " \"context\": {\n",
- " \"maxNumTasks\": 3\n",
+ "Just as Requests can convert the response from JSON to Python, it can
also convert the request from Python to JSON. The next cell builds up a Python
map that represents the Druid `SqlRequest` object. In this case, we need the
query and a context variable to set the task count to 3."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b6e82c0a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " 'query': sql,\n",
+ " 'context': {\n",
+ " 'maxNumTasks': 3\n",
" }\n",
- "})\n",
- "\n",
- "headers = {'Content-Type': 'application/json'}\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6aa7f230",
+ "metadata": {},
+ "source": [
+ "With the SQL request ready, we now use the the `json` parameter to the
`Session` `post` method to send a `POST` request with our object as the
payload. The result is a Requests `Response` which we save in a variable.\n",
Review Comment:
```suggestion
"With the SQL request ready, use the `json` parameter to the
`Session` `post` method to send a `POST` request with our object as the
payload. The result is a Requests `Response` which you save in a variable.\n",
```
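For context (not part of the PR): a minimal offline sketch of what the `json=` parameter does, using `requests.Request(...).prepare()` so no running Druid instance is needed. The localhost URL is a placeholder, and `'SELECT 1'` stands in for the tutorial's real `INSERT` query.

```python
import json
import requests

# Placeholder endpoint; a real tutorial run targets a live Druid router.
url = 'http://localhost:8888/druid/v2/sql/task'

sql_request = {
    'query': 'SELECT 1',              # stands in for the tutorial's INSERT query
    'context': {'maxNumTasks': 3},
}

# Prepare (but do not send) the POST to inspect what `json=` produces:
# requests serializes the dict and sets the Content-Type header itself.
req = requests.Request('POST', url, json=sql_request).prepare()

print(req.headers['Content-Type'])          # application/json
print(json.loads(req.body) == sql_request)  # True
```

Sending it for real is then just `requests.Session().post(url, json=sql_request)`, which returns the `Response` the text describes.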
##########
examples/quickstart/jupyter-notebooks/api-tutorial.ipynb:
##########
@@ -210,60 +286,141 @@
"\n",
"To learn more, see
[Partitioning](https://druid.apache.org/docs/latest/ingestion/partitioning.html).\n",
"\n",
- "Now, run the next cell to start the ingestion."
+ "The query uses `INSERT INTO`. If you have an existing datasource with the
name `wikipedia_api`, use `REPLACE INTO` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
- "id": "362b6a87",
+ "id": "90c34908",
"metadata": {},
"outputs": [],
"source": [
- "endpoint = \"/druid/v2/sql/task\"\n",
- "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
- "http_method = \"POST\"\n",
- "\n",
+ "sql = '''\n",
+ "INSERT INTO wikipedia_api \n",
+ "SELECT \n",
+ " TIME_PARSE(\"timestamp\") AS __time,\n",
+ " * \n",
+ "FROM TABLE(EXTERN(\n",
+ " '{\"type\": \"http\", \"uris\":
[\"https://druid.apache.org/data/wikipedia.json.gz\"]}', \n",
+ " '{\"type\": \"json\"}', \n",
+ " '[{\"name\": \"added\", \"type\": \"long\"}, {\"name\": \"channel\",
\"type\": \"string\"}, {\"name\": \"cityName\", \"type\": \"string\"},
{\"name\": \"comment\", \"type\": \"string\"}, {\"name\": \"commentLength\",
\"type\": \"long\"}, {\"name\": \"countryIsoCode\", \"type\": \"string\"},
{\"name\": \"countryName\", \"type\": \"string\"}, {\"name\": \"deleted\",
\"type\": \"long\"}, {\"name\": \"delta\", \"type\": \"long\"}, {\"name\":
\"deltaBucket\", \"type\": \"string\"}, {\"name\": \"diffUrl\", \"type\":
\"string\"}, {\"name\": \"flags\", \"type\": \"string\"}, {\"name\":
\"isAnonymous\", \"type\": \"string\"}, {\"name\": \"isMinor\", \"type\":
\"string\"}, {\"name\": \"isNew\", \"type\": \"string\"}, {\"name\":
\"isRobot\", \"type\": \"string\"}, {\"name\": \"isUnpatrolled\", \"type\":
\"string\"}, {\"name\": \"metroCode\", \"type\": \"string\"}, {\"name\":
\"namespace\", \"type\": \"string\"}, {\"name\": \"page\", \"type\":
\"string\"}, {\"name\": \"regionIsoCode\", \
"type\": \"string\"}, {\"name\": \"regionName\", \"type\": \"string\"},
{\"name\": \"timestamp\", \"type\": \"string\"}, {\"name\": \"user\", \"type\":
\"string\"}]'\n",
+ " ))\n",
+ "PARTITIONED BY DAY\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f7dcddd7",
+ "metadata": {},
+ "source": [
+ "The query is included inline here. You can also store it in a file and
provide the file.\n",
"\n",
- "# The query uses INSERT INTO. If you have an existing datasource with the
name wikipedia_api, use REPLACE INTO instead.\n",
- "payload = json.dumps({\n",
- "\"query\": \"INSERT INTO wikipedia_api SELECT
TIME_PARSE(\\\"timestamp\\\") \\\n",
- " AS __time, * FROM TABLE \\\n",
- " (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\":
[\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\":
\\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"},
{\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\":
\\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\",
\\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\
"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\",
\\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\":
\\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"},
{\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\":
\\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
- " PARTITIONED BY DAY\",\n",
- " \"context\": {\n",
- " \"maxNumTasks\": 3\n",
+ "Just as Requests can convert the response from JSON to Python, it can
also convert the request from Python to JSON. The next cell builds up a Python
map that represents the Druid `SqlRequest` object. In this case, we need the
query and a context variable to set the task count to 3."
Review Comment:
```suggestion
"As well as converting the response from JSON to Python, Requests can
also convert the request from Python to JSON. The next cell builds up a Python
map that represents the Druid `SqlRequest` object. In this case, you need the
query and a context variable to set the task count to 3."
```
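An offline illustration of the response half of that round trip (not part of the PR): build a `Response` by hand via the private `_content` attribute, for demonstration only, and let `.json()` turn the JSON body back into Python. The task ID and state values are made up.

```python
import requests

# Hand-built Response; _content is private and used here only so the
# example runs without a server. The payload values are invented.
resp = requests.models.Response()
resp.status_code = 200
resp._content = b'{"taskId": "query-1234", "state": "RUNNING"}'

body = resp.json()   # JSON body -> Python dict
print(body['state'])  # RUNNING
```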
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]