vtlim commented on code in PR #13787:
URL: https://github.com/apache/druid/pull/13787#discussion_r1119290458
##########
docs/tutorials/tutorial-jupyter-index.md:
##########
@@ -22,51 +22,86 @@ title: "Jupyter Notebook tutorials"
~ under the License.
-->
-<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md share a lot of the same
content. If you make a change in one place, update the other too. -->
+<!-- tutorial-jupyter-index.md and
examples/quickstart/juptyer-notebooks/README.md
+ share a lot of the same content. If you make a change in one place, update
the other
+ too. -->
-You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These tutorials provide snippets of Python code that you can use to run calls
against the Druid API to complete the tutorial.
+You can try out the Druid APIs using the Jupyter Notebook-based tutorials.
These
+tutorials provide snippets of Python code that you can use to run calls against
+the Druid API to complete the tutorial.
-## Prerequisites
+## Prerequisites
Make sure you meet the following requirements before starting the
Jupyter-based tutorials:
-- Python 3
+- Python 3.7 or later
+
+- The `requests` package for Python. For example, you can install it with the
following command:
-- The `requests` package for Python. For example, you can install it with the
following command:
-
```bash
pip3 install requests
```
-- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid and Jupyter both try to use port `8888,` so start Jupyter on
a different port.
+- JupyterLab (recommended) or Jupyter Notebook running on a non-default port.
By default, Druid
+ and Jupyter both try to use port `8888`, so start Jupyter on a different
port.
- Install JupyterLab or Notebook:
-
- ```bash
- # Install JupyterLab
- pip3 install jupyterlab
- # Install Jupyter Notebook
- pip3 install notebook
- ```
- - Start JupyterLab
-
+
+ ```bash
+ # Install JupyterLab
+ pip3 install jupyterlab
+ # Install Jupyter Notebook
+ pip3 install notebook
+ ```
+ - Start Jupyter using either JupyterLab
+ ```bash
+ # Start JupyterLab on port 3001
+ jupyter lab --port 3001
+ ```
+
+ Or using Jupyter Notebook
```bash
- # Start JupyterLab on port 3001
- jupyter lab --port 3001
- ```
- - Alternatively, start Jupyter Notebook
- ```bash
- # Start Jupyter Notebook on port 3001
- jupyter notebook --port 3001
- ```
+ # Start Jupyter Notebook on port 3001
+ jupyter notebook --port 3001
+ ```
+
+- An available Druid instance. You can use the [Quickstart
(local)](./index.md) instance. The tutorials
+ assume that you are using the quickstart, so no authentication or
authorization
+ is expected unless explicitly mentioned.
+
+ If you contribute to Druid, and work with Druid integration tests, can use a
test cluster.
+ Assume you have an environment variable, `DRUID_DEV`, which identifies your
Druid source repo.
+
+ ```bash
+ cd $DRUID_DEV
+ ./it.sh build
+ ./it.sh image
+ ./it.sh up <category>
+ ```
+
+ Replace `<catagory>` with one of the available integration test categories.
See the integration
Review Comment:
```suggestion
Replace `<category>` with one of the available integration test
categories. See the integration
```
##########
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb:
##########
@@ -0,0 +1,751 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ce2efaaa",
+ "metadata": {},
+ "source": [
+ "# Learn the Druid Python API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This notebook provides a quick introduction to the Python wrapper around
the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are
familiar with the basics of the REST API, and the [set of operations which
Druid
provides](https://druid.apache.org/docs/latest/operations/api-reference.html).
This tutorial focuses on using Python to access those APIs rather than
explaining the APIs themselves. The APIs themselves are covered in other
notebooks that use the Python API.\n",
+ "\n",
+ "The Druid Python API is primarily intended to help with these notebook
tutorials. It can also be used in your own ad-hoc notebooks, or in a regular
Python program.\n",
+ "\n",
+ "The Druid Python API is a work in progress. The Druid team adds API
wrappers as needed for the notebook tutorials. If you find you need additional
wrappers, please feel free to add them, and post a PR to Apache Druid with your
additions.\n",
+ "\n",
+ "The API provides two levels of functions. Most are simple wrappers around
Druid's REST APIs. Others add additional code to make the API easier to use.
The SQL query interface is a prime example: extra code translates a simple SQL
query into Druid's `SQLQuery` object and interprets the results into a form
that can be displayed in a notebook.\n",
+ "\n",
+ "This notebook contains sample output to allow it to work a bit like a
reference. To run it yourself, start by using the `Kernel` → `Restart &
Clear Output` menu command to clear the sample output.\n",
+ "\n",
+ "Start by importing the `druidapi` package from the same folder as this
notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6d90ca5d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb68a838",
+ "metadata": {},
+ "source": [
+ "Next, connect to your cluster by providing the router endpoint. The code
assumes the cluster is on your local machine, using the default port. Go ahead
and change this if your setup is different.\n",
+ "\n",
+ "The API uses the router to forward messages to each of Druid's services
so that you don't have to keep track of the host and port for each service.\n",
+ "\n",
+ "The `jupyter_client()` method waits for the cluster to be ready, and sets
up the client to display tables and messages as HTML. To use this code without
waiting and without HTML formatting, use the `client()` method instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ae601081",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "druid = druidapi.jupyter_client('http://localhost:8888')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b4e774b",
+ "metadata": {},
+ "source": [
+ "## Status Client\n",
+ "\n",
+ "The SDK groups Druid REST API calls into categories, with a client for
each. Start with the status client."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff16fc3b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client = druid.status"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be992774",
+ "metadata": {},
+ "source": [
+ "Use the Python `help()` function to learn what methods are avaialble."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03f26417",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "help(status_client)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e803c9fe",
+ "metadata": {},
+ "source": [
+ "Check the version of your cluster. Some of these notebooks illustrate
newer features available only on specific versions of Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2faa0d81",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d78a6c35",
+ "metadata": {},
+ "source": [
+ "You can also check which extensions are loaded in your cluster. Some
notebooks require specific extensions to be available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1001f412",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client.properties['druid.extensions.loadList']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "769c363b",
+ "metadata": {},
+ "source": [
+ "## Display Client\n",
+ "\n",
+ "The display client performs Druid operations, then formats the results
for display in a notebook. Running SQL queries in a notebook is easy with the
display client.\n",
+ "\n",
+ "When run outside a notebook, the display client formats results as text.
The display client is the most convenient way to work with Druid in a notebook.
Most operations also have a form that returns results as Python objects rather
than displaying them. Use these methods if you write code to work with the
results. Here the goal is just to interact with Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ccb6e119",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display = druid.display"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d051bc5e",
+ "metadata": {},
+ "source": [
+ "Start by getting a list of schemas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd8387e0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.schemas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8261ab0",
+ "metadata": {},
+ "source": [
+ "Then, retreive the tables (or datasources) within any schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "64dcb46a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff311595",
+ "metadata": {},
+ "source": [
+ "The above shows the list of datasources by default. You'll get an empty
result if you have no datasources yet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "616770ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.tables()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7392e484",
+ "metadata": {},
+ "source": [
+ "You can easily run a query and show the results:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2c649eef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6c4e1d4",
+ "metadata": {},
+ "source": [
+ "The query above showed the same results as `tables()`. That is not
surprising: `tables()` just runs this query for you."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "59bee6e9",
+ "metadata": {},
+ "source": [
+ "## SQL Client\n",
+ "\n",
+ "While the display client is handy for simple queries, sometimes you need
more control, or want to work with the data returned from a query. For this you
use the SQL client."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "25310f40",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client = druid.sql"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b944084",
+ "metadata": {},
+ "source": [
+ "The SQL client allows you create a SQL request object that enables
passing context parameters and query parameters. Druid will work out the query
parameter type based on the Python type. Use the display client to show the
query results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd559827",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = ?\n",
+ "'''\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_parameter('INFORMATION_SCHEMA')\n",
+ "req.add_context(\"someParameter\", \"someValue\")\n",
+ "display.sql(req)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "937dc6b1",
+ "metadata": {},
+ "source": [
+ "The request has other features for advanced use cases: see the code for
details. The query API actually returns a sql response object. Use this if you
want to get the values directly, work with the schema, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fd7a1827",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2fe6a749",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "col1 = resp.schema[0]\n",
+ "print(col1.name, col1.sql_type, col1.druid_type)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "41d27bb1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "resp.rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "481af1f2",
+ "metadata": {},
+ "source": [
+ "The `show()` method uses this information for format an HTML table to
present the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1dbf12a8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c7c109ae",
+ "metadata": {},
+ "source": [
+ "The display and SQL clients are intened for exploratory queries. The
[pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to
run native queries, to run SQL queries, and to convert the results to various
formats."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e3be017",
+ "metadata": {},
+ "source": [
+ "## MSQ Ingestion\n",
+ "\n",
+ "The SQL client also performs MSQ-based ingestion using `INSERT` or
`REPLACE` statements. Use the extension check above to ensure that
`druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ
built in.)\n",
+ "\n",
+ "An MSQ query is run using a different API: `task()`. This API returns a
response object that describes the Overlord task which runs the MSQ query. For
tutorials, data is usually small enough you can wait for the ingestion to
complete. Do that with the `run_task()` call which handles the waiting. To
illustrate, here is a query that ingests a subset of columns, and includes a
few data clean-up steps:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "10f1e451",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " namespace,\n",
+ " page,\n",
+ " channel,\n",
+ " \"user\",\n",
+ " countryName,\n",
+ " CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
+ " \"added\",\n",
+ " \"delta\",\n",
+ " CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
+ " CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
+ " \"deleted\"\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}',\n",
+ "
'[{\"name\":\"isRobot\",\"type\":\"string\"},{\"name\":\"channel\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"flags\",\"type\":\"string\"},{\"name\":\"isUnpatrolled\",\"type\":\"string\"},{\"name\":\"page\",\"type\":\"string\"},{\"name\":\"diffUrl\",\"type\":\"string\"},{\"name\":\"added\",\"type\":\"long\"},{\"name\":\"comment\",\"type\":\"string\"},{\"name\":\"commentLength\",\"type\":\"long\"},{\"name\":\"isNew\",\"type\":\"string\"},{\"name\":\"isMinor\",\"type\":\"string\"},{\"name\":\"delta\",\"type\":\"long\"},{\"name\":\"isAnonymous\",\"type\":\"string\"},{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"deltaBucket\",\"type\":\"long\"},{\"name\":\"deleted\",\"type\":\"long\"},{\"name\":\"namespace\",\"type\":\"string\"},{\"name\":\"cityName\",\"type\":\"string\"},{\"name\":\"countryName\",\"type\":\"string\"},{\"name\":\"regionIsoCode\",\"type\":\"string\"},{\"name\":\"metroCode\",\"type\":\"long\"},{\"name\":\"countryIsoCode\",
\"type\":\"string\"},{\"name\":\"regionName\",\"type\":\"string\"}]'\n",
+ " )\n",
+ ")\n",
+ "PARTITIONED BY DAY\n",
+ "CLUSTERED BY namespace, page\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d752b1d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.run_task(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ef4512f8",
+ "metadata": {},
+ "source": [
+ "MSQ reports task completion as soon as ingestion is done. However, it
takes a while for Druid to load the resulting segments. Wait for the table to
become ready."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "37fcedf2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.wait_until_ready('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11d9c95a",
+ "metadata": {},
+ "source": [
+ "`describe_table()` lists the columns in a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b662697b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.table('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "936f57fb",
+ "metadata": {},
+ "source": [
+ "You can sample a few rows of data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c4cfa5dc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.sql('SELECT * FROM myWiki1 LIMIT 10')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1152f41",
+ "metadata": {},
+ "source": [
+ "## Datasource Client\n",
+ "\n",
+ "The Datasource client lets you perform operations on datasource objects.
The SQL layer allows you to get metadata and do queries. The datasource client
works with the underlying segments. Explaining the full functionality is the
topic of another notebook. For now, you can use the datasource client to clean
up the datasource created above. The `True` argument asks for \"if exists\"
semantics so you don't get an error if the datasource was alredy deleted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fba659ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ds_client = druid.datasources\n",
+ "ds_client.drop('myWiki', True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c96fdcc6",
+ "metadata": {},
+ "source": [
+ "## Tasks Client\n",
+ "\n",
+ "Use the tasks client to work with Overlord tasks. The `run_task()` call
above actually uses the task client internally to poll Overlord."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b4f5ea17",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "task_client = druid.tasks\n",
+ "task_client.tasks()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1deaf95f",
+ "metadata": {},
+ "source": [
+ "## REST Client\n",
+ "\n",
+ "The Druid Python API starts with a REST client that itself is built on
the `requests` package. The REST client implements the common patterns seen in
the Druid REST API. You can create a client directly:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b1e55635",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from druidapi.rest import DruidRestClient\n",
+ "rest_client = DruidRestClient(\"http://localhost:8888\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dcb8055f",
+ "metadata": {},
+ "source": [
+ "Or, if you have already created the Druid client, you can reuse the
existing REST client. This is how the various other clients work internally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "370ba76a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rest_client = druid.rest"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2654e72c",
+ "metadata": {},
+ "source": [
+ "Use the REST client if you need to make calls that are not yet wrapped by
the Python API, or if you want to do something special. To illustrate the
client, you can make some of the same calls as in the [Druid REST API
notebook](api_tutorial.ipynb).\n",
+ "\n",
+ "The REST API maintains the Druid host: you just provide the specifc URL
tail. There are methods to get or post JSON results. For example, to get status
information:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9e42dfbc",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "rest_client.get_json('/status')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "837e08b0",
+ "metadata": {},
+ "source": [
+ "A quick comparison of the three approaches (Requests, REST client, Python
client):\n",
+ "\n",
+ "Status:\n",
+ "\n",
+ "* Requests: `session.get(druid_host + '/status').json()`\n",
+ "* REST client: `rest_client.get_json('/status')`\n",
+ "* Status client: `status_client.status()`\n",
+ "\n",
+ "Health:\n",
+ "\n",
+ "* Requests: `session.get(druid_host + '/status/health').json()`\n",
+ "* REST client: `rest_client.get_json('/status/health')`\n",
+ "* Status client: `status_client.is_healthy()`\n",
+ "\n",
+ "Ingest data:\n",
+ "\n",
+ "* Requests: See the [REST tutorial](api_tutorial.ipynb)\n",
+ "* REST client: as the REST tutorial, but use
`rest_client.post_json('/druid/v2/sql/task', sql_request)` and\n",
+ "
`rest_client.get_json(f\"/druid/indexer/v1/task/{ingestion_taskId}/status\")`\n",
+ "* SQL client: `sql_client.run_task(sql)`, also a form for a full SQL
request.\n",
+ "\n",
+ "List datasources:\n",
+ "\n",
+ "* Requests: `session.get(druid_host +
'/druid/coordinator/v1/datasources').json()`\n",
+ "* REST client:
`rest_client.get_json('/druid/coordinator/v1/datasources')`\n",
+ "* Datasources client: `ds_client.names()`\n",
+ "\n",
+ "Query data, where `sql_request` is a properly-formatted `SqlResquest`
dictionary:\n",
Review Comment:
```suggestion
"Query data, where `sql_request` is a properly formatted `SqlRequest`
dictionary:\n",
```
##########
examples/quickstart/jupyter-notebooks/druidapi/rest.py:
##########
@@ -0,0 +1,265 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import requests
+from .util import dict_get
+from urllib.parse import quote
+from .error import ClientError
+
+def check_error(response):
+ '''
+ Raises an HttpError from the requests library if the response code is
neither
+ OK (200) nor Accepted (202).
+
+ Druid's REST API is inconsistent with how it resports errors. Some APIs
return
Review Comment:
```suggestion
Druid's REST API is inconsistent with how it reports errors. Some APIs
return
```
##########
examples/quickstart/jupyter-notebooks/druidapi/datasource.py:
##########
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import requests, time
+from .consts import COORD_BASE
+from .rest import check_error
+from .util import dict_get
+
+REQ_DATASOURCES = COORD_BASE + '/datasources'
+REQ_DATASOURCE = REQ_DATASOURCES + '/{}'
+
+# Segment load status
+REQ_DATASOURCES = COORD_BASE + '/datasources'
+REQ_DS_LOAD_STATUS = REQ_DATASOURCES + '/{}/loadstatus'
+
+class DatasourceClient:
+ '''
+ Client for datasource APIs. Prefer to use the SQL to query the
Review Comment:
```suggestion
Client for datasource APIs. Prefer to use SQL to query the
```
##########
examples/quickstart/jupyter-notebooks/druidapi/rest.py:
##########
@@ -0,0 +1,265 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import requests
+from .util import dict_get
+from urllib.parse import quote
+from .error import ClientError
+
+def check_error(response):
+ '''
+ Raises an HttpError from the requests library if the response code is
neither
+ OK (200) nor Accepted (202).
+
+ Druid's REST API is inconsistent with how it resports errors. Some APIs
return
+ an error as a JSON object. Others return a text message. Still others
return
+ nothing at all. With the JSON format, sometimes the error returns an
+ 'errorMessage' field, other times only a generic 'error' field.
+
+ This method attempts to parse these variations. If the error response JSON
+ matches one of the known error formats, then raises a `ClientError` with
the error
+ message. Otherise, raises a Requests library `HTTPError` for a generic
error.
+ If the response includes a JSON payload, then the it is returned in the
json field
+ of the `HTTPError` object so that the client can perhaps decode it.
+ '''
+ code = response.status_code
+ if code == requests.codes.ok or code == requests.codes.accepted:
+ return
+ json = None
+ try:
+ json = response.json()
+ except Exception:
+ # If we can't get the JSON, raise a Requets error
+ response.raise_for_status()
+
+ # Druid JSON payload. Try to make sense of the error
+ msg = dict_get(json, 'errorMessage')
+ if not msg:
+ msg = dict_get(json, 'error')
+ if msg:
+ # We have an explanation from Druid. Raise a Client exception
+ raise ClientError(msg)
+
+ # Don't know what the Druid JSON is. Raise a Requetss exception, but
+ # add on the JSON in the hopes that the caller can make use of it.
+ try:
+ response.raise_for_status()
+ except Exception as e:
+ e.json = json
+ raise e
+
+def build_url(endpoint, req, args=None) -> str:
+ '''
+ Returns the full URL for a REST call given the relative request API and
+ optional parameters to fill placeholders within the request URL.
+
+ Parameters
+ ----------
+ endpoint: str
+ The base URL for the service.
+
+ req: str
+ Relative URL, with optional {} placeholders
+
+ args: list
+ Optional list of values to match {} placeholders in the URL.
+ '''
+ url = endpoint + req
+ if args:
+ quoted = [quote(arg) for arg in args]
+ url = url.format(*quoted)
+ return url
+
+class DruidRestClient:
+ '''
+ Wrapper around the basic Druid REST API operations using the
+ requests Python package. Handles the grunt work of building up
+ URLs, working with JSON, etc.
+
+ The REST client accepts an endpoint that represents a Druid service,
typically
+ the Router. All requests are made to this service, which means using the
service
+ URL as the base. That is, if the service is http://localhost:8888, then
+ a request for status is just '/status': the methods here build up the URL
by
+ concatenating the service endpoint with the request URL.
+ '''
+
+ def __init__(self, endpoint):
+ self.endpoint = endpoint
+ self.trace = False
+ self.session = requests.Session()
+
+ def enable_trace(self, flag=True):
+ self.trace = flag
+
+ def build_url(self, req, args=None) -> str:
+ '''
+ Returns the full URL for a REST call given the relative request API and
+ optional parameters to fill placeholders within the request URL.
+
+ Parameters
+ ----------
+ req: str
+ Relative URL, with optional {} placeholders
+
+ args: list
+ Optional list of values to match {} placeholders in the URL.
+ '''
+ return build_url(self.endpoint, req, args)
+
+ def get(self, req, args=None, params=None, require_ok=True) ->
requests.Request:
+ '''
+ Generic GET request to this service.
+
+ Parameters
+ ----------
+ req: str
+ The request URL without host, port or query string.
+ Example: `/status`
+
+ args: [str], default = None
+ Optional parameters to fill in to the URL.
+ Example: `/customer/{}`
+
+ params: dict, default = None
+ Optional map of query variables to send in
+ the URL. Query parameters are the name/values pairs
+ that appear after the `?` marker.
+
+ require_ok: bool, default = True
+ Whether to require an OK (200) response. If `True`, and
+ the request returns a different response code, then raises
+ a `RestError` exception.
+
+ Returns
+ -------
+ The `requests` `Request` object.
+ '''
+ url = self.build_url(req, args)
+ if self.trace:
+ print('GET:', url)
+ r = self.session.get(url, params=params)
+ if require_ok:
+ check_error(r)
+ return r
+
+ def get_json(self, url_tail, args=None, params=None):
+ '''
+ Generic GET request which expects a JSON response.
+ '''
+ r = self.get(url_tail, args, params)
+ return r.json()
+
+ def post(self, req, body, args=None, headers=None, require_ok=True) ->
requests.Response:
+ '''
+ Issues a POST request for the given URL on this
+ node, with the given payload and optional URL query
+ parameters.
+ '''
+ url = self.build_url(req, args)
+ if self.trace:
+ print('POST:', url)
+ print('body:', body)
+ r = self.session.post(url, data=body, headers=headers)
+ if require_ok:
+ check_error(r)
+ return r
+
+ def post_json(self, req, body, args=None, headers=None, params=None) ->
requests.Response:
+ '''
+ Issues a POST request for the given URL on this node, with a JSON
request, returning
+ the JSON response.
Review Comment:
```suggestion
Issues a POST request for the given URL on this node, with a JSON
request payload.
Returns the JSON response.
```
##########
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb:
##########
@@ -0,0 +1,751 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ce2efaaa",
+ "metadata": {},
+ "source": [
+ "# Learn the Druid Python API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This notebook provides a quick introduction to the Python wrapper around
the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are
familiar with the basics of the REST API, and the [set of operations which
Druid
provides](https://druid.apache.org/docs/latest/operations/api-reference.html).
This tutorial focuses on using Python to access those APIs rather than
explaining the APIs themselves. The APIs themselves are covered in other
notebooks that use the Python API.\n",
+ "\n",
+ "The Druid Python API is primarily intended to help with these notebook
tutorials. It can also be used in your own ad-hoc notebooks, or in a regular
Python program.\n",
+ "\n",
+ "The Druid Python API is a work in progress. The Druid team adds API
wrappers as needed for the notebook tutorials. If you find you need additional
wrappers, please feel free to add them, and post a PR to Apache Druid with your
additions.\n",
+ "\n",
+ "The API provides two levels of functions. Most are simple wrappers around
Druid's REST APIs. Others add additional code to make the API easier to use.
The SQL query interface is a prime example: extra code translates a simple SQL
query into Druid's `SQLQuery` object and interprets the results into a form
that can be displayed in a notebook.\n",
+ "\n",
+ "This notebook contains sample output to allow it to work a bit like a
reference. To run it yourself, start by using the `Kernel` → `Restart &
Clear Output` menu command to clear the sample output.\n",
Review Comment:
```suggestion
"This notebook contains sample output to allow it to function as a
reference. To run it yourself, start by using the `Kernel` → `Restart &
Clear Output` menu command to clear the sample output.\n",
```
##########
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb:
##########
@@ -0,0 +1,751 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ce2efaaa",
+ "metadata": {},
+ "source": [
+ "# Learn the Druid Python API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This notebook provides a quick introduction to the Python wrapper around
the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are
familiar with the basics of the REST API, and the [set of operations which
Druid
provides](https://druid.apache.org/docs/latest/operations/api-reference.html).
This tutorial focuses on using Python to access those APIs rather than
explaining the APIs themselves. The APIs themselves are covered in other
notebooks that use the Python API.\n",
+ "\n",
+ "The Druid Python API is primarily intended to help with these notebook
tutorials. It can also be used in your own ad-hoc notebooks, or in a regular
Python program.\n",
+ "\n",
+ "The Druid Python API is a work in progress. The Druid team adds API
wrappers as needed for the notebook tutorials. If you find you need additional
wrappers, please feel free to add them, and post a PR to Apache Druid with your
additions.\n",
+ "\n",
+ "The API provides two levels of functions. Most are simple wrappers around
Druid's REST APIs. Others add additional code to make the API easier to use.
The SQL query interface is a prime example: extra code translates a simple SQL
query into Druid's `SQLQuery` object and interprets the results into a form
that can be displayed in a notebook.\n",
+ "\n",
+ "This notebook contains sample output to allow it to work a bit like a
reference. To run it yourself, start by using the `Kernel` → `Restart &
Clear Output` menu command to clear the sample output.\n",
+ "\n",
+ "Start by importing the `druidapi` package from the same folder as this
notebook."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6d90ca5d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb68a838",
+ "metadata": {},
+ "source": [
+ "Next, connect to your cluster by providing the router endpoint. The code
assumes the cluster is on your local machine, using the default port. Go ahead
and change this if your setup is different.\n",
+ "\n",
+ "The API uses the router to forward messages to each of Druid's services
so that you don't have to keep track of the host and port for each service.\n",
+ "\n",
+ "The `jupyter_client()` method waits for the cluster to be ready, and sets
up the client to display tables and messages as HTML. To use this code without
waiting and without HTML formatting, use the `client()` method instead."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ae601081",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "druid = druidapi.jupyter_client('http://localhost:8888')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b4e774b",
+ "metadata": {},
+ "source": [
+ "## Status Client\n",
+ "\n",
+ "The SDK groups Druid REST API calls into categories, with a client for
each. Start with the status client."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ff16fc3b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client = druid.status"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be992774",
+ "metadata": {},
+ "source": [
+ "Use the Python `help()` function to learn what methods are avaialble."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03f26417",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "help(status_client)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e803c9fe",
+ "metadata": {},
+ "source": [
+ "Check the version of your cluster. Some of these notebooks illustrate
newer features available only on specific versions of Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2faa0d81",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d78a6c35",
+ "metadata": {},
+ "source": [
+ "You can also check which extensions are loaded in your cluster. Some
notebooks require specific extensions to be available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1001f412",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client.properties['druid.extensions.loadList']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "769c363b",
+ "metadata": {},
+ "source": [
+ "## Display Client\n",
+ "\n",
+ "The display client performs Druid operations, then formats the results
for display in a notebook. Running SQL queries in a notebook is easy with the
display client.\n",
+ "\n",
+ "When run outside a notebook, the display client formats results as text.
The display client is the most convenient way to work with Druid in a notebook.
Most operations also have a form that returns results as Python objects rather
than displaying them. Use these methods if you write code to work with the
results. Here the goal is just to interact with Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ccb6e119",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display = druid.display"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d051bc5e",
+ "metadata": {},
+ "source": [
+ "Start by getting a list of schemas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd8387e0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.schemas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8261ab0",
+ "metadata": {},
+ "source": [
+ "Then, retreive the tables (or datasources) within any schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "64dcb46a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff311595",
+ "metadata": {},
+ "source": [
+ "The above shows the list of datasources by default. You'll get an empty
result if you have no datasources yet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "616770ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.tables()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7392e484",
+ "metadata": {},
+ "source": [
+ "You can easily run a query and show the results:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2c649eef",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "display.sql(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6c4e1d4",
+ "metadata": {},
+ "source": [
+ "The query above showed the same results as `tables()`. That is not
surprising: `tables()` just runs this query for you."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "59bee6e9",
+ "metadata": {},
+ "source": [
+ "## SQL Client\n",
+ "\n",
+ "While the display client is handy for simple queries, sometimes you need
more control, or want to work with the data returned from a query. For this you
use the SQL client."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "25310f40",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client = druid.sql"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b944084",
+ "metadata": {},
+ "source": [
+ "The SQL client allows you create a SQL request object that enables
passing context parameters and query parameters. Druid will work out the query
parameter type based on the Python type. Use the display client to show the
query results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd559827",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = ?\n",
+ "'''\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_parameter('INFORMATION_SCHEMA')\n",
+ "req.add_context(\"someParameter\", \"someValue\")\n",
+ "display.sql(req)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "937dc6b1",
+ "metadata": {},
+ "source": [
+ "The request has other features for advanced use cases: see the code for
details. The query API actually returns a sql response object. Use this if you
want to get the values directly, work with the schema, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fd7a1827",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2fe6a749",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "col1 = resp.schema[0]\n",
+ "print(col1.name, col1.sql_type, col1.druid_type)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "41d27bb1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "resp.rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "481af1f2",
+ "metadata": {},
+ "source": [
+ "The `show()` method uses this information for format an HTML table to
present the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1dbf12a8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c7c109ae",
+ "metadata": {},
+ "source": [
+ "The display and SQL clients are intened for exploratory queries. The
[pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to
run native queries, to run SQL queries, and to convert the results to various
formats."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e3be017",
+ "metadata": {},
+ "source": [
+ "## MSQ Ingestion\n",
+ "\n",
+ "The SQL client also performs MSQ-based ingestion using `INSERT` or
`REPLACE` statements. Use the extension check above to ensure that
`druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ
built in.)\n",
+ "\n",
+ "An MSQ query is run using a different API: `task()`. This API returns a
response object that describes the Overlord task which runs the MSQ query. For
tutorials, data is usually small enough you can wait for the ingestion to
complete. Do that with the `run_task()` call which handles the waiting. To
illustrate, here is a query that ingests a subset of columns, and includes a
few data clean-up steps:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "10f1e451",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " namespace,\n",
+ " page,\n",
+ " channel,\n",
+ " \"user\",\n",
+ " countryName,\n",
+ " CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
+ " \"added\",\n",
+ " \"delta\",\n",
+ " CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
+ " CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
+ " \"deleted\"\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}',\n",
+ "
'[{\"name\":\"isRobot\",\"type\":\"string\"},{\"name\":\"channel\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"flags\",\"type\":\"string\"},{\"name\":\"isUnpatrolled\",\"type\":\"string\"},{\"name\":\"page\",\"type\":\"string\"},{\"name\":\"diffUrl\",\"type\":\"string\"},{\"name\":\"added\",\"type\":\"long\"},{\"name\":\"comment\",\"type\":\"string\"},{\"name\":\"commentLength\",\"type\":\"long\"},{\"name\":\"isNew\",\"type\":\"string\"},{\"name\":\"isMinor\",\"type\":\"string\"},{\"name\":\"delta\",\"type\":\"long\"},{\"name\":\"isAnonymous\",\"type\":\"string\"},{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"deltaBucket\",\"type\":\"long\"},{\"name\":\"deleted\",\"type\":\"long\"},{\"name\":\"namespace\",\"type\":\"string\"},{\"name\":\"cityName\",\"type\":\"string\"},{\"name\":\"countryName\",\"type\":\"string\"},{\"name\":\"regionIsoCode\",\"type\":\"string\"},{\"name\":\"metroCode\",\"type\":\"long\"},{\"name\":\"countryIsoCode\",
\"type\":\"string\"},{\"name\":\"regionName\",\"type\":\"string\"}]'\n",
+ " )\n",
+ ")\n",
+ "PARTITIONED BY DAY\n",
+ "CLUSTERED BY namespace, page\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d752b1d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.run_task(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ef4512f8",
+ "metadata": {},
+ "source": [
+ "MSQ reports task completion as soon as ingestion is done. However, it
takes a while for Druid to load the resulting segments. Wait for the table to
become ready."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "37fcedf2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.wait_until_ready('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11d9c95a",
+ "metadata": {},
+ "source": [
+ "`describe_table()` lists the columns in a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b662697b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.table('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "936f57fb",
+ "metadata": {},
+ "source": [
+ "You can sample a few rows of data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c4cfa5dc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "display.sql('SELECT * FROM myWiki1 LIMIT 10')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1152f41",
+ "metadata": {},
+ "source": [
+ "## Datasource Client\n",
+ "\n",
+ "The Datasource client lets you perform operations on datasource objects.
The SQL layer allows you to get metadata and do queries. The datasource client
works with the underlying segments. Explaining the full functionality is the
topic of another notebook. For now, you can use the datasource client to clean
up the datasource created above. The `True` argument asks for \"if exists\"
semantics so you don't get an error if the datasource was alredy deleted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fba659ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ds_client = druid.datasources\n",
+ "ds_client.drop('myWiki', True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c96fdcc6",
+ "metadata": {},
+ "source": [
+ "## Tasks Client\n",
+ "\n",
+ "Use the tasks client to work with Overlord tasks. The `run_task()` call
above actually uses the task client internally to poll Overlord."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b4f5ea17",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "task_client = druid.tasks\n",
+ "task_client.tasks()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1deaf95f",
+ "metadata": {},
+ "source": [
+ "## REST Client\n",
+ "\n",
+ "The Druid Python API starts with a REST client that itself is built on
the `requests` package. The REST client implements the common patterns seen in
the Druid REST API. You can create a client directly:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b1e55635",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from druidapi.rest import DruidRestClient\n",
+ "rest_client = DruidRestClient(\"http://localhost:8888\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dcb8055f",
+ "metadata": {},
+ "source": [
+ "Or, if you have already created the Druid client, you can reuse the
existing REST client. This is how the various other clients work internally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "370ba76a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rest_client = druid.rest"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2654e72c",
+ "metadata": {},
+ "source": [
+ "Use the REST client if you need to make calls that are not yet wrapped by
the Python API, or if you want to do something special. To illustrate the
client, you can make some of the same calls as in the [Druid REST API
notebook](api_tutorial.ipynb).\n",
+ "\n",
+ "The REST API maintains the Druid host: you just provide the specifc URL
tail. There are methods to get or post JSON results. For example, to get status
information:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9e42dfbc",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "rest_client.get_json('/status')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "837e08b0",
+ "metadata": {},
+ "source": [
+ "A quick comparison of the three approaches (Requests, REST client, Python
client):\n",
+ "\n",
+ "Status:\n",
+ "\n",
+ "* Requests: `session.get(druid_host + '/status').json()`\n",
+ "* REST client: `rest_client.get_json('/status')`\n",
+ "* Status client: `status_client.status()`\n",
+ "\n",
+ "Health:\n",
+ "\n",
+ "* Requests: `session.get(druid_host + '/status/health').json()`\n",
+ "* REST client: `rest_client.get_json('/status/health')`\n",
+ "* Status client: `status_client.is_healthy()`\n",
+ "\n",
+ "Ingest data:\n",
+ "\n",
+ "* Requests: See the [REST tutorial](api_tutorial.ipynb)\n",
+ "* REST client: as the REST tutorial, but use
`rest_client.post_json('/druid/v2/sql/task', sql_request)` and\n",
+ "
`rest_client.get_json(f\"/druid/indexer/v1/task/{ingestion_taskId}/status\")`\n",
+ "* SQL client: `sql_client.run_task(sql)`, also a form for a full SQL
request.\n",
+ "\n",
+ "List datasources:\n",
+ "\n",
+ "* Requests: `session.get(druid_host +
'/druid/coordinator/v1/datasources').json()`\n",
+ "* REST client:
`rest_client.get_json('/druid/coordinator/v1/datasources')`\n",
+ "* Datasources client: `ds_client.names()`\n",
+ "\n",
+ "Query data, where `sql_request` is a properly-formatted `SqlResquest`
dictionary:\n",
+ "\n",
+ "* Requests: `session.post(druid_host + '/druid/v2/sql',
json=sql_request).json()`\n",
+ "* REST client: `rest_client.post_json('/druid/v2/sql', sql_request)`\n",
+ "* SQL Client: `sql_client.show(sql)`, where `sql` is the query text\n",
+ "\n",
+ "In general, you have to provide the all the details for the Requests
library. The REST client handles the low-level repetitious bits. The Python
clients provide methods that encapsulate the specifics of the URLS and return
formats."
Review Comment:
```suggestion
"In general, you have to provide the all the details for the Requests
library. The REST client handles the low-level repetitious bits. The Python
clients provide methods that encapsulate the specifics of the URLs and return
formats."
```
##########
examples/quickstart/jupyter-notebooks/druidapi/rest.py:
##########
@@ -0,0 +1,265 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import requests
+from .util import dict_get
+from urllib.parse import quote
+from .error import ClientError
+
+def check_error(response):
+ '''
+ Raises an HttpError from the requests library if the response code is
neither
+ OK (200) nor Accepted (202).
+
+ Druid's REST API is inconsistent with how it resports errors. Some APIs
return
+ an error as a JSON object. Others return a text message. Still others
return
+ nothing at all. With the JSON format, sometimes the error returns an
+ 'errorMessage' field, other times only a generic 'error' field.
+
+ This method attempts to parse these variations. If the error response JSON
+ matches one of the known error formats, then raises a `ClientError` with
the error
+ message. Otherise, raises a Requests library `HTTPError` for a generic
error.
+ If the response includes a JSON payload, then the it is returned in the
json field
+ of the `HTTPError` object so that the client can perhaps decode it.
+ '''
+ code = response.status_code
+ if code == requests.codes.ok or code == requests.codes.accepted:
+ return
+ json = None
+ try:
+ json = response.json()
+ except Exception:
+ # If we can't get the JSON, raise a Requets error
Review Comment:
```suggestion
# If we can't get the JSON, raise a Requests error
```
##########
examples/quickstart/jupyter-notebooks/druidapi/sql.py:
##########
@@ -134,21 +134,15 @@ def parse_rows(fmt, context, results):
return rows[header_size:]
def label_non_null_cols(results):
- if results is None or len(results) == 0:
+ if not results:
return []
is_null = {}
for key in results[0].keys():
is_null[key] = True
for row in results:
for key, value in row.items():
- if type(value) == str:
- if value != '':
- is_null[key] = False
- elif type(value) == float:
- if value != 0.0:
- is_null[key] = False
- elif value is not None:
- is_null[key] = False
+ # The following is hack to check for null values, empty strings
and numeric 0s.
+ is_null[key] = not not value
Review Comment:
neat 😮
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]