techdocsmith commented on code in PR #13787:
URL: https://github.com/apache/druid/pull/13787#discussion_r1116368644
##########
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb:
##########
@@ -0,0 +1,1281 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ce2efaaa",
+ "metadata": {},
+ "source": [
+ "# Learn the Druid Python API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This notebook provides a quick introduction to the Python wrapper around
the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are
familiar with the basics of the REST API, and the [set of operations which
Druid
provides](https://druid.apache.org/docs/latest/operations/api-reference.html).
This tutorial focuses on using Python to access those APIs rather than
explaining the APIs themselves. The APIs themselves are covered in other
notebooks that use the Python API.\n",
+ "\n",
+ "The Druid Python API is primarily intended to help with these notebook
tutorials. It can also be used in your own ad-hoc notebooks, or in a regular
Python program.\n",
+ "\n",
+ "The Druid Python API is a work in progress. The Druid team adds API
wrappers as needed for the notebook tutorials. If you find you need additional
wrappers, please feel free to add them, and post a PR to Apache Druid with your
additions.\n",
+ "\n",
+ "The API provides two levels of functions. Most are simple wrappers around
Druid's REST APIs. Others add additional code to make the API easier to use.
The SQL query interface is a prime example: extra code translates a simple SQL
query into Druid's `SQLQuery` object and interprets the results into a form
that can be displayed in a notebook.\n",
+ "\n",
+ "Start by importing the `druidapi` package from the same folder as this
notebook. The `styles()` call adds some CSS styles needed to display results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "6d90ca5d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "<style>\n",
+ " .druid table {\n",
+ " border: 1px solid black;\n",
+ " border-collapse: collapse;\n",
+ " }\n",
+ "\n",
+ " .druid th, .druid td {\n",
+ " padding: 4px 1em ;\n",
+ " text-align: left;\n",
+ " }\n",
+ "\n",
+ " td.druid-right, th.druid-right {\n",
+ " text-align: right;\n",
+ " }\n",
+ "\n",
+ " td.druid-center, th.druid-center {\n",
+ " text-align: center;\n",
+ " }\n",
+ "\n",
+ " .druid .druid-left {\n",
+ " text-align: left;\n",
+ " }\n",
+ "\n",
+ " .druid-alert {\n",
+ " color: red;\n",
+ " }\n",
+ "</style>\n"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "<style>\n",
+ " .druid table {\n",
+ " border: 1px solid black;\n",
+ " border-collapse: collapse;\n",
+ " }\n",
+ "\n",
+ " .druid th, .druid td {\n",
+ " padding: 4px 1em ;\n",
+ " text-align: left;\n",
+ " }\n",
+ "\n",
+ " td.druid-right, th.druid-right {\n",
+ " text-align: right;\n",
+ " }\n",
+ "\n",
+ " td.druid-center, th.druid-center {\n",
+ " text-align: center;\n",
+ " }\n",
+ "\n",
+ " .druid .druid-left {\n",
+ " text-align: left;\n",
+ " }\n",
+ "\n",
+ " .druid-alert {\n",
+ " color: red;\n",
+ " }\n",
+ "</style>\n"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "import druidapi\n",
+ "druidapi.styles()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fb68a838",
+ "metadata": {},
+ "source": [
+ "Next, connect to your cluster by providing the router endpoint. The code
assumes the cluster is on your local machine, using the default port. Go ahead
and change this if your setup is different.\n",
+ "\n",
+ "The API uses the router to forward messages to each of Druid's services
so that you don't have to keep track of the host and port for each service."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "ae601081",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "druid = druidapi.client('http://localhost:8888')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b4e774b",
+ "metadata": {},
+ "source": [
+ "## Status Client\n",
+ "\n",
+ "The SDK groups Druid REST API calls into categories, with a client for
each. Start with the status client."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "ff16fc3b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client = druid.status()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be992774",
+ "metadata": {},
+ "source": [
+ "Use the Python `help()` function to learn what methods are available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "03f26417",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Help on StatusClient in module druidapi.status object:\n",
+ "\n",
+ "class StatusClient(builtins.object)\n",
+ " | StatusClient(rest_client)\n",
+ " | \n",
+ " | Client for status APIs. These APIs are available on all nodes.\n",
+ " | If used with the router, they report the status of just the
router.\n",
+ " | \n",
+ " | Methods defined here:\n",
+ " | \n",
+ " | __init__(self, rest_client)\n",
+ " | Initialize self. See help(type(self)) for accurate
signature.\n",
+ " | \n",
+ " | brokers(self)\n",
+ " | \n",
+ " | in_cluster(self)\n",
+ " | Returns `True` if the node is visible wihtin the cluster,
`False` if not.\n",
+ " | (That is, returns the value of the `{\"selfDiscovered\":
true/false}`\n",
+ " | field in the response.\n",
+ " | \n",
+ " | GET `/status/selfDiscovered/status`\n",
+ " | \n",
+ " | See
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+ " | \n",
+ " | is_healthy(self) -> bool\n",
+ " | Returns `True` if the node is healthy, an exception
otherwise.\n",
+ " | Useful for automated health checks.\n",
+ " | \n",
+ " | GET `/status/health`\n",
+ " | \n",
+ " | See
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+ " | \n",
+ " | properties(self) -> map\n",
+ " | Returns the effective set of Java properties used by the
service, including\n",
+ " | system properties and properties from the
`common_runtime.propeties` and\n",
+ " | `runtime.properties` files.\n",
+ " | \n",
+ " | GET `/status/properties`\n",
+ " | \n",
+ " | See
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+ " | \n",
+ " | status(self)\n",
+ " | Returns the Druid version, loaded extensions, memory used,
total memory \n",
+ " | and other useful information about the process.\n",
+ " | \n",
+ " | GET `/status`\n",
+ " | \n",
+ " | See
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+ " | \n",
+ " | version(self)\n",
+ " | \n",
+ " | wait_until_ready(self)\n",
+ " | \n",
+ " |
----------------------------------------------------------------------\n",
+ " | Data descriptors defined here:\n",
+ " | \n",
+ " | __dict__\n",
+ " | dictionary for instance variables (if defined)\n",
+ " | \n",
+ " | __weakref__\n",
+ " | list of weak references to the object (if defined)\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "help(status_client)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "70f3d578",
+ "metadata": {},
+ "source": [
+ "Druid servers return unexpected results if you make REST calls while
Druid starts up. The following will run until the server is ready. If you
forgot to start your server, or the URL above is wrong, this will hang forever.
Use the Kernel → Interrupt command to break out of the function. (Or,
start your server. If your server refuses to start, then this Jupyter Notebook
may be running on port 8888. See the [README](README.md) for how to start on a
different port.)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "114ed0d1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "status_client.wait_until_ready()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e803c9fe",
+ "metadata": {},
+ "source": [
+ "Check the version of your cluster. Some of these notebooks illustrate
newer features available only on specific versions of Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "2faa0d81",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'26.0.0-SNAPSHOT'"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "status_client.version()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d78a6c35",
+ "metadata": {},
+ "source": [
+ "You can also check which extensions are loaded in your cluster. Some
notebooks require specific extensions to be available."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "1001f412",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "'[\"druid-hdfs-storage\", \"druid-kafka-indexing-service\",
\"druid-datasketches\", \"druid-multi-stage-query\",
\"druid-lookups-cached-global\", \"druid-catalog\"]'"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "status_client.properties()['druid.extensions.loadList']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8825ca39",
+ "metadata": {},
+ "source": [
+ "## SQL Client\n",
+ "\n",
+ "Running SQL queries in a notebook is easy. Here is an example of how to
run a query and display results. The
[pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to
run native queries, to run SQL queries, and to convert the results to various
formats. Here the goal is just to interact with Druid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "6be0c745",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client = druid.sql()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d051bc5e",
+ "metadata": {},
+ "source": [
+ "Start by getting a list of schemas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "dd8387e0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>SchemaName</th></tr>\n",
+ "<tr><td>INFORMATION_SCHEMA</td></tr>\n",
+ "<tr><td>druid</td></tr>\n",
+ "<tr><td>ext</td></tr>\n",
+ "<tr><td>lookup</td></tr>\n",
+ "<tr><td>sys</td></tr>\n",
+ "<tr><td>view</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql_client.show_schemas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8261ab0",
+ "metadata": {},
+ "source": [
+ "Then, retrieve the tables (or datasources) within any schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "64dcb46a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>TableName</th></tr>\n",
+ "<tr><td>COLUMNS</td></tr>\n",
+ "<tr><td>PARAMETERS</td></tr>\n",
+ "<tr><td>SCHEMATA</td></tr>\n",
+ "<tr><td>TABLES</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql_client.show_tables('INFORMATION_SCHEMA')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff311595",
+ "metadata": {},
+ "source": [
+ "By default, `show_tables()` lists your datasources. You'll get an empty
result if you have no datasources yet."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "616770ce",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>TableName</th></tr>\n",
+ "<tr><td>myWiki</td></tr>\n",
+ "<tr><td>myWiki3</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql_client.show_tables()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7392e484",
+ "metadata": {},
+ "source": [
+ "You can easily run a query and show the results:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "2c649eef",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>TABLE_NAME</th></tr>\n",
+ "<tr><td>COLUMNS</td></tr>\n",
+ "<tr><td>PARAMETERS</td></tr>\n",
+ "<tr><td>SCHEMATA</td></tr>\n",
+ "<tr><td>TABLES</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "sql_client.show(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c6c4e1d4",
+ "metadata": {},
+ "source": [
+ "The query above showed the same results as `show_tables()`. That is not
surprising: `show_tables()` just runs this query for you."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b944084",
+ "metadata": {},
+ "source": [
+ "The API also allows passing context parameters and query parameters using
a request object. Druid will work out the query parameter type based on the
Python type. Pass context values as a Python `dict`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "dd559827",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>TABLE_NAME</th></tr>\n",
+ "<tr><td>COLUMNS</td></tr>\n",
+ "<tr><td>PARAMETERS</td></tr>\n",
+ "<tr><td>SCHEMATA</td></tr>\n",
+ "<tr><td>TABLES</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = ?\n",
+ "'''\n",
+ "req = sql_client.sql_request(sql)\n",
+ "req.add_parameter('INFORMATION_SCHEMA')\n",
+ "req.with_context({\"someParameter\": \"someValue\"})\n",
+ "sql_client.show(req)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "937dc6b1",
+ "metadata": {},
+ "source": [
+ "The request has other features for advanced use cases: see the code for
details. The query API actually returns a SQL response object. Use this if you
want to get the values directly, work with the schema, etc."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "fd7a1827",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME\n",
+ "FROM INFORMATION_SCHEMA.TABLES\n",
+ "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "2fe6a749",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "TABLE_NAME VARCHAR string\n"
+ ]
+ }
+ ],
+ "source": [
+ "col1 = resp.schema()[0]\n",
+ "print(col1.name, col1.sql_type, col1.druid_type)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "41d27bb1",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'TABLE_NAME': 'COLUMNS'},\n",
+ " {'TABLE_NAME': 'PARAMETERS'},\n",
+ " {'TABLE_NAME': 'SCHEMATA'},\n",
+ " {'TABLE_NAME': 'TABLES'}]"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "resp.rows()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "481af1f2",
+ "metadata": {},
+ "source": [
+ "The `show()` method uses this information to format an HTML table to
present the results."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9e3be017",
+ "metadata": {},
+ "source": [
+ "## MSQ Ingestion\n",
+ "\n",
+ "The SQL client also performs MSQ-based ingestion using `INSERT` or
`REPLACE` statements. Use the extension check above to ensure that
`druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ
built in.)\n",
+ "\n",
+ "An MSQ query is run using a different API: `task()`. This API returns a
response object that describes the Overlord task which runs the MSQ query. For
tutorials, data is usually small enough you can wait for the ingestion to
complete. Do that with the `run_task()` call which handles the waiting. To
illustrate, here is a query that ingests a subset of columns, and includes a
few data clean-up steps:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "10f1e451",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
+ "SELECT\n",
+ " TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " namespace,\n",
+ " page,\n",
+ " channel,\n",
+ " \"user\",\n",
+ " countryName,\n",
+ " CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
+ " \"added\",\n",
+ " \"delta\",\n",
+ " CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
+ " CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
+ " \"deleted\"\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}',\n",
+ "
'[{\"name\":\"isRobot\",\"type\":\"string\"},{\"name\":\"channel\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"flags\",\"type\":\"string\"},{\"name\":\"isUnpatrolled\",\"type\":\"string\"},{\"name\":\"page\",\"type\":\"string\"},{\"name\":\"diffUrl\",\"type\":\"string\"},{\"name\":\"added\",\"type\":\"long\"},{\"name\":\"comment\",\"type\":\"string\"},{\"name\":\"commentLength\",\"type\":\"long\"},{\"name\":\"isNew\",\"type\":\"string\"},{\"name\":\"isMinor\",\"type\":\"string\"},{\"name\":\"delta\",\"type\":\"long\"},{\"name\":\"isAnonymous\",\"type\":\"string\"},{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"deltaBucket\",\"type\":\"long\"},{\"name\":\"deleted\",\"type\":\"long\"},{\"name\":\"namespace\",\"type\":\"string\"},{\"name\":\"cityName\",\"type\":\"string\"},{\"name\":\"countryName\",\"type\":\"string\"},{\"name\":\"regionIsoCode\",\"type\":\"string\"},{\"name\":\"metroCode\",\"type\":\"long\"},{\"name\":\"countryIsoCode\",
\"type\":\"string\"},{\"name\":\"regionName\",\"type\":\"string\"}]'\n",
+ " )\n",
+ ")\n",
+ "PARTITIONED BY DAY\n",
+ "CLUSTERED BY namespace, page\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "d752b1d4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.run_task(sql)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ef4512f8",
+ "metadata": {},
+ "source": [
+ "MSQ reports task completion as soon as ingestion is done. However, it
takes a while for Druid to load the resulting segments. Wait for the table to
become ready."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "id": "37fcedf2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_client.wait_until_ready('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "11d9c95a",
+ "metadata": {},
+ "source": [
+ "`describe_table()` lists the columns in a table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "id": "b662697b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+ "<tr><th>Position</th><th>Name</th><th>Type</th></tr>\n",
+ "<tr><td>1</td><td>__time</td><td>TIMESTAMP</td></tr>\n",
+ "<tr><td>2</td><td>namespace</td><td>VARCHAR</td></tr>\n",
+ "<tr><td>3</td><td>page</td><td>VARCHAR</td></tr>\n",
+ "<tr><td>4</td><td>channel</td><td>VARCHAR</td></tr>\n",
+ "<tr><td>5</td><td>user</td><td>VARCHAR</td></tr>\n",
+ "<tr><td>6</td><td>countryName</td><td>VARCHAR</td></tr>\n",
+ "<tr><td>7</td><td>isRobot</td><td>BIGINT</td></tr>\n",
+ "<tr><td>8</td><td>added</td><td>BIGINT</td></tr>\n",
+ "<tr><td>9</td><td>delta</td><td>BIGINT</td></tr>\n",
+ "<tr><td>10</td><td>isNew</td><td>BIGINT</td></tr>\n",
+ "<tr><td>11</td><td>deltaBucket</td><td>DOUBLE</td></tr>\n",
+ "<tr><td>12</td><td>deleted</td><td>BIGINT</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql_client.describe_table('myWiki1')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "936f57fb",
+ "metadata": {},
+ "source": [
+ "You can sample a few rows of data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "c4cfa5dc",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div class=\"druid\"><table>\n",
+
"<tr><th>__time</th><th>namespace</th><th>page</th><th>channel</th><th>user</th><th>countryName</th><th>isRobot</th><th>added</th><th>delta</th><th>isNew</th><th>deltaBucket</th><th>deleted</th></tr>\n",
+ "<tr><td>2016-06-27T00:00:11.080Z</td><td>Main</td><td>Salo
Toraut</td><td>#sv.wikipedia</td><td>Lsjbot</td><td></td><td>1</td><td>31</td><td>31</td><td>1</td><td>0.0</td><td>0</td></tr>\n",
+
"<tr><td>2016-06-27T00:00:17.457Z</td><td>利用者</td><td>利用者:ワーナー成増/放送ウーマン賞</td><td>#ja.wikipedia</td><td>ワーナー成増</td><td></td><td>0</td><td>125</td><td>125</td><td>0</td><td>100.0</td><td>0</td></tr>\n",
+ "<tr><td>2016-06-27T00:00:34.959Z</td><td>Main</td><td>Bailando
2015</td><td>#en.wikipedia</td><td>181.230.118.178</td><td>Argentina</td><td>0</td><td>2</td><td>2</td><td>0</td><td>0.0</td><td>0</td></tr>\n",
+ "<tr><td>2016-06-27T00:00:36.027Z</td><td>Main</td><td>Richie
Rich's Christmas
Wish</td><td>#en.wikipedia</td><td>JasonAQuest</td><td></td><td>0</td><td>0</td><td>-2</td><td>0</td><td>-100.0</td><td>2</td></tr>\n",
+ "<tr><td>2016-06-27T00:00:46.874Z</td><td>Main</td><td>El Olivo,
Ascensión</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+ "<tr><td>2016-06-27T00:00:56.913Z</td><td>Main</td><td>Blowback
(intelligence)</td><td>#en.wikipedia</td><td>Brokenshardz</td><td></td><td>0</td><td>76</td><td>76</td><td>0</td><td>0.0</td><td>0</td></tr>\n",
+
"<tr><td>2016-06-27T00:00:58.599Z</td><td>Kategoria</td><td>Kategoria:Dyskusje
nad usunięciem artykułu zakończone bez konsensusu − lipiec
2016</td><td>#pl.wikipedia</td><td>Beau.bot</td><td></td><td>1</td><td>270</td><td>270</td><td>1</td><td>200.0</td><td>0</td></tr>\n",
+ "<tr><td>2016-06-27T00:01:01.364Z</td><td>Main</td><td>El Paraíso,
Bachíniva</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+ "<tr><td>2016-06-27T00:01:03.685Z</td><td>Main</td><td>El Terco,
Bachíniva</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+
"<tr><td>2016-06-27T00:01:07.347Z</td><td>Main</td><td>Neqerssuaq</td><td>#ceb.wikipedia</td><td>Lsjbot</td><td></td><td>1</td><td>4150</td><td>4150</td><td>1</td><td>4100.0</td><td>0</td></tr>\n",
+ "</table></div>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.HTML object>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "sql_client.show('SELECT * FROM myWiki1 LIMIT 10')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1152f41",
+ "metadata": {},
+ "source": [
+ "## Datasource Client\n",
+ "\n",
+ "The Datasource client lets you perform operations on datasource objects.
The SQL layer allows you to get metadata and do queries. The datasource client
works with the underlying segments. Explaining the full functionality is the
topic of another notebook. For now, you can use the datasource client to clean
up the datasource created above. The `True` argument asks for \"if exists\"
semantics so you don't get an error if the datasource was already deleted."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "id": "fba659ce",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ds_client = druid.datasources()\n",
+ "ds_client.drop('myWiki1', True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c96fdcc6",
+ "metadata": {},
+ "source": [
+ "## Tasks Client\n",
+ "\n",
+ "Use the tasks client to work with Overlord tasks. The `run_task()` call
above actually uses the task client internally to poll Overlord."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "id": "b4f5ea17",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'id': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8-worker0_0',\n",
+ " 'groupId': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+ " 'type': 'query_worker',\n",
+ " 'createdTime': '2023-02-09T22:49:01.761Z',\n",
+ " 'queueInsertionTime': '1970-01-01T00:00:00.000Z',\n",
+ " 'statusCode': 'SUCCESS',\n",
+ " 'status': 'SUCCESS',\n",
+ " 'runnerStatusCode': 'NONE',\n",
+ " 'duration': 57895,\n",
+ " 'location': {'host': 'localhost', 'port': 8101, 'tlsPort': -1},\n",
+ " 'dataSource': 'myWiki1',\n",
+ " 'errorMsg': None},\n",
+ " {'id': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+ " 'groupId': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+ " 'type': 'query_controller',\n",
+ " 'createdTime': '2023-02-09T22:48:30.512Z',\n",
+ " 'queueInsertionTime': '1970-01-01T00:00:00.000Z',\n",
+ " 'statusCode': 'SUCCESS',\n",
+ " 'status': 'SUCCESS',\n",
+ " 'runnerStatusCode': 'NONE',\n",
+ " 'duration': 92476,\n",
+ " 'location': {'host': 'localhost', 'port': 8100, 'tlsPort': -1},\n",
+ " 'dataSource': 'myWiki1',\n",
+ " 'errorMsg': None}]"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "task_client = druid.tasks()\n",
+ "task_client.tasks()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1deaf95f",
+ "metadata": {},
+ "source": [
+ "## REST Client\n",
+ "\n",
+ "The Druid Python API starts with a REST client that itself is built on
the `requests` package. The REST client implements the common patterns seen in
the Druid REST API. You can create a client directly:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "id": "b1e55635",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from druidapi.rest import DruidRestClient\n",
+ "rest_client = DruidRestClient(\"http://localhost:8888\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dcb8055f",
+ "metadata": {},
+ "source": [
+ "Or, if you have already created the Druid client, you can reuse the
existing REST client. This is how the various other clients work internally."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "370ba76a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rest_client = druid.rest()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2654e72c",
+ "metadata": {},
+ "source": [
+ "Use the REST client if you need to make calls that are not yet wrapped by
the Python API, or if you want to do something special. To illustrate the
client, you can make some of the same calls as in the [Druid REST API
notebook](api_tutorial.ipynb).\n",
+ "\n",
+ "The REST client maintains the Druid host: you just provide the specific URL
tail. There are methods to get or post JSON results. For example, to get status
information:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "id": "9e42dfbc",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "{'version': '26.0.0-SNAPSHOT',\n",
+ " 'modules': [{'name': 'org.apache.druid.common.aws.AWSModule',\n",
+ " 'artifact': 'druid-aws-common',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.common.gcp.GcpModule',\n",
+ " 'artifact': 'druid-gcp-common',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.storage.hdfs.HdfsStorageDruidModule',\n",
+ " 'artifact': 'druid-hdfs-storage',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.indexing.kafka.KafkaIndexTaskModule',\n",
+ " 'artifact': 'druid-kafka-indexing-service',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.theta.SketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.theta.oldapi.OldApiSketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.quantiles.DoublesSketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.hll.HllSketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.query.aggregation.datasketches.kll.KllSketchModule',\n",
+ " 'artifact': 'druid-datasketches',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.msq.guice.MSQExternalDataSourceModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.msq.guice.MSQIndexingModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.msq.guice.MSQDurableStorageModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.msq.guice.MSQServiceClientModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.msq.guice.MSQSqlModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.msq.guice.SqlTaskModule',\n",
+ " 'artifact': 'druid-multi-stage-query',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.server.lookup.namespace.NamespaceExtractionModule',\n",
+ " 'artifact': 'druid-lookups-cached-global',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name':
'org.apache.druid.catalog.guice.CatalogCoordinatorModule',\n",
+ " 'artifact': 'druid-catalog',\n",
+ " 'version': '26.0.0-SNAPSHOT'},\n",
+ " {'name': 'org.apache.druid.catalog.guice.CatalogBrokerModule',\n",
+ " 'artifact': 'druid-catalog',\n",
+ " 'version': '26.0.0-SNAPSHOT'}],\n",
+ " 'memory': {'maxMemory': 134217728,\n",
+ " 'totalMemory': 134217728,\n",
+ " 'freeMemory': 80642696,\n",
+ " 'usedMemory': 53575032,\n",
+ " 'directMemory': 134217728}}"
+ ]
+ },
+ "execution_count": 53,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "rest_client.get_json('/status')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "837e08b0",
+ "metadata": {},
+ "source": [
+ "A quick comparison of the three approaches (Requests, REST client, Python
client):\n",
+ "\n",
+ "Status:\n",
+ "* Requests: `session.get(druid_host + '/status').json()`\n",
+ "* REST client: `rest_client.get_json('/status')`\n",
+ "* Status client: `status_client.status()`\n",
+ "\n",
+ "Health:\n",
+ "* Requests: `session.get(druid_host + '/status/health').json()`\n",
+ "* REST client: `rest_client.get_json('/status/health')`\n",
+ "* Status client: `status_client.is_healthy()`\n",
+ "\n",
+ "Ingest data:\n",
+ "* Requests: See the [REST tutorial](api_tutorial.ipynb)\n",
+ "* REST client: as the REST tutorial, but use
`rest_client.post_json('/druid/v2/sql/task', sql_request)` and\n",
+ "
`rest_client.get_json(f\"/druid/indexer/v1/task/{ingestion_taskId}/status\")`\n",
+ "* SQL client: `sql_client.run_task(sql)`, also a form for a full SQL
request.\n",
+ "\n",
+ "List datasources:\n",
+ "* Requests: `session.get(druid_host +
'/druid/coordinator/v1/datasources').json()`\n",
+ "* REST client:
`rest_client.get_json('/druid/coordinator/v1/datasources')`\n",
+ "* Datasources client: `ds_client.names()`\n",
+ "\n",
+ "Query data:\n",
+ "* Requests: `session.get(druid_host + '/druid/v2/sql',
json=sql_request).json()`\n",
+ "* REST client: `rest_client.get_json('/druid/v2/sql', sql_request)`\n",
Review Comment:
And if there's a query with no headers, etc., I had to use `post_only_json` to
get it to work.
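
For context, a minimal sketch of why a GET-style call is unlikely to work for the "Query data" lines above: Druid's SQL endpoint (`/druid/v2/sql`) takes the query as a JSON body in a POST request. The sketch uses only the Python standard library rather than `druidapi`, and `build_sql_request` is a hypothetical helper name used purely for illustration. It builds the request without sending it, so it runs without a cluster. The router URL is the one the notebook assumes.

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build (but do not send) a POST request for Druid's SQL endpoint.

    Hypothetical helper for illustration; not part of druidapi.
    """
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Router URL assumed by the notebook; change it if your setup differs.
req = build_sql_request("http://localhost:8888", "SELECT 1")
print(req.get_method(), req.full_url)

# To run the query against a live cluster, send the request:
# with urllib.request.urlopen(req) as resp:
#     rows = json.loads(resp.read())
```

If that is the cause, the "Query data" bullets would presumably need a POST-based call (`session.post(...)` for Requests and one of the client's `post_*` methods) instead of `get`/`get_json`.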