[GitHub] [druid] techdocsmith commented on a diff in pull request #13787: Python Druid API for use in notebooks

via GitHub Thu, 23 Feb 2023 15:58:44 -0800


techdocsmith commented on code in PR #13787:
URL: https://github.com/apache/druid/pull/13787#discussion_r1116368644



##########
examples/quickstart/jupyter-notebooks/Python_API_Tutorial.ipynb:
##########
@@ -0,0 +1,1281 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ce2efaaa",
+   "metadata": {},
+   "source": [
+    "# Learn the Druid Python API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "This notebook provides a quick introduction to the Python wrapper around 
the [Druid REST API](api-tutorial.ipynb). This notebook assumes you are 
familiar with the basics of the REST API, and the [set of operations which 
Druid 
provides](https://druid.apache.org/docs/latest/operations/api-reference.html). 
This tutorial focuses on using Python to access those APIs rather than 
explaining the APIs themselves. The APIs themselves are covered in other 
notebooks that use the Python API.\n",
+    "\n",
+    "The Druid Python API is primarily intended to help with these notebook 
tutorials. It can also be used in your own ad-hoc notebooks, or in a regular 
Python program.\n",
+    "\n",
+    "The Druid Python API is a work in progress. The Druid team adds API 
wrappers as needed for the notebook tutorials. If you find you need additional 
wrappers, please feel free to add them, and post a PR to Apache Druid with your 
additions.\n",
+    "\n",
+    "The API provides two levels of functions. Most are simple wrappers around 
Druid's REST APIs. Others add additional code to make the API easier to use. 
The SQL query interface is a prime example: extra code translates a simple SQL 
query into Druid's `SQLQuery` object and interprets the results into a form 
that can be displayed in a notebook.\n",
+    "\n",
+    "Start by importing the `druidapi` package from the same folder as this 
notebook. The `styles()` call adds some CSS styles needed to display results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "6d90ca5d",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "\n",
+       "<style>\n",
+       "  .druid table {\n",
+       "    border: 1px solid black;\n",
+       "    border-collapse: collapse;\n",
+       "  }\n",
+       "\n",
+       "  .druid th, .druid td {\n",
+       "    padding: 4px 1em ;\n",
+       "    text-align: left;\n",
+       "  }\n",
+       "\n",
+       "  td.druid-right, th.druid-right {\n",
+       "    text-align: right;\n",
+       "  }\n",
+       "\n",
+       "  td.druid-center, th.druid-center {\n",
+       "    text-align: center;\n",
+       "  }\n",
+       "\n",
+       "  .druid .druid-left {\n",
+       "    text-align: left;\n",
+       "  }\n",
+       "\n",
+       "  .druid-alert {\n",
+       "    color: red;\n",
+       "  }\n",
+       "</style>\n"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "import druidapi\n",
+    "druidapi.styles()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb68a838",
+   "metadata": {},
+   "source": [
+    "Next, connect to your cluster by providing the router endpoint. The code 
assumes the cluster is on your local machine, using the default port. Go ahead 
and change this if your setup is different.\n",
+    "\n",
+    "The API uses the router to forward messages to each of Druid's services 
so that you don't have to keep track of the host and port for each service."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "ae601081",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "druid = druidapi.client('http://localhost:8888')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8b4e774b",
+   "metadata": {},
+   "source": [
+    "## Status Client\n",
+    "\n",
+    "The Python API groups Druid REST API calls into categories, with a client for 
each. Start with the status client."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "ff16fc3b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "status_client = druid.status()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "be992774",
+   "metadata": {},
+   "source": [
+    "Use the Python `help()` function to learn what methods are available."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "03f26417",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Help on StatusClient in module druidapi.status object:\n",
+      "\n",
+      "class StatusClient(builtins.object)\n",
+      " |  StatusClient(rest_client)\n",
+      " |  \n",
+      " |  Client for status APIs. These APIs are available on all nodes.\n",
+      " |  If used with the router, they report the status of just the 
router.\n",
+      " |  \n",
+      " |  Methods defined here:\n",
+      " |  \n",
+      " |  __init__(self, rest_client)\n",
+      " |      Initialize self.  See help(type(self)) for accurate 
signature.\n",
+      " |  \n",
+      " |  brokers(self)\n",
+      " |  \n",
+      " |  in_cluster(self)\n",
+      " |      Returns `True` if the node is visible within the cluster, 
`False` if not.\n",
+      " |      (That is, returns the value of the `{\"selfDiscovered\": 
true/false}`\n",
+      " |      field in the response.)\n",
+      " |      \n",
+      " |      GET `/status/selfDiscovered/status`\n",
+      " |      \n",
+      " |      See 
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+      " |  \n",
+      " |  is_healthy(self) -> bool\n",
+      " |      Returns `True` if the node is healthy, an exception 
otherwise.\n",
+      " |      Useful for automated health checks.\n",
+      " |      \n",
+      " |      GET `/status/health`\n",
+      " |      \n",
+      " |      See 
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+      " |  \n",
+      " |  properties(self) -> map\n",
+      " |      Returns the effective set of Java properties used by the 
service, including\n",
+      " |      system properties and properties from the 
`common.runtime.properties` and\n",
+      " |      `runtime.properties` files.\n",
+      " |      \n",
+      " |      GET `/status/properties`\n",
+      " |      \n",
+      " |      See 
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+      " |  \n",
+      " |  status(self)\n",
+      " |      Returns the Druid version, loaded extensions, memory used, 
total memory \n",
+      " |      and other useful information about the process.\n",
+      " |      \n",
+      " |      GET `/status`\n",
+      " |      \n",
+      " |      See 
https://druid.apache.org/docs/latest/operations/api-reference.html#process-information\n",
+      " |  \n",
+      " |  version(self)\n",
+      " |  \n",
+      " |  wait_until_ready(self)\n",
+      " |  \n",
+      " |  
----------------------------------------------------------------------\n",
+      " |  Data descriptors defined here:\n",
+      " |  \n",
+      " |  __dict__\n",
+      " |      dictionary for instance variables (if defined)\n",
+      " |  \n",
+      " |  __weakref__\n",
+      " |      list of weak references to the object (if defined)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "help(status_client)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70f3d578",
+   "metadata": {},
+   "source": [
+    "Druid servers return unexpected results if you make REST calls while 
Druid starts up. The following call blocks until the server is ready. If you 
forgot to start your server, or the URL above is wrong, the call waits forever. 
Use the Kernel &rarr; Interrupt command to break out of the function, or 
start your server. If your server refuses to start, this Jupyter notebook 
may already be using port 8888. See the [README](README.md) for how to start on a 
different port."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "114ed0d1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "status_client.wait_until_ready()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e803c9fe",
+   "metadata": {},
+   "source": [
+    "Check the version of your cluster. Some of these notebooks illustrate 
newer features available only on specific versions of Druid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "2faa0d81",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'26.0.0-SNAPSHOT'"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "status_client.version()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d78a6c35",
+   "metadata": {},
+   "source": [
+    "You can also check which extensions are loaded in your cluster. Some 
notebooks require specific extensions to be available."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "1001f412",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'[\"druid-hdfs-storage\", \"druid-kafka-indexing-service\", 
\"druid-datasketches\", \"druid-multi-stage-query\", 
\"druid-lookups-cached-global\", \"druid-catalog\"]'"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "status_client.properties()['druid.extensions.loadList']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8825ca39",
+   "metadata": {},
+   "source": [
+    "## SQL Client\n",
+    "\n",
+    "Running SQL queries in a notebook is easy. Here is an example of how to 
run a query and display results. The 
[pydruid](https://pythonhosted.org/pydruid/) library provides a robust way to 
run native queries, to run SQL queries, and to convert the results to various 
formats. Here the goal is just to interact with Druid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "6be0c745",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql_client = druid.sql()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d051bc5e",
+   "metadata": {},
+   "source": [
+    "Start by getting a list of schemas."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "dd8387e0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>SchemaName</th></tr>\n",
+       "<tr><td>INFORMATION_SCHEMA</td></tr>\n",
+       "<tr><td>druid</td></tr>\n",
+       "<tr><td>ext</td></tr>\n",
+       "<tr><td>lookup</td></tr>\n",
+       "<tr><td>sys</td></tr>\n",
+       "<tr><td>view</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql_client.show_schemas()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b8261ab0",
+   "metadata": {},
+   "source": [
+    "Then, retrieve the tables (or datasources) within any schema."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "64dcb46a",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>TableName</th></tr>\n",
+       "<tr><td>COLUMNS</td></tr>\n",
+       "<tr><td>PARAMETERS</td></tr>\n",
+       "<tr><td>SCHEMATA</td></tr>\n",
+       "<tr><td>TABLES</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql_client.show_tables('INFORMATION_SCHEMA')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff311595",
+   "metadata": {},
+   "source": [
+    "With no argument, `show_tables()` lists your datasources. You'll get an empty 
result if you have no datasources yet."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "616770ce",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>TableName</th></tr>\n",
+       "<tr><td>myWiki</td></tr>\n",
+       "<tr><td>myWiki3</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql_client.show_tables()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7392e484",
+   "metadata": {},
+   "source": [
+    "You can easily run a query and show the results:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "2c649eef",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>TABLE_NAME</th></tr>\n",
+       "<tr><td>COLUMNS</td></tr>\n",
+       "<tr><td>PARAMETERS</td></tr>\n",
+       "<tr><td>SCHEMATA</td></tr>\n",
+       "<tr><td>TABLES</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql = '''\n",
+    "SELECT TABLE_NAME\n",
+    "FROM INFORMATION_SCHEMA.TABLES\n",
+    "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+    "'''\n",
+    "sql_client.show(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c6c4e1d4",
+   "metadata": {},
+   "source": [
+    "The query above showed the same results as `show_tables('INFORMATION_SCHEMA')`. That is not 
surprising: `show_tables()` just runs this query for you."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b944084",
+   "metadata": {},
+   "source": [
+    "The API also allows passing context parameters and query parameters using 
a request object. Druid will work out the query parameter type based on the 
Python type. Pass context values as a Python `dict`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "dd559827",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>TABLE_NAME</th></tr>\n",
+       "<tr><td>COLUMNS</td></tr>\n",
+       "<tr><td>PARAMETERS</td></tr>\n",
+       "<tr><td>SCHEMATA</td></tr>\n",
+       "<tr><td>TABLES</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql = '''\n",
+    "SELECT TABLE_NAME\n",
+    "FROM INFORMATION_SCHEMA.TABLES\n",
+    "WHERE TABLE_SCHEMA = ?\n",
+    "'''\n",
+    "req = sql_client.sql_request(sql)\n",
+    "req.add_parameter('INFORMATION_SCHEMA')\n",
+    "req.with_context({\"someParameter\": \"someValue\"})\n",
+    "sql_client.show(req)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "937dc6b1",
+   "metadata": {},
+   "source": [
+    "The request has other features for advanced use cases: see the code for 
details. The query API, `sql_query()`, returns a SQL response object. Use this if you 
want to get the values directly, work with the schema, etc."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "fd7a1827",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT TABLE_NAME\n",
+    "FROM INFORMATION_SCHEMA.TABLES\n",
+    "WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "2fe6a749",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "TABLE_NAME VARCHAR string\n"
+     ]
+    }
+   ],
+   "source": [
+    "col1 = resp.schema()[0]\n",
+    "print(col1.name, col1.sql_type, col1.druid_type)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "id": "41d27bb1",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[{'TABLE_NAME': 'COLUMNS'},\n",
+       " {'TABLE_NAME': 'PARAMETERS'},\n",
+       " {'TABLE_NAME': 'SCHEMATA'},\n",
+       " {'TABLE_NAME': 'TABLES'}]"
+      ]
+     },
+     "execution_count": 24,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "resp.rows()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "481af1f2",
+   "metadata": {},
+   "source": [
+    "The `show()` method uses this information to format an HTML table to 
present the results."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9e3be017",
+   "metadata": {},
+   "source": [
+    "## MSQ Ingestion\n",
+    "\n",
+    "The SQL client also performs MSQ-based ingestion using `INSERT` or 
`REPLACE` statements. Use the extension check above to ensure that 
`druid-multi-stage-query` is loaded in Druid 26. (Later versions may have MSQ 
built in.)\n",
+    "\n",
+    "An MSQ query runs through a different API: `task()`. This API returns a 
response object that describes the Overlord task which runs the MSQ query. For 
tutorials, data is usually small enough that you can wait for the ingestion to 
complete. Do that with the `run_task()` call, which handles the waiting. To 
illustrate, here is a query that ingests a subset of columns and includes a 
few data clean-up steps:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "10f1e451",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "REPLACE INTO \"myWiki1\" OVERWRITE ALL\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+    "  namespace,\n",
+    "  page,\n",
+    "  channel,\n",
+    "  \"user\",\n",
+    "  countryName,\n",
+    "  CASE WHEN isRobot = 'true' THEN 1 ELSE 0 END AS isRobot,\n",
+    "  \"added\",\n",
+    "  \"delta\",\n",
+    "  CASE WHEN isNew = 'true' THEN 1 ELSE 0 END AS isNew,\n",
+    "  CAST(\"deltaBucket\" AS DOUBLE) AS deltaBucket,\n",
+    "  \"deleted\"\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+    "    '{\"type\":\"json\"}',\n",
+    "    
'[{\"name\":\"isRobot\",\"type\":\"string\"},{\"name\":\"channel\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"flags\",\"type\":\"string\"},{\"name\":\"isUnpatrolled\",\"type\":\"string\"},{\"name\":\"page\",\"type\":\"string\"},{\"name\":\"diffUrl\",\"type\":\"string\"},{\"name\":\"added\",\"type\":\"long\"},{\"name\":\"comment\",\"type\":\"string\"},{\"name\":\"commentLength\",\"type\":\"long\"},{\"name\":\"isNew\",\"type\":\"string\"},{\"name\":\"isMinor\",\"type\":\"string\"},{\"name\":\"delta\",\"type\":\"long\"},{\"name\":\"isAnonymous\",\"type\":\"string\"},{\"name\":\"user\",\"type\":\"string\"},{\"name\":\"deltaBucket\",\"type\":\"long\"},{\"name\":\"deleted\",\"type\":\"long\"},{\"name\":\"namespace\",\"type\":\"string\"},{\"name\":\"cityName\",\"type\":\"string\"},{\"name\":\"countryName\",\"type\":\"string\"},{\"name\":\"regionIsoCode\",\"type\":\"string\"},{\"name\":\"metroCode\",\"type\":\"long\"},{\"name\":\"countryIsoCode\",
 \"type\":\"string\"},{\"name\":\"regionName\",\"type\":\"string\"}]'\n",
+    "  )\n",
+    ")\n",
+    "PARTITIONED BY DAY\n",
+    "CLUSTERED BY namespace, page\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "d752b1d4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql_client.run_task(sql)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ef4512f8",
+   "metadata": {},
+   "source": [
+    "MSQ reports task completion as soon as ingestion is done. However, it 
takes a while for Druid to load the resulting segments. Wait for the table to 
become ready."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "37fcedf2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql_client.wait_until_ready('myWiki1')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11d9c95a",
+   "metadata": {},
+   "source": [
+    "`describe_table()` lists the columns in a table."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "b662697b",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       "<tr><th>Position</th><th>Name</th><th>Type</th></tr>\n",
+       "<tr><td>1</td><td>__time</td><td>TIMESTAMP</td></tr>\n",
+       "<tr><td>2</td><td>namespace</td><td>VARCHAR</td></tr>\n",
+       "<tr><td>3</td><td>page</td><td>VARCHAR</td></tr>\n",
+       "<tr><td>4</td><td>channel</td><td>VARCHAR</td></tr>\n",
+       "<tr><td>5</td><td>user</td><td>VARCHAR</td></tr>\n",
+       "<tr><td>6</td><td>countryName</td><td>VARCHAR</td></tr>\n",
+       "<tr><td>7</td><td>isRobot</td><td>BIGINT</td></tr>\n",
+       "<tr><td>8</td><td>added</td><td>BIGINT</td></tr>\n",
+       "<tr><td>9</td><td>delta</td><td>BIGINT</td></tr>\n",
+       "<tr><td>10</td><td>isNew</td><td>BIGINT</td></tr>\n",
+       "<tr><td>11</td><td>deltaBucket</td><td>DOUBLE</td></tr>\n",
+       "<tr><td>12</td><td>deleted</td><td>BIGINT</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql_client.describe_table('myWiki1')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "936f57fb",
+   "metadata": {},
+   "source": [
+    "You can sample a few rows of data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "id": "c4cfa5dc",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div class=\"druid\"><table>\n",
+       
"<tr><th>__time</th><th>namespace</th><th>page</th><th>channel</th><th>user</th><th>countryName</th><th>isRobot</th><th>added</th><th>delta</th><th>isNew</th><th>deltaBucket</th><th>deleted</th></tr>\n",
+       "<tr><td>2016-06-27T00:00:11.080Z</td><td>Main</td><td>Salo 
Toraut</td><td>#sv.wikipedia</td><td>Lsjbot</td><td></td><td>1</td><td>31</td><td>31</td><td>1</td><td>0.0</td><td>0</td></tr>\n",
+       
"<tr><td>2016-06-27T00:00:17.457Z</td><td>利用者</td><td>利用者:ワーナー成増/放送ウーマン賞</td><td>#ja.wikipedia</td><td>ワーナー成増</td><td></td><td>0</td><td>125</td><td>125</td><td>0</td><td>100.0</td><td>0</td></tr>\n",
+       "<tr><td>2016-06-27T00:00:34.959Z</td><td>Main</td><td>Bailando 
2015</td><td>#en.wikipedia</td><td>181.230.118.178</td><td>Argentina</td><td>0</td><td>2</td><td>2</td><td>0</td><td>0.0</td><td>0</td></tr>\n",
+       "<tr><td>2016-06-27T00:00:36.027Z</td><td>Main</td><td>Richie 
Rich&#x27;s Christmas 
Wish</td><td>#en.wikipedia</td><td>JasonAQuest</td><td></td><td>0</td><td>0</td><td>-2</td><td>0</td><td>-100.0</td><td>2</td></tr>\n",
+       "<tr><td>2016-06-27T00:00:46.874Z</td><td>Main</td><td>El Olivo, 
Ascensión</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+       "<tr><td>2016-06-27T00:00:56.913Z</td><td>Main</td><td>Blowback 
(intelligence)</td><td>#en.wikipedia</td><td>Brokenshardz</td><td></td><td>0</td><td>76</td><td>76</td><td>0</td><td>0.0</td><td>0</td></tr>\n",
+       
"<tr><td>2016-06-27T00:00:58.599Z</td><td>Kategoria</td><td>Kategoria:Dyskusje 
nad usunięciem artykułu zakończone bez konsensusu − lipiec 
2016</td><td>#pl.wikipedia</td><td>Beau.bot</td><td></td><td>1</td><td>270</td><td>270</td><td>1</td><td>200.0</td><td>0</td></tr>\n",
+       "<tr><td>2016-06-27T00:01:01.364Z</td><td>Main</td><td>El Paraíso, 
Bachíniva</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+       "<tr><td>2016-06-27T00:01:03.685Z</td><td>Main</td><td>El Terco, 
Bachíniva</td><td>#sh.wikipedia</td><td>Kolega2357</td><td></td><td>1</td><td>0</td><td>-1</td><td>0</td><td>-100.0</td><td>1</td></tr>\n",
+       
"<tr><td>2016-06-27T00:01:07.347Z</td><td>Main</td><td>Neqerssuaq</td><td>#ceb.wikipedia</td><td>Lsjbot</td><td></td><td>1</td><td>4150</td><td>4150</td><td>1</td><td>4100.0</td><td>0</td></tr>\n",
+       "</table></div>"
+      ],
+      "text/plain": [
+       "<IPython.core.display.HTML object>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "sql_client.show('SELECT * FROM myWiki1 LIMIT 10')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1152f41",
+   "metadata": {},
+   "source": [
+    "## Datasource Client\n",
+    "\n",
+    "The Datasource client lets you perform operations on datasource objects. 
The SQL layer allows you to get metadata and do queries. The datasource client 
works with the underlying segments. Explaining the full functionality is the 
topic of another notebook. For now, you can use the datasource client to clean 
up the datasource created above. The `True` argument asks for \"if exists\" 
semantics so you don't get an error if the datasource was already deleted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 37,
+   "id": "fba659ce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ds_client = druid.datasources()\n",
+    "ds_client.drop('myWiki1', True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c96fdcc6",
+   "metadata": {},
+   "source": [
+    "## Tasks Client\n",
+    "\n",
+    "Use the tasks client to work with Overlord tasks. The `run_task()` call 
above actually uses the tasks client internally to poll the Overlord."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 40,
+   "id": "b4f5ea17",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[{'id': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8-worker0_0',\n",
+       "  'groupId': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+       "  'type': 'query_worker',\n",
+       "  'createdTime': '2023-02-09T22:49:01.761Z',\n",
+       "  'queueInsertionTime': '1970-01-01T00:00:00.000Z',\n",
+       "  'statusCode': 'SUCCESS',\n",
+       "  'status': 'SUCCESS',\n",
+       "  'runnerStatusCode': 'NONE',\n",
+       "  'duration': 57895,\n",
+       "  'location': {'host': 'localhost', 'port': 8101, 'tlsPort': -1},\n",
+       "  'dataSource': 'myWiki1',\n",
+       "  'errorMsg': None},\n",
+       " {'id': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+       "  'groupId': 'query-24066a63-7e20-41bb-b212-80f193e6f2c8',\n",
+       "  'type': 'query_controller',\n",
+       "  'createdTime': '2023-02-09T22:48:30.512Z',\n",
+       "  'queueInsertionTime': '1970-01-01T00:00:00.000Z',\n",
+       "  'statusCode': 'SUCCESS',\n",
+       "  'status': 'SUCCESS',\n",
+       "  'runnerStatusCode': 'NONE',\n",
+       "  'duration': 92476,\n",
+       "  'location': {'host': 'localhost', 'port': 8100, 'tlsPort': -1},\n",
+       "  'dataSource': 'myWiki1',\n",
+       "  'errorMsg': None}]"
+      ]
+     },
+     "execution_count": 40,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "task_client = druid.tasks()\n",
+    "task_client.tasks()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1deaf95f",
+   "metadata": {},
+   "source": [
+    "## REST Client\n",
+    "\n",
+    "The Druid Python API starts with a REST client that itself is built on 
the `requests` package. The REST client implements the common patterns seen in 
the Druid REST API. You can create a client directly:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "b1e55635",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from druidapi.rest import DruidRestClient\n",
+    "rest_client = DruidRestClient(\"http://localhost:8888\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcb8055f",
+   "metadata": {},
+   "source": [
+    "Or, if you have already created the Druid client, you can reuse the 
existing REST client. This is how the various other clients work internally."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "370ba76a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rest_client = druid.rest()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2654e72c",
+   "metadata": {},
+   "source": [
+    "Use the REST client if you need to make calls that are not yet wrapped by 
the Python API, or if you want to do something special. To illustrate the 
client, you can make some of the same calls as in the [Druid REST API 
notebook](api-tutorial.ipynb).\n",
+    "\n",
+    "The REST client maintains the Druid host: you just provide the specific URL 
tail. There are methods to get or post JSON results. For example, to get status 
information:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 53,
+   "id": "9e42dfbc",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'version': '26.0.0-SNAPSHOT',\n",
+       " 'modules': [{'name': 'org.apache.druid.common.aws.AWSModule',\n",
+       "   'artifact': 'druid-aws-common',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.common.gcp.GcpModule',\n",
+       "   'artifact': 'druid-gcp-common',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.storage.hdfs.HdfsStorageDruidModule',\n",
+       "   'artifact': 'druid-hdfs-storage',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.indexing.kafka.KafkaIndexTaskModule',\n",
+       "   'artifact': 'druid-kafka-indexing-service',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.theta.SketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.theta.oldapi.OldApiSketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.quantiles.DoublesSketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.tuple.ArrayOfDoublesSketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.hll.HllSketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.query.aggregation.datasketches.kll.KllSketchModule',\n",
+       "   'artifact': 'druid-datasketches',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.msq.guice.MSQExternalDataSourceModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.msq.guice.MSQIndexingModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.msq.guice.MSQDurableStorageModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.msq.guice.MSQServiceClientModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.msq.guice.MSQSqlModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.msq.guice.SqlTaskModule',\n",
+       "   'artifact': 'druid-multi-stage-query',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.server.lookup.namespace.NamespaceExtractionModule',\n",
+       "   'artifact': 'druid-lookups-cached-global',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 
'org.apache.druid.catalog.guice.CatalogCoordinatorModule',\n",
+       "   'artifact': 'druid-catalog',\n",
+       "   'version': '26.0.0-SNAPSHOT'},\n",
+       "  {'name': 'org.apache.druid.catalog.guice.CatalogBrokerModule',\n",
+       "   'artifact': 'druid-catalog',\n",
+       "   'version': '26.0.0-SNAPSHOT'}],\n",
+       " 'memory': {'maxMemory': 134217728,\n",
+       "  'totalMemory': 134217728,\n",
+       "  'freeMemory': 80642696,\n",
+       "  'usedMemory': 53575032,\n",
+       "  'directMemory': 134217728}}"
+      ]
+     },
+     "execution_count": 53,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "rest_client.get_json('/status')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "837e08b0",
+   "metadata": {},
+   "source": [
+    "A quick comparison of the three approaches (Requests, REST client, Python 
client):\n",
+    "\n",
+    "Status:\n",
+    "* Requests: `session.get(druid_host + '/status').json()`\n",
+    "* REST client: `rest_client.get_json('/status')`\n",
+    "* Status client: `status_client.status()`\n",
+    "\n",
+    "Health:\n",
+    "* Requests: `session.get(druid_host + '/status/health').json()`\n",
+    "* REST client: `rest_client.get_json('/status/health')`\n",
+    "* Status client: `status_client.is_healthy()`\n",
+    "\n",
+    "Ingest data:\n",
+    "* Requests: See the [REST tutorial](api_tutorial.ipynb)\n",
+    "* REST client: same as the REST tutorial, but use 
`rest_client.post_json('/druid/v2/sql/task', sql_request)` and\n",
+    "  
`rest_client.get_json(f\"/druid/indexer/v1/task/{ingestion_taskId}/status\")`\n",
+    "* SQL client: `sql_client.run_task(sql)`; there is also a form that accepts a 
full SQL request.\n",
+    "\n",
+    "List datasources:\n",
+    "* Requests: `session.get(druid_host + 
'/druid/coordinator/v1/datasources').json()`\n",
+    "* REST client: 
`rest_client.get_json('/druid/coordinator/v1/datasources')`\n",
+    "* Datasources client: `ds_client.names()`\n",
+    "\n",
+    "Query data:\n",
+    "* Requests: `session.post(druid_host + '/druid/v2/sql', 
json=sql_request).json()`\n",
+    "* REST client: `rest_client.post_json('/druid/v2/sql', sql_request)`\n",

Review Comment:
   Also, if there's a query with no headers, etc., I had to use `post_only_json` to 
get it to work.
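
For anyone comparing the approaches in the list above, here is a rough sketch of the pattern the notebook describes: the client keeps the host, and callers supply only the URL tail. This is not the `druidapi` implementation. `SketchRestClient`, `FakeSession`, and `FakeResponse` are hypothetical names made up for illustration, and the fake session only echoes the request so the sketch runs without a Druid cluster. With a real cluster you would presumably pass a `requests.Session()` instead.

```python
class SketchRestClient:
    """Hypothetical stand-in for a REST wrapper; NOT the druidapi class."""

    def __init__(self, endpoint, session):
        # Store the host once so callers pass only the URL tail.
        self.endpoint = endpoint.rstrip('/')
        self.session = session

    def build_url(self, tail):
        return self.endpoint + tail

    def get_json(self, tail):
        # GET the URL and decode the JSON response body.
        response = self.session.get(self.build_url(tail))
        response.raise_for_status()
        return response.json()

    def post_json(self, tail, body):
        # POST a JSON request body and decode the JSON response body.
        response = self.session.post(self.build_url(tail), json=body)
        response.raise_for_status()
        return response.json()


class FakeResponse:
    """Minimal response object exposing the two methods the sketch uses."""

    def __init__(self, payload):
        self._payload = payload

    def raise_for_status(self):
        pass  # The fake never fails.

    def json(self):
        return self._payload


class FakeSession:
    """Records calls and echoes them back so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def get(self, url):
        self.calls.append(('GET', url, None))
        return FakeResponse({'url': url})

    def post(self, url, json=None):
        self.calls.append(('POST', url, json))
        return FakeResponse({'url': url, 'body': json})


session = FakeSession()
client = SketchRestClient('http://localhost:8888', session)
status = client.get_json('/status')
result = client.post_json('/druid/v2/sql', {'query': 'SELECT 1'})
```

The sketch sends the SQL request with POST, which is the distinction this comment is about: a GET helper has no natural place for a request body.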


