ektravel commented on code in PR #13465:
URL: https://github.com/apache/druid/pull/13465#discussion_r1085914389


##########
examples/quickstart/jupyter-notebooks/sql-tutorial.ipynb:
##########
@@ -0,0 +1,765 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "deletable": true,
+    "editable": true,
+    "tags": []
+   },
+   "source": [
+    "# Tutorial: Learn the basics of Druid SQL\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "  \n",
+    "Apache Druid supports two query languages: Druid SQL and native queries.\n",
+    "Druid SQL is a Structured Query Language (SQL) dialect that enables you to query datasources in Apache Druid using SQL statements.\n",
+    "SQL and Druid SQL use similar syntax, with some notable differences.\n",
+    "Not all SQL functions are supported in Druid SQL. Instead, Druid includes Druid-specific SQL functions for optimized query performance.\n",
+    "\n",
+    "This interactive tutorial introduces you to the unique aspects of Druid SQL.\n",
+    "To learn about native queries, see [Native queries](https://druid.apache.org/docs/latest/querying/querying.html)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d6bbbcb",
+   "metadata": {
+    "deletable": true,
+    "tags": []
+   },
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "Make sure that you meet the requirements outlined in the README.md file of the [apache/druid repo](https://github.com/apache/druid/tree/master/examples/quickstart/jupyter-notebooks/).\n",
+    "Specifically, you need the following:\n",
+    "- Knowledge of SQL\n",
+    "- [Python3](https://www.python.org/downloads/)\n",
+    "- [The `requests` package for Python](https://requests.readthedocs.io/en/latest/user/install/)\n",
+    "- [JupyterLab](https://jupyter.org/install#jupyterlab) (recommended) or [Jupyter Notebook](https://jupyter.org/install#jupyter-notebook) running on a non-default port. Druid and Jupyter both default to port `8888`, so you need to start Jupyter on a different port.\n",
+    "- An available Druid instance. This tutorial uses the `micro-quickstart` configuration described in the [Druid quickstart](https://druid.apache.org/docs/latest/tutorials/index.html), so no authentication or authorization is required unless explicitly mentioned. If you haven’t already, download Druid version 24.0 or higher and start Druid services as described in the quickstart."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f8e64f0-c29a-473c-8783-a2ff8648acd7",
+   "metadata": {},
+   "source": [
+    "## Prepare your environment\n",
+    "\n",
+    "This section contains the steps required to prepare your environment to follow along with this tutorial.\n",
+    "\n",
+    "Start by running the following cell. It imports the required Python packages and defines a variable for the Druid host."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service as the `druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSourceName = \"wikipedia-sql-tutorial\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+   "metadata": {},
+   "source": [
+    "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload` variables are updated to accomplish different tasks.\n",
+    "\n",
+    "Run the following cell to ingest data from an external source into a table named `wikipedia-sql-tutorial` using the [multi-stage query (MSQ) task engine](https://druid.apache.org/docs/latest/multi-stage-query/index.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "045f782c-74d8-4447-9487-529071812b51",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO \\\"wikipedia-sql-tutorial\\\" SELECT TIME_PARSE(\\\"timestamp\\\") \\\n",
+    "          AS __time, * FROM TABLE \\\n",
+    "          (EXTERN('{\\\"type\\\": \\\"http\\\", \\\"uris\\\": [\\\"https://druid.apache.org/data/wikipedia.json.gz\\\"]}', '{\\\"type\\\": \\\"json\\\"}', '[{\\\"name\\\": \\\"added\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"channel\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"cityName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"comment\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"commentLength\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"countryIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"countryName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"deleted\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"delta\\\", \\\"type\\\": \\\"long\\\"}, {\\\"name\\\": \\\"deltaBucket\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"diffUrl\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"flags\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isAnonymous\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isMinor\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isNew\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isRobot\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"isUnpatrolled\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"metroCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"namespace\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"page\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionIsoCode\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"regionName\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"timestamp\\\", \\\"type\\\": \\\"string\\\"}, {\\\"name\\\": \\\"user\\\", \\\"type\\\": \\\"string\\\"}]')) \\\n",
+    "          PARTITIONED BY DAY\",\n",
+    "  \"context\": {\n",
+    "    \"maxNumTasks\": 3\n",
+    "  }\n",
+    "})\n",
+    "\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "ingestion_taskId = json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "print(f\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nInserting data into the table named {dataSourceName}\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + response.text + \".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ceb86ce0-85f6-4c63-8fd6-883033ee96e9",
+   "metadata": {},
+   "source": [
+    "Wait for ingestion to complete before proceeding.\n",
+    "To check on the status of your ingestion task, run the following cell. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "df12d12c-a067-4759-bae0-0410c24b6205",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "ingestion_status = json.loads(response.text)['status']['status']\n",
+    "# If you only want to fetch the status once and print it, \n",
+    "# uncomment the print statement and comment out the if and while loops\n",
+    "# print(json.dumps(response.json(), indent=4))\n",
+    "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "  print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status != \"SUCCESS\":\n",
+    "  response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "  ingestion_status = json.loads(response.text)['status']['status']\n",
+    "  time.sleep(15)  \n",
+    "  \n",
+    "if ingestion_status == \"SUCCESS\": \n",
+    "  print(\"The ingestion is complete:\")\n",
+    "  print(json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "240b0ad5-48f2-4737-b12b-5fd5f98da300",
+   "metadata": {},
+   "source": [
+    "## Datasources\n",
+    "\n",
+    "Druid supports a variety of datasources, with the table datasource being the most common. In Druid documentation, the word \"datasource\" often implicitly refers to the table datasource.\n",
+    "The [Datasources](https://druid.apache.org/docs/latest/querying/datasource.html) topic provides a comprehensive overview of datasources supported by Druid SQL.\n",
+    "\n",
+    "In Druid SQL, table datasources reside in the `druid` schema. This is the default schema, so table datasources can be referenced as either `druid.dataSourceName` or `dataSourceName`.\n",
+    "\n",
+    "For example, run the next cell to return the rows of the column named `channel` from the `wikipedia-sql-tutorial` table. Because this tutorial is running in Jupyter, make sure to limit the size of your query results for display purposes using the SQL LIMIT clause."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "91dd255a-4d55-493e-a067-4cef5c659657",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\":\"SELECT \\\"channel\\\" FROM \\\"wikipedia-sql-tutorial\\\" LIMIT 7\"})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(f\"\\nEach JSON object in the response represents a row in the {dataSourceName} datasource.\") \n",
+    "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbeb5a63",
+   "metadata": {
+    "deletable": true,
+    "tags": []
+   },
+   "source": [
+    "## Data types\n",
+    "\n",
+    "Druid maps SQL data types onto native types at query runtime.\n",
+    "The following native types are supported for Druid columns:\n",
+    "\n",
+    "* STRING: UTF-8 encoded strings and string arrays\n",
+    "* LONG: 64-bit signed int\n",
+    "* FLOAT: 32-bit float\n",
+    "* DOUBLE: 64-bit float\n",
+    "* COMPLEX: represents non-standard data types, such as nested JSON, hyperUnique and approxHistogram aggregators, and DataSketches aggregators\n",
+    "\n",
+    "Druid exposes table and column metadata through [INFORMATION_SCHEMA](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#information-schema) tables. Run the following query to retrieve metadata for the `wikipedia-sql-tutorial` datasource. In the response body, each JSON object correlates to a column in the table.\n",
+    "Check the objects' `DATA_TYPE` property for SQL data types. You should see TIMESTAMP, BIGINT, and VARCHAR SQL data types. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9227d6c-1d8c-4169-b13b-a08625c4011f",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\":\"SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE \\\"TABLE_SCHEMA\\\" = 'druid' AND \\\"TABLE_NAME\\\" = 'wikipedia-sql-tutorial' LIMIT 7\"\n",
+    "})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, headers=headers, data=payload)\n",
+    "\n",
+    "print(\"\\033[1mQuery\\033[0m:\\n\" + payload)\n",
+    "print(\"\\n\\033[1mResponse\\033[0m: \\n\" + json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c59ca797-dd91-442b-8d02-67b711b3fcc6",
+   "metadata": {},
+   "source": [
+    "Druid natively interprets the VARCHAR SQL data type as STRING, and the BIGINT and TIMESTAMP SQL data types as LONG. For reference on how SQL data types map onto Druid native types, see [Standard types](https://druid.apache.org/docs/latest/querying/sql-data-types.html#standard-types).\n",
+    "\n",
+    "### Timestamp values\n",
+    "\n",
+    "Druid stores timestamp values as the number of milliseconds since the Unix epoch.\n",
+    "Primary timestamps are stored in a column named `__time`.\n",
+    "If a dataset doesn't have a timestamp, Druid uses the default value of `1970-01-01 00:00:00`.\n",
+    "\n",
+    "Druid time functions perform best when used with the `__time` column.\n",
+    "By default, time functions use the UTC time zone.\n",
+    "For more information about timestamp handling, see [Date and time functions](https://druid.apache.org/docs/latest/querying/sql-scalar.html#date-and-time-functions).\n",
+    "\n",
+    "Run the following cell to see the Druid SQL `TIME_IN_INTERVAL` scalar function at work. This query checks whether a timestamp is contained within a specified interval. The results are grouped by date."

Review Comment:
   I've updated the text and the example code.
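   One suggestion for the status-polling cell: as written, the `while` loop only exits on `SUCCESS`, so a failed ingestion task would poll forever. A minimal sketch of a loop that stops on any terminal state (the `wait_for_ingestion` helper and the stubbed status sequence below are illustrative, not part of the notebook; `RUNNING`/`SUCCESS`/`FAILED` model the Druid task statuses):
   
   ```python
   import time
   
   def wait_for_ingestion(fetch_status, poll_seconds=0, max_polls=100):
       """Poll fetch_status() until the task reaches a terminal state.
   
       fetch_status stands in for the GET request to
       /druid/indexer/v1/task/{taskId}/status in the notebook.
       """
       for _ in range(max_polls):
           status = fetch_status()
           # Stop on either terminal state so a FAILED task doesn't loop forever.
           if status in ("SUCCESS", "FAILED"):
               return status
           time.sleep(poll_seconds)
       raise TimeoutError("Ingestion did not reach a terminal state in time")
   
   # Usage with a stubbed status sequence instead of a live Druid call:
   statuses = iter(["RUNNING", "RUNNING", "SUCCESS"])
   result = wait_for_ingestion(lambda: next(statuses))
   print(result)  # -> SUCCESS
   ```
   
   In the notebook itself, `fetch_status` would wrap the existing `requests.request(...)` call and return `json.loads(response.text)['status']['status']`.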



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

