317brian commented on code in PR #13718:
URL: https://github.com/apache/druid/pull/13718#discussion_r1092398018
##########
examples/quickstart/jupyter-notebooks/nested-columns-tutorial.ipynb:
##########
@@ -0,0 +1,556 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "ad4e60b6",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Working with nested columns\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This tutorial demonstrates how to work with [nested
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in
Apache Druid.\n",
+ "\n",
+ "Druid stores nested data structures in `COMPLEX<json>` columns. In this
tutorial, you perform the following tasks:\n",
+ "\n",
+ "- Ingest nested JSON data using SQL-based ingestion.\n",
+ "- Transform nested data during ingestion using SQL JSON functions.\n",
+ "- Perform queries to display, filter, and aggregate nested data.\n",
+ "\n",
+ "Druid supports directly ingesting nested data with the following formats:
JSON, Parquet, Avro, ORC, Protobuf.\n",
+ "\n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest nested data](#Ingest-nested-data)\n",
+ "- [Transform nested data](#Transform-nested-data)\n",
+ "- [Query nested data](#Query-nested-data)\n",
+ "- [Learn more](#Learn-more)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e4638f98",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "You need to install the requests library for Python before you
start—for example:\n",
+ "\n",
+ "```bash\n",
+ "pip3 install requests\n",
+ "```\n",
+ "\n",
+ "Next, you need a Druid cluster. This tutorial uses the configuration
described in the Druid [Quickstart
(local)](https://druid.apache.org/docs/latest/tutorials/index.html). Download
Druid from the quickstart page. In the root of the Druid folder, run the
following command to start Druid:\n",
+ "\n",
+ "```bash\n",
+ "./bin/start-druid\n",
+ "```\n",
+ "\n",
+ "Finally, you need either JupyterLab (recommended) or Jupyter Notebook.
Visit the [Jupyter site](https://jupyter.org/) if you want to learn more about
these interfaces.\n",
+ "\n",
+ "Both the quickstart Druid cluster and Jupyter deploy at `localhost:8888`
by default, so you need to change the port for Jupyter. To do this, stop
Jupyter if it's running and start it with the `port` parameter included. For
example, you can use the following command to start Jupyter on port `3001`:\n",
+ "\n",
+ "```bash\n",
+ "# If you're using JupyterLab\n",
+ "jupyter lab --port 3001\n",
+ "# If you're using Jupyter Notebook\n",
+ "jupyter notebook --port 3001 \n",
+ "```\n",
+ "\n",
+ "To start this tutorial, run the next cell. It imports the Python packages
you need and defines variables for two datasources and the Druid host the
tutorial uses. The quickstart deployment configures Druid to listen on port
`8888` by default, so you'll make API calls against `http://localhost:8888`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b7f08a52",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In a distributed environment, use the Router service as the
`druid_host`. \n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "dataSource1 = \"kttm\"\n",
+ "dataSource2 = \"kttm_transform\"\n",
+ "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+ "metadata": {},
+ "source": [
+ "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload`
variables are updated to accomplish different tasks."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+ "metadata": {},
+ "source": [
+ "## Ingest nested data\n",
+ "\n",
+ "Run the following cell to ingest sample clickstream data from the [Koalas
to the Max](https://www.koalastothemax.com/) game."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "75a177f1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO \\\"kttm\\\" \\\n",
+ " WITH \\\"source\\\" AS \\\n",
+ " (SELECT * FROM
TABLE(EXTERN('{\\\"type\\\":\\\"http\\\",\\\"uris\\\":[\\\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\\\"]}',
\\\n",
+ "
'{\\\"type\\\":\\\"json\\\"}','[{\\\"name\\\":\\\"timestamp\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"client_ip\\\",\\\"type\\\":\\\"string\\\"},
\\\n",
+ "
{\\\"name\\\":\\\"session\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session_length\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"event\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},
\\\n",
+ "
{\\\"name\\\":\\\"agent\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"geo_ip\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"}]')))
\\\n",
+ " SELECT TIME_PARSE(\\\"timestamp\\\") AS \\\"__time\\\",
\\\"client_ip\\\", \\\"session\\\", \\\"session_length\\\", \\\"event\\\",
\\\"agent\\\", \\\"geo_ip\\\" FROM \\\"source\\\" \\\n",
+ " PARTITIONED BY DAY\"\n",
+ "})\n",
+ " \n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestion_taskId_response = response\n",
+ "ingestion_taskId =
json.loads(ingestion_taskId_response.text)['taskId']\n",
+ "\n",
+ "print(f\"\\nInserting data into the table named {dataSource1}.\")\n",
+ "print(\"\\nThe response includes the task ID and the status: \" +
response.text + \".\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "17c19a74",
+ "metadata": {},
+ "source": [
+ "Run the following cell to get the status of the ingestion task."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eeb78251",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+ "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+ "http_method = \"GET\"\n",
+ "\n",
+ "payload = {}\n",
+ "headers = {}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestion_status = json.loads(response.text)['status']['status']\n",
+ "\n",
+ "if ingestion_status == \"RUNNING\":\n",
+ " print(\"The ingestion is running...\")\n",
+ "\n",
+ "while ingestion_status != \"SUCCESS\":\n",
+ " response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ " ingestion_status = json.loads(response.text)['status']['status']\n",
+ " time.sleep(15) \n",
+ " \n",
+ "if ingestion_status == \"SUCCESS\": \n",
+ " print(\"The ingestion is complete:\")\n",
+ " print(json.dumps(response.json(), indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9a2284f3",
+ "metadata": {},
+ "source": [
+ "When the ingestion task status shows `SUCCESS`, run the following cell to
query the data and return selected columns from 3 rows. Note the nested
structure of the `event`, `agent`, and `geo_ip` columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e801999d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT session, event, agent, geo_ip FROM kttm LIMIT
3\"\n",
+ "})\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "\n",
+ "print(json.dumps(json.loads(response.text), indent=4))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "240b0ad5-48f2-4737-b12b-5fd5f98da300",
+ "metadata": {},
+ "source": [
+ "## Transform nested data\n",
+ "\n",
+ "You can use Druid's [SQL JSON
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
to transform nested data in your ingestion query.\n",
+ "\n",
+ "Run the following cell to insert sample data into a new datasource named
`kttm_transform`. The SELECT query extracts the `country` and `city` elements
from the nested `geo_ip` column and creates a composite object `sessionDetails`
containing `session` and `session_length`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a1f21fa8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = \"/druid/v2/sql/task\"\n",
+ "print(druid_host+endpoint)\n",
+ "http_method = \"POST\"\n",
+ "\n",
+ "payload = json.dumps({\n",
+ "\"query\": \"INSERT INTO \\\"kttm_transform\\\" \\\n",
+ " WITH \\\"source\\\" AS \\\n",
+ " (SELECT * FROM
TABLE(EXTERN('{\\\"type\\\":\\\"http\\\",\\\"uris\\\":[\\\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\\\"]}',
\\\n",
+ "
'{\\\"type\\\":\\\"json\\\"}','[{\\\"name\\\":\\\"timestamp\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session_length\\\",\\\"type\\\":\\\"string\\\"},
\\\n",
+ "
{\\\"name\\\":\\\"event\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"agent\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"geo_ip\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"}]')))
\\\n",
+ " SELECT TIME_PARSE(\\\"timestamp\\\") AS \\\"__time\\\", \\\n",
+ " JSON_QUERY(geo_ip, '$.country') as country, \\\n",
+ " JSON_QUERY(geo_ip, '$.city') as city, \\\n",
+ " JSON_OBJECT('session':session, 'session_length':session_length)
as sessionDetails \\\n",
+ " FROM \\\"source\\\" \\\n",
+ " PARTITIONED BY DAY\"\n",
+ "})\n",
+ "\n",
+ "headers = {'Content-Type': 'application/json'}\n",
+ "\n",
+ "response = requests.request(http_method, druid_host+endpoint,
headers=headers, data=payload)\n",
+ "ingestion_taskId_response = response\n",
+ "ingestion_taskId =
json.loads(ingestion_taskId_response.text)['taskId']\n",
+ "\n",
+ "print(f\"\\nInserting data into the table named {dataSource2}.\")\n",
+ "print(\"\\nThe response includes the task ID and the status: \" +
response.text + \".\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4a72f7a9-d7f3-4987-bce5-cb482369b1c3",
+ "metadata": {},
+ "source": [
+ "Run the following cell to get the status of the ingestion task."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fd275148-b930-4626-a433-e34a1c308f03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
Review Comment:
You don't need to import `time` a second time. The earlier cell that runs the
first sleep loop already imported it.
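
One way to avoid repeating both the import and the loop is a small helper that is defined once and called from each status cell. The sketch below is only a suggestion, not code from the PR: `wait_for_task`, `fetch_status`, `poll_seconds`, and `max_polls` are hypothetical names, and it assumes the task status string is one of `RUNNING`, `SUCCESS`, or `FAILED`. Unlike the `while` loop in the diff, it stops on `FAILED` and gives up after a fixed number of polls.

```python
import time


def wait_for_task(fetch_status, poll_seconds=15, max_polls=40, sleep=time.sleep):
    """Poll fetch_status() until the task succeeds, fails, or times out.

    fetch_status is any zero-argument callable that returns the task's
    status string (assumed here to be "RUNNING", "SUCCESS", or "FAILED").
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status == "SUCCESS":
            return status
        if status == "FAILED":
            # Stop instead of looping forever on a failed ingestion.
            raise RuntimeError("The ingestion task failed.")
        sleep(poll_seconds)
    raise TimeoutError("The ingestion task did not finish in time.")


# Hypothetical usage in the notebook, where requests, druid_host, and
# endpoint are already defined by earlier cells:
#
# wait_for_task(
#     lambda: requests.get(druid_host + endpoint).json()["status"]["status"]
# )
```

Passing the status lookup in as a callable keeps the helper independent of `requests`, so each status cell only has to supply its own endpoint.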
##########
examples/quickstart/jupyter-notebooks/nested-columns-tutorial.ipynb:
##########
@@ -57,28 +57,30 @@
"source": [
"## Prerequisites\n",
"\n",
- "You'll need install the Requests library for Python before you start. For
example:\n",
+ "You need to install the requests library for Python before you
start—for example:\n",
Review Comment:
The official name of the library uses `R`, not `r`. It's hard to notice in
their docs and repo because they keep putting it at the start of a sentence.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]