[GitHub] [druid] 317brian commented on a diff in pull request #13718: Nested columns tutorial

via GitHub Tue, 31 Jan 2023 11:44:12 -0800


317brian commented on code in PR #13718:
URL: https://github.com/apache/druid/pull/13718#discussion_r1092398018



##########
examples/quickstart/jupyter-notebooks/nested-columns-tutorial.ipynb:
##########
@@ -0,0 +1,556 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Working with nested columns\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "This tutorial demonstrates how to work with [nested 
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in 
Apache Druid.\n",
+    "\n",
+    "Druid stores nested data structures in `COMPLEX<json>` columns. In this 
tutorial, you perform the following tasks:\n",
+    "\n",
+    "- Ingest nested JSON data using SQL-based ingestion.\n",
+    "- Transform nested data during ingestion using SQL JSON functions.\n",
+    "- Perform queries to display, filter, and aggregate nested data.\n",
+    "\n",
+    "Druid supports directly ingesting nested data with the following formats: 
JSON, Parquet, Avro, ORC, Protobuf.\n",
+    "\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Ingest nested data](#Ingest-nested-data)\n",
+    "- [Transform nested data](#Transform-nested-data)\n",
+    "- [Query nested data](#Query-nested-data)\n",
+    "- [Learn more](#Learn-more)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4638f98",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You need to install the requests library for Python before you 
start&mdash;for example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you need a Druid cluster. This tutorial uses the configuration 
described in the Druid [Quickstart 
(local)](https://druid.apache.org/docs/latest/tutorials/index.html). Download 
Druid from the quickstart page. In the root of the Druid folder, run the 
following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-druid\n",
+    "```\n",
+    "\n",
+    "Finally, you need either JupyterLab (recommended) or Jupyter Notebook. 
Visit the [Jupyter site](https://jupyter.org/) if you want to learn more about 
these interfaces.\n",
+    "\n",
+    "Both the quickstart Druid cluster and Jupyter deploy at `localhost:8888` 
by default, so you need to change the port for Jupyter. To do this, stop 
Jupyter if it's running and start it with the `port` parameter included. For 
example, you can use the following command to start Jupyter on port `3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using JupyterLab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter Notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages 
you need and defines variables for two datasources and the Druid host the 
tutorial uses. The quickstart deployment configures Druid to listen on port 
`8888` by default, so you'll make API calls against `http://localhost:8888`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service  as the 
`druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSource1 = \"kttm\"\n",
+    "dataSource2 = \"kttm_transform\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+   "metadata": {},
+   "source": [
+    "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload` 
variables are updated to accomplish different tasks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+   "metadata": {},
+   "source": [
+    "## Ingest nested data\n",
+    "\n",
+    "Run the following cell to ingest sample clickstream data from the [Koalas 
to the Max](https://www.koalastothemax.com/) game."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75a177f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(druid_host+endpoint)\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO \\\"kttm\\\" \\\n",
+    "    WITH \\\"source\\\" AS \\\n",
+    "    (SELECT * FROM 
TABLE(EXTERN('{\\\"type\\\":\\\"http\\\",\\\"uris\\\":[\\\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\\\"]}',
 \\\n",
+    "       
'{\\\"type\\\":\\\"json\\\"}','[{\\\"name\\\":\\\"timestamp\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"client_ip\\\",\\\"type\\\":\\\"string\\\"},
 \\\n",
+    "        
{\\\"name\\\":\\\"session\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session_length\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"event\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},
 \\\n",
+    "        
{\\\"name\\\":\\\"agent\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"geo_ip\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"}]')))
 \\\n",
+    "        SELECT TIME_PARSE(\\\"timestamp\\\") AS \\\"__time\\\", 
\\\"client_ip\\\", \\\"session\\\", \\\"session_length\\\", \\\"event\\\", 
\\\"agent\\\", \\\"geo_ip\\\"FROM \\\"source\\\" \\\n",
+    "    PARTITIONED BY DAY\"\n",
+    "})\n",
+    "    \n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "ingestion_taskId = 
json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "\n",
+    "print(f\"\\nInserting data into the table named {dataSource1}.\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + 
response.text + \".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "17c19a74",
+   "metadata": {},
+   "source": [
+    "Run the following cell to get the status of the ingestion task."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eeb78251",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "endpoint = f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "print(f\"\\033[1mQuery endpoint\\033[0m: {druid_host+endpoint}\")\n",
+    "http_method = \"GET\"\n",
+    "\n",
+    "payload = {}\n",
+    "headers = {}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_status = json.loads(response.text)['status']['status']\n",
+    "\n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "  print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status != \"SUCCESS\":\n",
+    "  response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "  ingestion_status = json.loads(response.text)['status']['status']\n",
+    "  time.sleep(15)  \n",
+    "  \n",
+    "if ingestion_status == \"SUCCESS\": \n",
+    "  print(\"The ingestion is complete:\")\n",
+    "  print(json.dumps(response.json(), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a2284f3",
+   "metadata": {},
+   "source": [
+    "When the ingestion task status shows `SUCCESS`, run the following cell to 
query the data and return selected columns from 3 rows. Note the nested 
structure of the `event`, `agent`, and `geo_ip` columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e801999d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql\"\n",
+    "print(druid_host+endpoint)\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "  \"query\": \"SELECT session, event, agent, geo_ip FROM kttm LIMIT 
3\"\n",
+    "})\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "\n",
+    "print(json.dumps(json.loads(response.text), indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "240b0ad5-48f2-4737-b12b-5fd5f98da300",
+   "metadata": {},
+   "source": [
+    "## Transform nested data\n",
+    "\n",
+    "You can use Druid's [SQL JSON 
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 to transform nested data in your ingestion query.\n",
+    "\n",
+    "Run the following cell to insert sample data into a new datasource named 
`kttm_transform`. The SELECT query extracts the `country` and `city` elements 
from the nested `geo_ip` column and creates a composite object `sessionDetails` 
containing  `session` and `session_length`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a1f21fa8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(druid_host+endpoint)\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO \\\"kttm_transform\\\" \\\n",
+    "    WITH \\\"source\\\" AS \\\n",
+    "    (SELECT * FROM 
TABLE(EXTERN('{\\\"type\\\":\\\"http\\\",\\\"uris\\\":[\\\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\\\"]}',
 \\\n",
+    "       
'{\\\"type\\\":\\\"json\\\"}','[{\\\"name\\\":\\\"timestamp\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session_length\\\",\\\"type\\\":\\\"string\\\"},
 \\\n",
+    "        
{\\\"name\\\":\\\"event\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"agent\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"geo_ip\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"}]')))
 \\\n",
+    "        SELECT TIME_PARSE(\\\"timestamp\\\") AS \\\"__time\\\", \\\n",
+    "        JSON_QUERY(geo_ip, '$.country') as country, \\\n",
+    "        JSON_QUERY(geo_ip, '$.city') as city, \\\n",
+    "        JSON_OBJECT('session':session, 'session_length':session_length) 
as sessionDetails \\\n",
+    "        FROM \\\"source\\\" \\\n",
+    "    PARTITIONED BY DAY\"\n",
+    "})\n",
+    "\n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "ingestion_taskId = 
json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "\n",
+    "print(f\"\\nInserting data into the table named {dataSource2}\")\n",
+    "print(\"\\nThe response includes the task ID and the status: \" + 
response.text + \".\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4a72f7a9-d7f3-4987-bce5-cb482369b1c3",
+   "metadata": {},
+   "source": [
+    "Run the following cell to get the status of the ingestion task."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fd275148-b930-4626-a433-e34a1c308f03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",

Review Comment:
   You don't need to import `time` a second time. The earlier cell with the 
sleep loop already imported it.
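
   If it helps, both status cells could share one helper so the import and 
the loop live in a single place. This is only a sketch: `wait_for_task` and 
`fetch_status` are hypothetical names that are not in the PR, and it assumes 
`SUCCESS` and `FAILED` are the only terminal task states. Unlike the current 
`while ingestion_status != "SUCCESS"` loop, it also stops when a task fails.

```python
import time


def wait_for_task(fetch_status, poll_seconds=15, sleep=time.sleep):
    """Poll fetch_status() until it reports a terminal task state.

    fetch_status is a hypothetical zero-argument callable that returns the
    task status string. In the notebook it could wrap the existing call:
        lambda: requests.get(druid_host + endpoint).json()['status']['status']
    """
    # Assumption: SUCCESS and FAILED are the only terminal states.
    terminal_states = ("SUCCESS", "FAILED")
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        # Wait before asking again so the status endpoint is not hammered.
        sleep(poll_seconds)
```

   Passing `sleep` as a parameter keeps the helper easy to exercise without 
waiting on a real ingestion task.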



##########
examples/quickstart/jupyter-notebooks/nested-columns-tutorial.ipynb:
##########
@@ -57,28 +57,30 @@
    "source": [
     "## Prerequisites\n",
     "\n",
-    "You'll need install the Requests library for Python before you start. For 
example:\n",
+    "You need to install the requests library for Python before you 
start&mdash;for example:\n",

Review Comment:
   The official name of the library uses `R`, not `r`. It's kind of hard to 
notice in their docs/repo because they keep putting it at the start of a 
sentence.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

