[GitHub] [druid] writer-jill commented on a diff in pull request #13718: Nested columns tutorial

via GitHub Mon, 30 Jan 2023 04:01:41 -0800


writer-jill commented on code in PR #13718:
URL: https://github.com/apache/druid/pull/13718#discussion_r1090534459



##########
examples/quickstart/jupyter-notebooks/nested-columns-tutorial.ipynb:
##########
@@ -0,0 +1,517 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ad4e60b6",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "# Working with nested columns\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "This tutorial demonstrates how to work with [nested 
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in 
Apache Druid.\n",
+    "\n",
+    "Druid stores nested data structures in `COMPLEX<json>` columns. In this 
tutorial you perform the following tasks:\n",
+    "\n",
+    "- Ingest nested JSON data using SQL-based ingestion.\n",
+    "- Transform nested data during ingestion using SQL JSON functions.\n",
+    "- Perform queries to display, filter, and aggregate nested data.\n",
+    "\n",
+    "Druid supports directly ingesting nested data with the following formats: 
JSON, Parquet, Avro, ORC, Protobuf.\n",
+    "\n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Ingest nested data](#Ingest-nested-data)\n",
+    "- [Transform data](#Transform-data)\n",
+    "- [Query nested data](#Query-nested-data)\n",
+    "- [Learn more](#Learn-more)\n",
+    "\n",
+    "For the best experience, use Jupyter Lab so that you can always access 
the table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e4638f98",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "You'll need to install the Requests library for Python before you start. For 
example:\n",
+    "\n",
+    "```bash\n",
+    "pip3 install requests\n",
+    "```\n",
+    "\n",
+    "Next, you'll need a Druid cluster. This tutorial uses the 
`micro-quickstart` config described in the [Druid 
quickstart](https://druid.apache.org/docs/latest/tutorials/index.html). So 
download that and start it if you haven't already. In the root of the Druid 
folder, run the following command to start Druid:\n",
+    "\n",
+    "```bash\n",
+    "./bin/start-druid\n",
+    "```\n",
+    "\n",
+    "Finally, you'll need either Jupyter lab (recommended) or Jupyter 
notebook. Both the quickstart Druid cluster and Jupyter notebook are deployed 
at `localhost:8888` by default, so you'll need to change the port for Jupyter. 
To do so, stop Jupyter and start it again with the `port` parameter included. 
For example, you can use the following command to start Jupyter on port 
`3001`:\n",
+    "\n",
+    "```bash\n",
+    "# If you're using Jupyter lab\n",
+    "jupyter lab --port 3001\n",
+    "# If you're using Jupyter notebook\n",
+    "jupyter notebook --port 3001 \n",
+    "```\n",
+    "\n",
+    "To start this tutorial, run the next cell. It imports the Python packages 
you'll need and defines a variable for the Druid host the tutorial uses. 
The quickstart deployment configures Druid to listen on port `8888` by default, 
so you'll be making API calls against `http://localhost:8888`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7f08a52",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In a distributed environment, use the Router service as the 
`druid_host`. \n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "dataSource1 = \"kttm_one\"\n",
+    "dataSource2 = \"kttm_transform\"\n",
+    "print(f\"\\033[1mDruid host\\033[0m: {druid_host}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e893ef7d-7136-442f-8bd9-31b5a5276518",
+   "metadata": {},
+   "source": [
+    "In the rest of the tutorial, the `endpoint`, `http_method`, and `payload` 
variables are updated to accomplish different tasks."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ebd8c7db-c39f-4ef7-86ec-81f405e02550",
+   "metadata": {},
+   "source": [
+    "## Ingest nested data\n",
+    "\n",
+    "Run the following cell to ingest sample clickstream data from the [Koalas 
to the Max](https://www.koalastothemax.com/) game."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75a177f1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = \"/druid/v2/sql/task\"\n",
+    "print(druid_host+endpoint)\n",
+    "http_method = \"POST\"\n",
+    "\n",
+    "payload = json.dumps({\n",
+    "\"query\": \"INSERT INTO \\\"kttm_one\\\" \\\n",
+    "    WITH \\\"source\\\" AS \\\n",
+    "    (SELECT * FROM 
TABLE(EXTERN('{\\\"type\\\":\\\"http\\\",\\\"uris\\\":[\\\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\\\"]}',
 \\\n",
+    "       
'{\\\"type\\\":\\\"json\\\"}','[{\\\"name\\\":\\\"timestamp\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"client_ip\\\",\\\"type\\\":\\\"string\\\"},
 \\\n",
+    "        
{\\\"name\\\":\\\"session\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"session_length\\\",\\\"type\\\":\\\"string\\\"},{\\\"name\\\":\\\"event\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},
 \\\n",
+    "        
{\\\"name\\\":\\\"agent\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"},{\\\"name\\\":\\\"geo_ip\\\",\\\"type\\\":\\\"COMPLEX<json>\\\"}]')))
 \\\n",
+    "        SELECT TIME_PARSE(\\\"timestamp\\\") AS \\\"__time\\\", 
\\\"client_ip\\\", \\\"session\\\", \\\"session_length\\\", \\\"event\\\", 
\\\"agent\\\", \\\"geo_ip\\\" FROM \\\"source\\\" \\\n",
+    "    PARTITIONED BY DAY\"\n",
+    "})\n",
+    "    \n",
+    "headers = {'Content-Type': 'application/json'}\n",
+    "\n",
+    "response = requests.request(http_method, druid_host+endpoint, 
headers=headers, data=payload)\n",
+    "ingestion_taskId_response = response\n",
+    "ingestion_taskId = 
json.loads(ingestion_taskId_response.text)['taskId']\n",
+    "\n",
+    "print(f\"\\nInserting data into the table named {dataSource1}\")\n",

Review Comment:
   Updated.
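
   For reference, the hand-escaped `EXTERN` string in this cell is easy to get wrong. The following is only a sketch of one possible alternative, not what the notebook does: it builds the three `EXTERN` arguments with `json.dumps` and assembles the SQL from them. The helper name `build_ingest_payload` and its parameters are hypothetical, and the sketch assumes that none of the values contain a single quote.

```python
import json


def build_ingest_payload(datasource, uri, columns):
    """Hypothetical helper: build the JSON body for POST /druid/v2/sql/task.

    Assumes `columns` includes a string column named "timestamp" and that
    no value contains a single quote (which would end the SQL literal).
    """
    # Serialize each EXTERN argument separately instead of escaping by hand.
    input_source = json.dumps({"type": "http", "uris": [uri]})
    input_format = json.dumps({"type": "json"})
    signature = json.dumps(columns)

    # Every column except "timestamp" is selected as-is.
    select_list = ", ".join(
        f'"{col["name"]}"' for col in columns if col["name"] != "timestamp"
    )

    query = (
        f'INSERT INTO "{datasource}" '
        'WITH "source" AS (SELECT * FROM TABLE(EXTERN('
        f"'{input_source}', '{input_format}', '{signature}'"
        "))) "
        f'SELECT TIME_PARSE("timestamp") AS "__time", {select_list} '
        'FROM "source" PARTITIONED BY DAY'
    )
    return json.dumps({"query": query})


payload = build_ingest_payload(
    "kttm_one",
    "https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz",
    [
        {"name": "timestamp", "type": "string"},
        {"name": "event", "type": "COMPLEX<json>"},
    ],
)
print(json.loads(payload)["query"])
```

   The returned string can be passed as `data=payload` to the same `requests.request` call the cell already makes.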


