This is an automated email from the ASF dual-hosted git repository.

techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git


The following commit(s) were added to refs/heads/master by this push:
     new 2561477e87 Jupyter nested columns tutorial (#14788)
2561477e87 is described below

commit 2561477e875e3a8161f45e291f2f961496b23f6e
Author: Jill Osborne <[email protected]>
AuthorDate: Wed Aug 16 22:45:37 2023 +0100

    Jupyter nested columns tutorial (#14788)
---
 .../02-working-with-nested-columns.ipynb           | 426 +++++++++++++++++++++
 1 file changed, 426 insertions(+)

diff --git 
a/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
 
b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
new file mode 100644
index 0000000000..f4538607c4
--- /dev/null
+++ 
b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
@@ -0,0 +1,426 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Working with nested columns\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "This tutorial demonstrates how to work with [nested 
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in 
Apache Druid.\n",
+    "\n",
+    "Druid stores nested data structures in `COMPLEX<json>` columns. In this 
tutorial you perform the following tasks:\n",
+    "\n",
+    "- Ingest nested JSON data using SQL-based ingestion.\n",
+    "- Transform nested data during ingestion using SQL JSON functions.\n",
+    "- Perform queries to display, filter, and aggregate nested data.\n",
+    "- Use helper operators to examine nested data and plan your queries.\n",
+    "\n",
+    "Druid supports directly ingesting nested data with the following formats: 
JSON, Parquet, Avro, ORC, Protobuf."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Initialization](#Initialization)\n",
+    "- [Ingest nested data](#Ingest-nested-data)\n",
+    "- [Transform nested data](#Transform-nested-data)\n",
+    "- [Query nested data](#Query-nested-data)\n",
+    "- [Group, filter, and aggregate nested 
data](#Group-filter-and-aggregate-nested-data)\n",
+    "- [Use helper operators](#Use-helper-operators)\n",
+    "- [Learn more](#Learn-more)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 25.0.0 or later.\n",
+    "\n",
+    "### Run with Docker\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter` 
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more 
information, see [Docker for Jupyter Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "\n",
+    "### Run without Docker\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Apache Druid instance, with a `DRUID_HOST` local environment 
variable containing the server name of your Druid router.\n",
+    "* 
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
 a Python client for Apache Druid. Follow the instructions in the Install 
section of the README file."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Initialization"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the next cell to set up the Druid Python client's connection to 
Apache Druid.\n",
+    "\n",
+    "If successful, the Druid version number will be shown in the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import druidapi\n",
+    "import os\n",
+    "\n",
+    "if 'DRUID_HOST' not in os.environ.keys():\n",
+    "    druid_host=f\"http://localhost:8888\"\n";,
+    "else:\n",
+    "    druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+    "    \n",
+    "print(f\"Opening a connection to {druid_host}.\")\n",
+    "druid = druidapi.jupyter_client(druid_host)\n",
+    "\n",
+    "display = druid.display\n",
+    "sql_client = druid.sql\n",
+    "status_client = druid.status\n",
+    "\n",
+    "status_client.version"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Ingest nested data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the following cell to ingest sample clickstream data from the [Koalas 
to the Max](https://www.koalastothemax.com/) game."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "INSERT INTO example_koalas_nesteddata\n",
+    "    WITH \"source\" AS\n",
+    "    (SELECT * FROM 
TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
+    "       
'{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"},\n",
+    "        
{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},\n",
+    "        
{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
+    "    SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+    "    \"client_ip\", \n",
+    "    \"session\", \n",
+    "    \"session_length\", \n",
+    "    \"event\", \n",
+    "    \"agent\", \n",
+    "    \"geo_ip\"\n",
+    "    FROM \"source\"\n",
+    "    PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready(\"example_koalas_nesteddata\")\n",
+    "display.table(\"example_koalas_nesteddata\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Druid reports task completion as soon as ingestion is done. However, it 
takes a while for Druid to load the resulting segments.\n",
+    "\n",
+    "Wait for the table detail to display, then run the following cell to 
query the data and return selected columns from 3 rows. Note the nested 
structure of the `event`, `agent`, and `geo_ip` columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT session, event, agent, geo_ip \n",
+    "FROM example_koalas_nesteddata LIMIT 3\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Transform nested data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can use Druid's [SQL JSON 
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 to transform nested data in your ingestion query.\n",
+    "\n",
+    "Run the following cell to insert sample data into a new datasource named 
`example_koalas_nesteddata_transform`. The SELECT query extracts the `country` 
and `city` elements from the nested `geo_ip` column and creates a composite 
object `sessionDetails` containing  `session` and `session_length`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "INSERT INTO example_koalas_nesteddata_transform\n",
+    "    WITH \"source\" AS\n",
+    "    (SELECT * FROM 
TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
+    "       
'{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},\n",
+    "        
{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
+    "        SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+    "        JSON_QUERY(geo_ip, '$.country') as country,\n",
+    "        JSON_QUERY(geo_ip, '$.city') as city,\n",
+    "        JSON_OBJECT('session':session, 'session_length':session_length) 
as sessionDetails\n",
+    "    FROM \"source\"\n",
+    "    PARTITIONED BY DAY\n",
+    "'''\n",
+    "\n",
+    "sql_client.run_task(sql)\n",
+    "sql_client.wait_until_ready(\"example_koalas_nesteddata_transform\")\n",
+    "display.table(\"example_koalas_nesteddata_transform\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When the table detail displays, run the following cell to query the data 
and return `country`, `city`, and `sessionDetails` from 3 rows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT country, city, sessionDetails \n",
+    "FROM example_koalas_nesteddata_transform \n",
+    "LIMIT 3\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Query nested data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the following cell to display the data types for columns in the 
`example_koalas_nesteddata` datasource. Note that nested columns  display as 
`COMPLEX<json>`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE\n",
+    "FROM INFORMATION_SCHEMA.COLUMNS\n",
+    "WHERE TABLE_NAME = 'example_koalas_nesteddata'\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can use 
[`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 to extract specific elements from a `COMPLEX<json>` object.\n",
+    "    \n",
+    "Run the following cell to extract `continent` from `geo_ip` and 
`category` from `agent` for 3 rows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT JSON_VALUE(geo_ip, '$.continent') as continent,\n",
+    "JSON_VALUE(agent, '$.category') as category\n",
+    "FROM example_koalas_nesteddata LIMIT 3\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Group, filter, and aggregate nested data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the following cell to see how you can use the SELECT COUNT(DISTINCT) 
operator with `JSON_VALUE`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of 
cities\"\n",
+    "FROM example_koalas_nesteddata\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Run the following cell to filter and group a query using `JSON_VALUE`. 
The query selects the `browser` element from the `agent` column and the 
`country` and `city` elements from the `geo_ip` column, for all rows with city 
`Helsinki`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT JSON_VALUE(agent, '$.browser') as browser,\n",
+    "JSON_VALUE(geo_ip, '$.country') as country,\n",
+    "JSON_VALUE(geo_ip, '$.city') as city\n",
+    "FROM example_koalas_nesteddata\n",
+    "WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')\n",
+    "GROUP BY 1,2,3\n",
+    "ORDER BY 1\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Use helper operators"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can use SQL helper operators such as 
[`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 and 
[`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 to examine nested data and plan your queries. Run the following cell to return 
an array of field names and an array of paths for the `geo_ip` nested column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip 
keys\",\n",
+    "ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\"\n",
+    "FROM example_koalas_nesteddata\n",
+    "'''\n",
+    "resp = sql_client.sql_query(sql)\n",
+    "resp.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Learn more"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This tutorial covers the basics of working with nested data. To learn 
more about nested data in Druid and related Druid features, see the following 
topics:\n",
+    "\n",
+    "- [Nested 
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for 
information about the nested columns feature, with ingestion and query 
examples. \n",
+    "- [SQL JSON 
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
 for details on all of the functions you used in this tutorial.\n",
+    "- [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) 
for information on how to use Druid SQL-based ingestion."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.3"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to