This is an automated email from the ASF dual-hosted git repository.
techdocsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 2561477e87 Jupyter nested columns tutorial (#14788)
2561477e87 is described below
commit 2561477e875e3a8161f45e291f2f961496b23f6e
Author: Jill Osborne <[email protected]>
AuthorDate: Wed Aug 16 22:45:37 2023 +0100
Jupyter nested columns tutorial (#14788)
---
.../02-working-with-nested-columns.ipynb | 426 +++++++++++++++++++++
1 file changed, 426 insertions(+)
diff --git
a/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
new file mode 100644
index 0000000000..f4538607c4
--- /dev/null
+++
b/examples/quickstart/jupyter-notebooks/notebooks/02-ingestion/02-working-with-nested-columns.ipynb
@@ -0,0 +1,426 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Working with nested columns\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "This tutorial demonstrates how to work with [nested
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) in
Apache Druid.\n",
+ "\n",
+ "Druid stores nested data structures in `COMPLEX<json>` columns. In this
tutorial you perform the following tasks:\n",
+ "\n",
+ "- Ingest nested JSON data using SQL-based ingestion.\n",
+ "- Transform nested data during ingestion using SQL JSON functions.\n",
+ "- Perform queries to display, filter, and aggregate nested data.\n",
+ "- Use helper operators to examine nested data and plan your queries.\n",
+ "\n",
+ "Druid supports directly ingesting nested data with the following formats:
JSON, Parquet, Avro, ORC, Protobuf."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Initialization](#Initialization)\n",
+ "- [Ingest nested data](#Ingest-nested-data)\n",
+ "- [Transform nested data](#Transform-nested-data)\n",
+ "- [Query nested data](#Query-nested-data)\n",
+ "- [Group, filter, and aggregate nested
data](#Group-filter-and-aggregate-nested-data)\n",
+ "- [Use helper operators](#Use-helper-operators)\n",
+ "- [Learn more](#Learn-more)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 25.0.0 or later.\n",
+ "\n",
+ "### Run with Docker\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`
profile of the Docker Compose file for Jupyter-based Druid tutorials. For more
information, see [Docker for Jupyter Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "### Run without Docker\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Apache Druid instance, with a `DRUID_HOST` local environment
variable containing the server name of your Druid router.\n",
+ "*
[druidapi](https://github.com/apache/druid/blob/master/examples/quickstart/jupyter-notebooks/druidapi/README.md),
a Python client for Apache Druid. Follow the instructions in the Install
section of the README file."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Initialization"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Run the next cell to set up the Druid Python client's connection to
Apache Druid.\n",
+ "\n",
+ "If successful, the Druid version number will be shown in the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import druidapi\n",
+ "import os\n",
+ "\n",
+ "if 'DRUID_HOST' not in os.environ.keys():\n",
+ " druid_host=f\"http://localhost:8888\"\n",
+ "else:\n",
+ " druid_host=f\"http://{os.environ['DRUID_HOST']}:8888\"\n",
+ " \n",
+ "print(f\"Opening a connection to {druid_host}.\")\n",
+ "druid = druidapi.jupyter_client(druid_host)\n",
+ "\n",
+ "display = druid.display\n",
+ "sql_client = druid.sql\n",
+ "status_client = druid.status\n",
+ "\n",
+ "status_client.version"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Ingest nested data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Run the following cell to ingest sample clickstream data from the [Koalas
to the Max](https://www.koalastothemax.com/) game."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "INSERT INTO example_koalas_nesteddata\n",
+ " WITH \"source\" AS\n",
+ " (SELECT * FROM
TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
+ "
'{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"client_ip\",\"type\":\"string\"},\n",
+ "
{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},\n",
+ "
{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
+ " SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " \"client_ip\", \n",
+ " \"session\", \n",
+ " \"session_length\", \n",
+ " \"event\", \n",
+ " \"agent\", \n",
+ " \"geo_ip\"\n",
+ " FROM \"source\"\n",
+ " PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready(\"example_koalas_nesteddata\")\n",
+ "display.table(\"example_koalas_nesteddata\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Druid reports task completion as soon as ingestion is done. However, it
takes a while for Druid to load the resulting segments.\n",
+ "\n",
+ "Wait for the table detail to display, then run the following cell to
query the data and return selected columns from 3 rows. Note the nested
structure of the `event`, `agent`, and `geo_ip` columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT session, event, agent, geo_ip \n",
+ "FROM example_koalas_nesteddata LIMIT 3\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Transform nested data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can use Druid's [SQL JSON
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
to transform nested data in your ingestion query.\n",
+ "\n",
+ "Run the following cell to insert sample data into a new datasource named
`example_koalas_nesteddata_transform`. The SELECT query extracts the `country`
and `city` elements from the nested `geo_ip` column and creates a composite
object `sessionDetails` containing `session` and `session_length`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "INSERT INTO example_koalas_nesteddata_transform\n",
+ " WITH \"source\" AS\n",
+ " (SELECT * FROM
TABLE(EXTERN('{\"type\":\"http\",\"uris\":[\"https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz\"]}',\n",
+ "
'{\"type\":\"json\"}','[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"string\"},\n",
+ "
{\"name\":\"event\",\"type\":\"COMPLEX<json>\"},{\"name\":\"agent\",\"type\":\"COMPLEX<json>\"},{\"name\":\"geo_ip\",\"type\":\"COMPLEX<json>\"}]')))\n",
+ " SELECT TIME_PARSE(\"timestamp\") AS \"__time\",\n",
+ " JSON_QUERY(geo_ip, '$.country') as country,\n",
+ " JSON_QUERY(geo_ip, '$.city') as city,\n",
+ " JSON_OBJECT('session':session, 'session_length':session_length)
as sessionDetails\n",
+ " FROM \"source\"\n",
+ " PARTITIONED BY DAY\n",
+ "'''\n",
+ "\n",
+ "sql_client.run_task(sql)\n",
+ "sql_client.wait_until_ready(\"example_koalas_nesteddata_transform\")\n",
+ "display.table(\"example_koalas_nesteddata_transform\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "When the table detail displays, run the following cell to query the data
and return `country`, `city`, and `sessionDetails` from 3 rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT country, city, sessionDetails \n",
+ "FROM example_koalas_nesteddata_transform \n",
+ "LIMIT 3\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Query nested data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Run the following cell to display the data types for columns in the
`example_koalas_nesteddata` datasource. Note that nested columns display as
`COMPLEX<json>`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE\n",
+ "FROM INFORMATION_SCHEMA.COLUMNS\n",
+ "WHERE TABLE_NAME = 'example_koalas_nesteddata'\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can use
[`JSON_VALUE`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
to extract specific elements from a `COMPLEX<json>` object.\n",
+ " \n",
+ "Run the following cell to extract `continent` from `geo_ip` and
`category` from `agent` for 3 rows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT JSON_VALUE(geo_ip, '$.continent') as continent,\n",
+ "JSON_VALUE(agent, '$.category') as category\n",
+ "FROM example_koalas_nesteddata LIMIT 3\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Group, filter, and aggregate nested data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Run the following cell to see how you can use the SELECT COUNT(DISTINCT)
operator with `JSON_VALUE`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT COUNT(DISTINCT(JSON_VALUE(geo_ip, '$.city'))) as \"Number of
cities\"\n",
+ "FROM example_koalas_nesteddata\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Run the following cell to filter and group a query using `JSON_VALUE`.
The query selects the `browser` element from the `agent` column and the
`country` and `city` elements from the `geo_ip` column, for all rows with city
`Helsinki`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT JSON_VALUE(agent, '$.browser') as browser,\n",
+ "JSON_VALUE(geo_ip, '$.country') as country,\n",
+ "JSON_VALUE(geo_ip, '$.city') as city\n",
+ "FROM example_koalas_nesteddata\n",
+ "WHERE JSON_VALUE(geo_ip, '$.city') in ('Helsinki')\n",
+ "GROUP BY 1,2,3\n",
+ "ORDER BY 1\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Use helper operators"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can use SQL helper operators such as
[`JSON_KEYS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
and
[`JSON_PATHS`](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
to examine nested data and plan your queries. Run the following cell to return
an array of field names and an array of paths for the `geo_ip` nested column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "SELECT ARRAY_CONCAT_AGG(DISTINCT JSON_KEYS(geo_ip, '$.')) as \"geo_ip
keys\",\n",
+ "ARRAY_CONCAT_AGG(DISTINCT JSON_PATHS(geo_ip)) as \"geo_ip paths\"\n",
+ "FROM example_koalas_nesteddata\n",
+ "'''\n",
+ "resp = sql_client.sql_query(sql)\n",
+ "resp.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Learn more"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This tutorial covers the basics of working with nested data. To learn
more about nested data in Druid and related Druid features, see the following
topics:\n",
+ "\n",
+ "- [Nested
columns](https://druid.apache.org/docs/latest/querying/nested-columns.html) for
information about the nested columns feature, with ingestion and query
examples. \n",
+ "- [SQL JSON
functions](https://druid.apache.org/docs/latest/querying/sql-json-functions.html)
for details on all of the functions you used in this tutorial.\n",
+ "- [SQL-based
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html)
for information on how to use Druid SQL-based ingestion."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.3"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "a4289e5b8bae5973a6609d90f7bc464162478362b9a770893a3c5c597b0b36e7"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]