vtlim commented on code in PR #14410:
URL: https://github.com/apache/druid/pull/14410#discussion_r1227283368


##########
examples/quickstart/releases/README.md:
##########
@@ -0,0 +1,27 @@
+# Jupyter Notebook tutorials for Druid
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+This directory contains notebook-based release notes that contain examples on how to use new features.
+Notebooks in this directory are meant to be run against quick-start clusters, but you can adapt them to run against live production clusters.

Review Comment:
   ```suggestion
   Notebooks in this directory are meant to be run against quickstart clusters, but you can adapt them to run against live production clusters.
   ```



##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+   "metadata": {},
+   "source": [
+    "## Druid 26.0 release notebook"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "You'll need the following dependencies:\n",
+    "\n",
+    "pandas, requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# What's the current version of Druid?\n",
+    "import requests\n",
+    "\n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "session = requests.Session()\n",
+    "endpoint = druid_host + '/status'\n",
+    "response = session.get(endpoint)\n",
+    "json = response.json()\n",
+    "print(\"Running on Druid version: \"+ json[\"version\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Schema auto-discovery\n",
+    "\n",
+    "### What would happen in the past if we just load this data?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee16e5bc-7e7a-4da5-9816-99d161100522",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from IPython.display import JSON\n",
+    "ingestion_spec = {\n",
+    "  \"type\": \"index_parallel\",\n",
+    "  \"spec\": {\n",
+    "    \"ioConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"inputSource\": {\n",
+    "        \"type\": \"http\",\n",
+    "        \"uris\": 
[\"https://druid.apache.org/data/wikipedia.json.gz\"],\n";,
+    "        \"filter\": \"*\"\n",
+    "      },\n",
+    "      \"inputFormat\": {\n",
+    "        \"type\": \"json\"\n",
+    "      }\n",
+    "    },\n",
+    "    \"tuningConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"partitionsSpec\": {\n",
+    "        \"type\": \"dynamic\"\n",
+    "      },\n",
+    "      \"indexSpec\": {\n",
+    "        \"stringDictionaryEncoding\": {\n",
+    "          \"type\": \"frontCoded\",\n",
+    "          \"bucketSize\": 16\n",
+    "        }\n",
+    "      }\n",
+    "    },\n",
+    "    \"dataSchema\": {\n",
+    "      \"dataSource\": \"wikipedia\",\n",
+    "      \"timestampSpec\": {\n",
+    "        \"missingValue\": \"2010-01-01T00:00:00Z\"\n",
+    "      },\n",
+    "      \"dimensionsSpec\": {\n",
+    "        \"dimensions\": [],\n",
+    "        \"dimensionExclusions\": [],\n",
+    "        \"spatialDimensions\": [],\n",
+    "        \"useSchemaDiscovery\": True\n",
+    "      },\n",
+    "      \"granularitySpec\": {\n",
+    "        \"queryGranularity\": \"none\",\n",
+    "        \"rollup\": False\n",
+    "      }\n",
+    "    }\n",
+    "  }\n",
+    "}\n",
+    "\n",
+    "JSON(ingestion_spec,expanded=True)\n",
+    "\n",
+    "endpoint = druid_host + '/druid/indexer/v1/task/'\n",
+    "response = session.post(endpoint,json = ingestion_spec)\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2617af1b",
+   "metadata": {},
+   "source": [
+    "Note that because we've set `\"useSchemaDiscovery\": True` in the 
ingestion spec, even though we didn't specify any data types for the columns, 
they are correctly inferred. Look at the code example below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d3bc513-8215-4299-9bf4-135ec65cae98",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "endpoint = druid_host + '/druid/v2/sql'\n",
+    "sql = '''\n",
+    "SELECT *\n",
+    "FROM \"INFORMATION_SCHEMA\".\"COLUMNS\"\n",
+    "WHERE  \"TABLE_NAME\" = 'wikipedia'\n",
+    "'''\n",
+    "sql_request = {'query': sql}\n",
+    "json_data = session.post(endpoint, json=sql_request).json()\n",
+    "result_df = pd.json_normalize(json_data)\n",
+    "result_df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "483c67d7",
+   "metadata": {},
+   "source": [
+    "As you can see, in `DATA_TYPE` column, different data types are correctly 
detected and not everything are stored as `strings`."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "08a3b808-e138-47c7-b7f1-e3a6c9f3bad3",
+   "metadata": {},
+   "source": [
+    "# Shuffle join\n",
+    "\n",
+    "### Make it really easy to denormalize data as part of ingestion\n",
+    "Before the support of shuffle join, you'll need to use another tool to 
prepare the data then ingest into Druid. With shuffle join support, you can do 
the same transformation with one query."

Review Comment:
   Could also add here what the user is about to submit. For example, "The following query ingests data ... (and what kind of denormalization is being applied)"



##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+   "metadata": {},
+   "source": [
+    "## Druid 26.0 release notebook"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "You'll need the following dependencies:\n",
+    "\n",
+    "pandas, requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# What's the current version of Druid?\n",
+    "import requests\n",
+    "\n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "session = requests.Session()\n",
+    "endpoint = druid_host + '/status'\n",
+    "response = session.get(endpoint)\n",
+    "json = response.json()\n",
+    "print(\"Running on Druid version: \"+ json[\"version\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Schema auto-discovery\n",
+    "\n",
+    "### What would happen in the past if we just load this data?\n"

Review Comment:
   It could be helpful to add more explanation before and after the cells. For example, to explain what happened before schema auto-discovery was introduced, and also explain to the user what they're about to do (submit an ingestion job to ingest from wikipedia data). Could also link to the docs.



##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+   "metadata": {},
+   "source": [
+    "## Druid 26.0 release notebook"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "You'll need the following dependencies:\n",
+    "\n",
+    "pandas, requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# What's the current version of Druid?\n",
+    "import requests\n",
+    "\n",
+    "druid_host = \"http://localhost:8888\"\n";,
+    "session = requests.Session()\n",
+    "endpoint = druid_host + '/status'\n",
+    "response = session.get(endpoint)\n",
+    "json = response.json()\n",
+    "print(\"Running on Druid version: \"+ json[\"version\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Schema auto-discovery\n",
+    "\n",
+    "### What would happen in the past if we just load this data?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee16e5bc-7e7a-4da5-9816-99d161100522",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from IPython.display import JSON\n",
+    "ingestion_spec = {\n",
+    "  \"type\": \"index_parallel\",\n",
+    "  \"spec\": {\n",
+    "    \"ioConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"inputSource\": {\n",
+    "        \"type\": \"http\",\n",
+    "        \"uris\": 
[\"https://druid.apache.org/data/wikipedia.json.gz\"],\n";,
+    "        \"filter\": \"*\"\n",
+    "      },\n",
+    "      \"inputFormat\": {\n",
+    "        \"type\": \"json\"\n",
+    "      }\n",
+    "    },\n",
+    "    \"tuningConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"partitionsSpec\": {\n",
+    "        \"type\": \"dynamic\"\n",
+    "      },\n",
+    "      \"indexSpec\": {\n",
+    "        \"stringDictionaryEncoding\": {\n",
+    "          \"type\": \"frontCoded\",\n",
+    "          \"bucketSize\": 16\n",
+    "        }\n",
+    "      }\n",
+    "    },\n",
+    "    \"dataSchema\": {\n",
+    "      \"dataSource\": \"wikipedia\",\n",
+    "      \"timestampSpec\": {\n",
+    "        \"missingValue\": \"2010-01-01T00:00:00Z\"\n",
+    "      },\n",
+    "      \"dimensionsSpec\": {\n",
+    "        \"dimensions\": [],\n",
+    "        \"dimensionExclusions\": [],\n",
+    "        \"spatialDimensions\": [],\n",
+    "        \"useSchemaDiscovery\": True\n",
+    "      },\n",
+    "      \"granularitySpec\": {\n",
+    "        \"queryGranularity\": \"none\",\n",
+    "        \"rollup\": False\n",
+    "      }\n",
+    "    }\n",
+    "  }\n",
+    "}\n",
+    "\n",
+    "JSON(ingestion_spec,expanded=True)\n",
+    "\n",
+    "endpoint = druid_host + '/druid/indexer/v1/task/'\n",
+    "response = session.post(endpoint,json = ingestion_spec)\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2617af1b",
+   "metadata": {},
+   "source": [
+    "Note that because we've set `\"useSchemaDiscovery\": True` in the 
ingestion spec, even though we didn't specify any data types for the columns, 
they are correctly inferred. Look at the code example below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d3bc513-8215-4299-9bf4-135ec65cae98",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "endpoint = druid_host + '/druid/v2/sql'\n",
+    "sql = '''\n",
+    "SELECT *\n",
+    "FROM \"INFORMATION_SCHEMA\".\"COLUMNS\"\n",
+    "WHERE  \"TABLE_NAME\" = 'wikipedia'\n",
+    "'''\n",
+    "sql_request = {'query': sql}\n",
+    "json_data = session.post(endpoint, json=sql_request).json()\n",
+    "result_df = pd.json_normalize(json_data)\n",
+    "result_df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "483c67d7",
+   "metadata": {},
+   "source": [
+    "As you can see, in `DATA_TYPE` column, different data types are correctly 
detected and not everything are stored as `strings`."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "08a3b808-e138-47c7-b7f1-e3a6c9f3bad3",
+   "metadata": {},
+   "source": [
+    "# Shuffle join\n",
+    "\n",
+    "### Make it really easy to denormalize data as part of ingestion\n",
+    "Before the support of shuffle join, you'll need to use another tool to 
prepare the data then ingest into Druid. With shuffle join support, you can do 
the same transformation with one query."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0dc81a51-0160-4cd6-bd97-6abf60a6e7d6",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "query = '''\n",
+    "REPLACE INTO \"wikipedia\" OVERWRITE ALL\n",
+    "WITH \"wikipedia_main\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\" 
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR, 
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\" 
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT, 
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\" 
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR, 
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR, 
\"regionName\" VARCHAR))\n",
+    ",\n",
+    "\"wikipedia_dim\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\" 
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR, 
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\" 
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT, 
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\" 
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR, 
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR, 
\"regionName\" VARCHAR))\n",
+    "\n",
+    "\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"wikipedia_main\".\"timestamp\") AS \"__time\",\n",
+    "  \"wikipedia_main\".*\n",
+    "FROM \"wikipedia_main\"\n",
+    "LEFT JOIN \"wikipedia_dim\" \n",
+    "ON \"wikipedia_main\".\"channel\" = \"wikipedia_dim\".\"channel\"\n",
+    "WHERE \"wikipedia_main\".\"timestamp\" >='2016-01-01'\n",
+    "PARTITIONED BY MONTH\n",
+    "'''"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e10df053-2729-4e2c-ac4a-3c8d0c070dc0",
+   "metadata": {},
+   "source": [
+    "### Let's watch the ingestion task running..."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9d302e43-9f14-4d19-b286-7a3cbc448470",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "sql_request={'query': query}\n",
+    "endpoint = druid_host + '/druid/v2/sql/task'\n",
+    "response = session.post(endpoint, json=sql_request)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 36,
+   "id": "eadf05f7-bc0a-4a29-981d-d8bc5fd72314",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "..............................."
+     ]
+    }
+   ],
+   "source": [
+    "ingestion_taskId = response.json()['taskId']\n",
+    "endpoint = druid_host + 
f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+    "import time\n",
+    "\n",
+    "json = session.get(endpoint).json()\n",
+    "ingestion_status = json['status']['status']\n",
+    " \n",
+    "if ingestion_status == \"RUNNING\":\n",
+    "    print(\"The ingestion is running...\")\n",
+    "\n",
+    "while ingestion_status != \"SUCCESS\":\n",
+    "    time.sleep(1)\n",
+    "    json = session.get(endpoint).json()\n",
+    "    ingestion_status = json['status']['status']\n",
+    "    print('.', end='')\n",
+    "\n",
+    "if ingestion_status == \"SUCCESS\": \n",
+    "    print(\"\\nThe ingestion is complete\")\n",
+    "else:\n",
+    "    print(\"\\nThe ingestion task failed:\", json)\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "10417469-b2f7-4a56-bd4f-fddc0277c3c9",
+   "metadata": {},
+   "source": [
+    "### Note I didn't use any other tools, this is all done within Druid. No 
need for using Spark/Presto for data prep"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "7b134ef2-e3ef-4345-94c8-64cf36f6adfe",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## UNNEST and Arrays\n",
+    "\n",
+    "UNNEST is useful to deal with Array data and allows you to \"explode\" an 
array into individual rows"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "434602dd-d62b-476f-b18f-4a3fa23ff70e",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [

Review Comment:
   Could be helpful to explain the results a user should expect. This also means someone who's perusing but not running the notebook can also follow along with the demo of release highlights.
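For readers perusing rather than running the notebook, the expected shape of UNNEST output is easy to sketch in plain Python (hypothetical data; this mimics what UNNEST does, it is not Druid code):

```python
# Conceptual illustration of UNNEST: each array element becomes its own
# row, paired with the remaining columns of the original row.
rows = [
    {"page": "Druid", "tags": ["database", "analytics"]},  # hypothetical rows
    {"page": "Wiki", "tags": ["reference"]},
]

# 2 input rows with 2 + 1 array elements "explode" into 3 output rows.
unnested = [
    {"page": r["page"], "tag": tag}
    for r in rows
    for tag in r["tags"]
]
for row in unnested:
    print(row)
```

So a user running the notebook's UNNEST cell should expect one output row per array element, not per input row.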



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

