[GitHub] [druid] 2bethere commented on a diff in pull request #14410: Releasenote notebooks 26

via GitHub Mon, 12 Jun 2023 22:16:37 -0700


2bethere commented on code in PR #14410:
URL: https://github.com/apache/druid/pull/14410#discussion_r1227540590



##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+   "metadata": {},
+   "source": [
+    "## Druid 26.0 release notebook"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "You'll need the following dependencies:\n",
+    "\n",
+    "pandas, requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# What's the current version of Druid?\n",
+    "import requests\n",
+    "\n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "session = requests.Session()\n",
+    "endpoint = druid_host + '/status'\n",
+    "response = session.get(endpoint)\n",
+    "json = response.json()\n",
+    "print(\"Running on Druid version: \"+ json[\"version\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Schema auto-discovery\n",
+    "\n",
+    "### What would happen in the past if we just load this data?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee16e5bc-7e7a-4da5-9816-99d161100522",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from IPython.display import JSON\n",
+    "ingestion_spec = {\n",
+    "  \"type\": \"index_parallel\",\n",
+    "  \"spec\": {\n",
+    "    \"ioConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"inputSource\": {\n",
+    "        \"type\": \"http\",\n",
+    "        \"uris\": [\"https://druid.apache.org/data/wikipedia.json.gz\"],\n",
+    "        \"filter\": \"*\"\n",
+    "      },\n",
+    "      \"inputFormat\": {\n",
+    "        \"type\": \"json\"\n",
+    "      }\n",
+    "    },\n",
+    "    \"tuningConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"partitionsSpec\": {\n",
+    "        \"type\": \"dynamic\"\n",
+    "      },\n",
+    "      \"indexSpec\": {\n",
+    "        \"stringDictionaryEncoding\": {\n",
+    "          \"type\": \"frontCoded\",\n",
+    "          \"bucketSize\": 16\n",
+    "        }\n",
+    "      }\n",
+    "    },\n",
+    "    \"dataSchema\": {\n",
+    "      \"dataSource\": \"wikipedia\",\n",
+    "      \"timestampSpec\": {\n",
+    "        \"missingValue\": \"2010-01-01T00:00:00Z\"\n",
+    "      },\n",
+    "      \"dimensionsSpec\": {\n",
+    "        \"dimensions\": [],\n",
+    "        \"dimensionExclusions\": [],\n",
+    "        \"spatialDimensions\": [],\n",
+    "        \"useSchemaDiscovery\": True\n",
+    "      },\n",
+    "      \"granularitySpec\": {\n",
+    "        \"queryGranularity\": \"none\",\n",
+    "        \"rollup\": False\n",
+    "      }\n",
+    "    }\n",
+    "  }\n",
+    "}\n",
+    "\n",
+    "JSON(ingestion_spec,expanded=True)\n",
+    "\n",
+    "endpoint = druid_host + '/druid/indexer/v1/task/'\n",
+    "response = session.post(endpoint,json = ingestion_spec)\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "2617af1b",
+   "metadata": {},
+   "source": [
+    "Note that because we've set `\"useSchemaDiscovery\": True` in the ingestion spec, even though we didn't specify any data types for the columns, they are correctly inferred. Look at the code example below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d3bc513-8215-4299-9bf4-135ec65cae98",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "endpoint = druid_host + '/druid/v2/sql'\n",
+    "sql = '''\n",
+    "SELECT *\n",
+    "FROM \"INFORMATION_SCHEMA\".\"COLUMNS\"\n",
+    "WHERE  \"TABLE_NAME\" = 'wikipedia'\n",
+    "'''\n",
+    "sql_request = {'query': sql}\n",
+    "json_data = session.post(endpoint, json=sql_request).json()\n",
+    "result_df = pd.json_normalize(json_data)\n",
+    "result_df.head()"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "483c67d7",
+   "metadata": {},
+   "source": [
+    "As you can see in the `DATA_TYPE` column, different data types are correctly detected, and not everything is stored as a string."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "08a3b808-e138-47c7-b7f1-e3a6c9f3bad3",
+   "metadata": {},
+   "source": [
+    "## Shuffle join\n",
+    "\n",
+    "### Make it really easy to denormalize data as part of ingestion\n",
+    "Before shuffle join support, you had to use another tool to prepare the data and then ingest it into Druid. With shuffle join support, you can do the same transformation with one query."

Review Comment:
   OK, added a section.
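
For illustration only, here is a minimal sketch of what a shuffle join cell could look like. It assumes the SQL-based ingestion task endpoint `/druid/v2/sql/task` and the `sqlJoinAlgorithm` query context parameter set to `sortMerge`. The datasource and column names (`orders`, `customers`, `orders_enriched`, and so on) and both helper functions are hypothetical and do not come from the notebook.

```python
import json

# Hypothetical datasources and columns, used only for illustration.
JOIN_SQL = '''
REPLACE INTO "orders_enriched" OVERWRITE ALL
SELECT
  o."__time",
  o."order_id",
  c."customer_name"
FROM "orders" AS o
JOIN "customers" AS c ON o."customer_id" = c."customer_id"
PARTITIONED BY DAY
'''


def build_shuffle_join_request(sql):
    """Build the JSON body for a SQL ingestion task that asks for a sort-merge (shuffle) join."""
    return {
        "query": sql,
        # Assumption: this context key selects the join algorithm for the task.
        "context": {"sqlJoinAlgorithm": "sortMerge"},
    }


def submit(session, druid_host, payload):
    """POST the payload to the task endpoint; `session` is expected to be a requests.Session."""
    return session.post(druid_host + "/druid/v2/sql/task", json=payload)


payload = build_shuffle_join_request(JOIN_SQL)
print(json.dumps(payload["context"]))
```

Building the payload separately from the network call makes it possible to inspect the request without a running Druid cluster; `submit(session, druid_host, payload)` would reuse the `session` and `druid_host` defined in the first notebook cell.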



##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+   "metadata": {},
+   "source": [
+    "## Druid 26.0 release notebook"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "You'll need the following dependencies:\n",
+    "\n",
+    "pandas, requests"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# What's the current version of Druid?\n",
+    "import requests\n",
+    "\n",
+    "druid_host = \"http://localhost:8888\"\n",
+    "session = requests.Session()\n",
+    "endpoint = druid_host + '/status'\n",
+    "response = session.get(endpoint)\n",
+    "json = response.json()\n",
+    "print(\"Running on Druid version: \"+ json[\"version\"])"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Schema auto-discovery\n",
+    "\n",
+    "### What would happen in the past if we just load this data?\n"

Review Comment:
   Awesome suggestion, adding.
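
One possible way to illustrate the "in the past" behavior is to ingest the same data with `useSchemaDiscovery` turned off and compare the `DATA_TYPE` values reported by `INFORMATION_SCHEMA.COLUMNS`. The sketch below is an assumption about how that comparison could be set up, not the notebook's actual content: both helper names are hypothetical, and the sample rows are made up to show the shape of the summary rather than real query output.

```python
import copy
from collections import Counter


def with_schema_discovery(dimensions_spec, enabled):
    """Return a copy of a dimensionsSpec with useSchemaDiscovery set to `enabled`."""
    spec = copy.deepcopy(dimensions_spec)
    spec["useSchemaDiscovery"] = enabled
    return spec


def count_data_types(rows):
    """Count DATA_TYPE values in rows shaped like INFORMATION_SCHEMA.COLUMNS results."""
    return dict(Counter(row["DATA_TYPE"] for row in rows))


base = {"dimensions": [], "dimensionExclusions": [], "spatialDimensions": []}
legacy_spec = with_schema_discovery(base, False)      # spec for the "past" behavior
discovery_spec = with_schema_discovery(base, True)    # spec used in the notebook

# Made-up rows, only to demonstrate the summary helper.
sample_rows = [
    {"COLUMN_NAME": "channel", "DATA_TYPE": "VARCHAR"},
    {"COLUMN_NAME": "page", "DATA_TYPE": "VARCHAR"},
    {"COLUMN_NAME": "added", "DATA_TYPE": "BIGINT"},
]
print(count_data_types(sample_rows))
```

Running `count_data_types` on the JSON returned by the notebook's `INFORMATION_SCHEMA` query, once per datasource, would give a compact side-by-side view of the inferred types.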


