vtlim commented on code in PR #14410:
URL: https://github.com/apache/druid/pull/14410#discussion_r1232763740
##########
examples/quickstart/releases/Druid26.ipynb:
##########
@@ -0,0 +1,366 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "e4a4ffd8-8aa5-4b6e-b60a-f4ef14049c46",
+ "metadata": {},
+ "source": [
+ "## Druid 26.0 release notebook"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "3a008975-3100-417b-8ddc-623857d5ad6a",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "You'll need the following Python dependencies:\n",
+ "\n",
+ "`pandas`, `requests`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18cc6a82-0167-423c-b14d-01c36ac2733d",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "# What's the current version of Druid?\n",
+ "import requests\n",
+ "\n",
+ "druid_host = \"http://localhost:8888\"\n",
+ "session = requests.Session()\n",
+ "endpoint = druid_host + '/status'\n",
+ "response = session.get(endpoint)\n",
+ "json = response.json()\n",
+ "print(\"Running on Druid version: \" + json[\"version\"])"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "c39b6caf-e08a-41c0-9021-12ee270023c1",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "## Schema auto-discovery\n",
+ "\n",
+ "### What would have happened in the past if we just loaded this data?\n",
+ "\n",
+ "Druid previously supported [string-based schema auto-discovery](https://druid.apache.org/docs/latest/ingestion/schema-design.html#string-based-schema-discovery), but with some limitations. Specifically, all newly discovered columns are stored as string types. This means aggregation queries on numerical columns can be slow (since the values need to be parsed as numbers first), and some fields, such as multi-value dimensions with null values, can misbehave.\n",
+ "\n",
+ "With the introduction of [type-aware schema auto-discovery](https://druid.apache.org/docs/latest/ingestion/schema-design.html#type-aware-schema-discovery), Druid now properly infers data types. In the example below, you can see columns being automatically inferred as long, float, string, and other types."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ee16e5bc-7e7a-4da5-9816-99d161100522",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "from IPython.display import JSON\n",
+ "ingestion_spec = {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"spec\": {\n",
+ " \"ioConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"inputSource\": {\n",
+ " \"type\": \"http\",\n",
+ " \"uris\":
[\"https://druid.apache.org/data/wikipedia.json.gz\"],\n",
+ " \"filter\": \"*\"\n",
+ " },\n",
+ " \"inputFormat\": {\n",
+ " \"type\": \"json\"\n",
+ " }\n",
+ " },\n",
+ " \"tuningConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"partitionsSpec\": {\n",
+ " \"type\": \"dynamic\"\n",
+ " },\n",
+ " \"indexSpec\": {\n",
+ " \"stringDictionaryEncoding\": {\n",
+ " \"type\": \"frontCoded\",\n",
+ " \"bucketSize\": 16\n",
+ " }\n",
+ " }\n",
+ " },\n",
+ " \"dataSchema\": {\n",
+ " \"dataSource\": \"wikipedia\",\n",
+ " \"timestampSpec\": {\n",
+ " \"missingValue\": \"2010-01-01T00:00:00Z\"\n",
+ " },\n",
+ " \"dimensionsSpec\": {\n",
+ " \"dimensions\": [],\n",
+ " \"dimensionExclusions\": [],\n",
+ " \"spatialDimensions\": [],\n",
+ " \"useSchemaDiscovery\": True\n",
+ " },\n",
+ " \"granularitySpec\": {\n",
+ " \"queryGranularity\": \"none\",\n",
+ " \"rollup\": False\n",
+ " }\n",
+ " }\n",
+ " }\n",
+ "}\n",
+ "\n",
+ "JSON(ingestion_spec,expanded=True)\n",
+ "\n",
+ "endpoint = druid_host + '/druid/indexer/v1/task/'\n",
+ "response = session.post(endpoint,json = ingestion_spec)\n"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "2617af1b",
+ "metadata": {},
+ "source": [
+ "Because we set `\"useSchemaDiscovery\": True` in the ingestion spec, the column types are correctly inferred even though we didn't specify any. The query below confirms this:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7d3bc513-8215-4299-9bf4-135ec65cae98",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "sql = '''\n",
+ "SELECT *\n",
+ "FROM \"INFORMATION_SCHEMA\".\"COLUMNS\"\n",
+ "WHERE \"TABLE_NAME\" = 'wikipedia'\n",
+ "'''\n",
+ "sql_request = {'query': sql}\n",
+ "json_data = session.post(endpoint, json=sql_request).json()\n",
+ "result_df = pd.json_normalize(json_data)\n",
+ "result_df.head()"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "483c67d7",
+ "metadata": {},
+ "source": [
+ "As you can see in the `DATA_TYPE` column, the different data types are correctly detected rather than everything being stored as strings."
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "08a3b808-e138-47c7-b7f1-e3a6c9f3bad3",
+ "metadata": {},
+ "source": [
+ "## Shuffle join\n",
+ "\n",
+ "### Make it really easy to denormalize data as part of ingestion\n",
+ "Before shuffle join support, you needed to use another tool to prepare the data and then ingest it into Druid. With shuffle joins, you can do the same transformation in a single query.\n",
+ "For example, in the query below, we do a self-join on the wikipedia dataset. You can just as easily run the same kind of query on a typical star-schema dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0dc81a51-0160-4cd6-bd97-6abf60a6e7d6",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "query = '''\n",
+ "REPLACE INTO \"wikipedia\" OVERWRITE ALL\n",
+ "WITH \"wikipedia_main\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ ",\n",
+ "\"wikipedia_dim\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"http\",\"uris\":[\"https://druid.apache.org/data/wikipedia.json.gz\"]}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"isRobot\" VARCHAR, \"channel\" VARCHAR, \"timestamp\"
VARCHAR, \"flags\" VARCHAR, \"isUnpatrolled\" VARCHAR, \"page\" VARCHAR,
\"diffUrl\" VARCHAR, \"added\" BIGINT, \"comment\" VARCHAR, \"commentLength\"
BIGINT, \"isNew\" VARCHAR, \"isMinor\" VARCHAR, \"delta\" BIGINT,
\"isAnonymous\" VARCHAR, \"user\" VARCHAR, \"deltaBucket\" BIGINT, \"deleted\"
BIGINT, \"namespace\" VARCHAR, \"cityName\" VARCHAR, \"countryName\" VARCHAR,
\"regionIsoCode\" VARCHAR, \"metroCode\" BIGINT, \"countryIsoCode\" VARCHAR,
\"regionName\" VARCHAR))\n",
+ "\n",
+ "\n",
+ "SELECT\n",
+ " TIME_PARSE(\"wikipedia_main\".\"timestamp\") AS \"__time\",\n",
+ " \"wikipedia_main\".*\n",
+ "FROM \"wikipedia_main\"\n",
+ "LEFT JOIN \"wikipedia_dim\" \n",
+ "ON \"wikipedia_main\".\"channel\" = \"wikipedia_dim\".\"channel\"\n",
+ "WHERE \"wikipedia_main\".\"timestamp\" >='2016-01-01' AND
\"wikipedia_main\".\"timestamp\" <'2016-02-01'\n",
+ "PARTITIONED BY MONTH\n",
+ "'''"
+ ]
+ },
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "id": "e10df053-2729-4e2c-ac4a-3c8d0c070dc0",
+ "metadata": {},
+ "source": [
+ "### Let's watch the ingestion task running...\n",
+ "This may take a while. You can check the status of the ingestion task in the [Overlord console](http://localhost:8090/console.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9d302e43-9f14-4d19-b286-7a3cbc448470",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "sql_request={'query': query}\n",
+ "endpoint = druid_host + '/druid/v2/sql/task'\n",
+ "response = session.post(endpoint, json=sql_request)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eadf05f7-bc0a-4a29-981d-d8bc5fd72314",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "\n",
+ "ingestion_taskId = response.json()['taskId']\n",
+ "endpoint = druid_host + f\"/druid/indexer/v1/task/{ingestion_taskId}/status\"\n",
+ "\n",
+ "json = session.get(endpoint).json()\n",
+ "ingestion_status = json['status']['status']\n",
+ "\n",
+ "if ingestion_status == \"RUNNING\":\n",
+ "    print(\"The ingestion is running...\")\n",
+ "\n",
+ "# Poll until the task leaves the RUNNING state (SUCCESS or FAILED)\n",
+ "while ingestion_status == \"RUNNING\":\n",
+ "    time.sleep(1)\n",
+ "    json = session.get(endpoint).json()\n",
+ "    ingestion_status = json['status']['status']\n",
+ "    print('.', end='')\n",
+ "\n",
+ "if ingestion_status == \"SUCCESS\":\n",
+ "    print(\"\\nThe ingestion is complete\")\n",
+ "else:\n",
+ "    print(\"\\nThe ingestion task failed:\", json)\n"
+ ]
Review Comment:
   Maybe we need to change the `while` condition to something like `while ingestion_status == "RUNNING"` so that the loop also exits if the task fails.
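
   As a concrete illustration of the suggestion, here is a minimal sketch of a polling loop that exits on any terminal state rather than spinning forever on a failed task. The `wait_for_ingestion` helper and its `get_status` callback are hypothetical names for this sketch, not part of the notebook or the Druid API:

   ```python
   import time

   # Hypothetical helper illustrating the suggested fix: keep polling while
   # the task reports RUNNING, and stop on any terminal state, whether it
   # is SUCCESS or FAILED.
   def wait_for_ingestion(get_status, poll_seconds=1):
       """get_status() returns the current task status string, e.g. 'RUNNING'."""
       status = get_status()
       while status == "RUNNING":
           time.sleep(poll_seconds)
           status = get_status()
       return status
   ```

   In the notebook, `get_status` would be something like `lambda: session.get(endpoint).json()['status']['status']`, and the caller would then branch on the returned status to print success or failure.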
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]