vtlim commented on code in PR #14781:
URL: https://github.com/apache/druid/pull/14781#discussion_r1291820344


##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,975 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "71bdcc40",
+   "metadata": {},
+   "source": [
+    "# Learn to delete data with Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n",
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "In working with data, Druid retains copies of the existing data segments in deep storage and on Historical processes. As new data is added to Druid, deep storage grows over time unless data is explicitly removed.\n",
+    "\n",
+    "While deep storage is an important part of Druid's elastic, 
fault-tolerant design, data accumulation over time in deep storage can lead to 
increased storage costs. Periodically deleting data can reclaim storage space 
and promote optimal resource allocation.\n",
+    "\n",
+    "This notebook provides a tutorial on deleting existing data in Druid 
using the Coordinator API endpoints. \n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Deletion steps](#Deletion-steps)\n",
+    "- [Delete by time interval](#Delete-by-time-interval)\n",
+    "- [Delete entire table](#Delete-entire-table)\n",
+    "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+    "- [Conclusion](#Conclusion)\n",
+    "- [Learn more](#Learn-more)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6fc260fc",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 26.0.0 or later.\n",
+    "\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter`, 
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for 
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter 
Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Druid instance.<br>\n",
+    "     Update the `druid_host` variable to point to your Router endpoint. 
For example:\n",
+    "     ```\n",
+    "     druid_host = \"http://localhost:8888\"\n",
+    "     ```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b8a7510",
+   "metadata": {},
+   "source": [
+    "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.\n",
+    "\n",
+    "`druid_host` is the hostname and port for your Druid deployment. In a 
distributed environment, you can point to other Druid services. In this 
tutorial, you'll use the Router service as the `druid_host`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ed52d809",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In the Docker Compose tutorial environment, this is the Router\n",
+    "# service running at \"http://router:8888\".\n",
+    "# If you are not using the Docker Compose environment, edit the 
`druid_host`.\n",
+    "\n",
+    "druid_host = \"http://router:8888\"\n",
+    "druid_host"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e429b61e",
+   "metadata": {},
+   "source": [
+    "If your cluster is secure, you'll need to provide authorization information with each request. You can automate this by using the Requests `Session` feature. Although this tutorial assumes no authorization, the configuration below defines a session as an example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "cfa75fc5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "session = requests.Session()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f3c9a92",
+   "metadata": {},
+   "source": [
+    "Before proceeding with the tutorial, use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns the value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18a8a495",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + '/status/health'\n",
+    "response = session.get(endpoint)\n",
+    "response.text"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19144be9",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, code cells update the `endpoint` and other variables to call different Druid endpoints to accomplish each task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a281144",
+   "metadata": {},
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Apache Druid stores data partitioned by time chunks into segments and 
supports deleting data by dropping segments. To start, ingest the quickstart 
Wikipedia data and partition it by hour to create multiple segments.\n",
+    "\n",
+    "First, set the endpoint to the `sql/task` endpoint."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aa1e227f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + '/druid/v2/sql/task'\n",
+    "endpoint"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02e4f551",
+   "metadata": {},
+   "source": [
+    "Next, use the multi-stage query (MSQ) task engine and its `sql/task` 
endpoint to perform SQL-based ingestion and create a `wikipedia_hour` 
datasource with hour segmentation. \n",
+    "\n",
+    "To learn more about SQL-based ingestion, see [SQL-based 
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html). 
For information about the endpoint specifically, see [SQL-based ingestion and 
multi-stage query task 
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1208f3ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql = '''\n",
+    "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n",
+    "WITH \"ext\" AS (SELECT *\n",
+    "FROM TABLE(\n",
+    "  EXTERN(\n",
+    "    
'{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n",
+    "    '{\"type\":\"json\"}'\n",
+    "  )\n",
+    ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR, 
\"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR, 
\"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\" 
VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\" 
VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR, 
\"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n",
+    "SELECT\n",
+    "  TIME_PARSE(\"time\") AS \"__time\",\n",
+    "  \"channel\",\n",
+    "  \"cityName\",\n",
+    "  \"comment\",\n",
+    "  \"countryIsoCode\",\n",
+    "  \"countryName\",\n",
+    "  \"isAnonymous\",\n",
+    "  \"isMinor\",\n",
+    "  \"isNew\",\n",
+    "  \"isRobot\",\n",
+    "  \"isUnpatrolled\",\n",
+    "  \"metroCode\",\n",
+    "  \"namespace\",\n",
+    "  \"page\",\n",
+    "  \"regionIsoCode\",\n",
+    "  \"regionName\",\n",
+    "  \"user\",\n",
+    "  \"delta\",\n",
+    "  \"added\",\n",
+    "  \"deleted\"\n",
+    "FROM \"ext\"\n",
+    "PARTITIONED BY HOUR\n",
+    "'''"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1cf78bb7",
+   "metadata": {},
+   "source": [
+    "The following cell builds a Python dictionary that represents the Druid `SqlRequest` object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "543b03ee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql_request = {\n",
+    "    'query': sql\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8312c9d4",
+   "metadata": {},
+   "source": [
+    "With the SQL request ready, use the `json` parameter of the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. The result is a Requests `Response`, which is saved in a variable.\n",
+    "\n",
+    "Now, run the next cell to start the ingestion. You will see an asterisk 
`[*]` in the left margin while the task runs. It takes a while for Druid to 
load the resulting segments. Wait for the table to become ready."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9540926f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response = session.post(endpoint, json=sql_request)\n",
+    "response.status_code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cab33e7e",
+   "metadata": {},
+   "source": [
+    "Once the data has been ingested, Druid is populated with segments for 
each segment interval that contains data. You should see 24 segments associated 
with `wikipedia_hour`. \n",
+    "\n",
+    "For demonstration, let's view the segments generated for the 
`wikipedia_hour` datasource before any deletion is made. Run the following cell 
to set the endpoint to `/druid/v2/sql`. For more information on this endpoint, 
see [Druid SQL 
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+    "\n",
+    "Using this endpoint, you can query the `sys` [metadata 
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "956abeee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + '/druid/v2/sql'\n",
+    "endpoint"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "701550dd",
+   "metadata": {},
+   "source": [
+    "Now, you can query the metadata table to retrieve segment information. 
The following cell sends a SQL query to retrieve `segment_id` information for 
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to 
`objectLines`. This helps format the response with newlines and makes it easier 
to parse the output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bb54a6b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sql_request = {\n",
+    "  \"query\": \"SELECT segment_id FROM sys.segments WHERE 
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+    "  \"resultFormat\": \"objectLines\"\n",
+    "}\n",
+    "\n",
+    "response = session.post(endpoint, json=sql_request)\n",
+    "\n",
+    "print(response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f06e24e5",
+   "metadata": {},
+   "source": [
+    "Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` records, each containing the datasource name `wikipedia_hour` along with the start and end of its hour interval. The tail end of the ID contains the version timestamp from when the segments were created. \n",
+    "\n",
+    "For this tutorial, we are concerned with observing the start and end 
interval for each `segment_id`. \n",
+    "\n",
+    "For example: \n",
+    
"`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}`
 indicates this segment contains data from `2015-09-12T00:00:00.000` to 
`2015-09-12T01:00:00.000Z`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ca79f5f9",
+   "metadata": {},
+   "source": [
+    "## Deletion steps"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b6cd1c8c",
+   "metadata": {},
+   "source": [
+    "Permanent deletion of a segment in Druid has two steps:\n",
+    "\n",
+    "1. Mark a segment as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Marking a segment as \"unused\" is a soft delete: the segment is no longer available for querying, but its files remain in deep storage and its records remain in the metadata store. \n",
+    "2. Send a kill task to permanently remove \"unused\" segments. This 
deletes the segment file from deep storage and removes its record from the 
metadata store. This is a hard delete: the data is unrecoverable unless you 
have a backup."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9bc7f00",
+   "metadata": {},
+   "source": [
+    "## Delete by time interval"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1040bdaf",
+   "metadata": {},
+   "source": [
+    "You can delete segments within a specified time interval. This begins with marking all segments in the interval as \"unused\", then sending a kill task to delete them permanently from deep storage.\n",
+    "\n",
+    "First, set the endpoint variable to the Coordinator API endpoint 
`/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the 
datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9db8786d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + 
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+    "endpoint"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "863576a9",
+   "metadata": {},
+   "source": [
+    "The following cell constructs a JSON payload with the interval of segments to be deleted. This marks the segments in the interval from `18:00:00.000` (inclusive) to `20:00:00.000` (exclusive) as \"unused.\" This payload is sent to the endpoint in a `POST` request."

Review Comment:
   IIRC the original tutorial has this so we can just stay consistent with it
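
   For context, a minimal sketch of the two-step deletion this cell feeds into, as the notebook text describes it (the host, interval string, and payload shape are assumptions drawn from the notebook, not checked against the original tutorial):

   ```python
   druid_host = "http://localhost:8888"  # assumed Router address

   # Step 1: mark segments in an interval as "unused" (soft delete).
   mark_unused = druid_host + "/druid/coordinator/v1/datasources/wikipedia_hour/markUnused"
   payload = {"interval": "2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z"}

   # Step 2: a kill task then permanently removes "unused" segments
   # (hard delete: files leave deep storage, records leave the metadata store).
   # With a live cluster and a requests.Session named `session`, the calls
   # would be: session.post(mark_unused, json=payload), then a kill request
   # for the same interval.
   print(mark_unused)
   print(payload["interval"])
   ```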



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

