demo-kratia commented on code in PR #14781:
URL: https://github.com/apache/druid/pull/14781#discussion_r1294026663
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,975 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains copies of the existing data segments in deep storage and on Historical processes. As new data is added to Druid, deep storage grows over time unless data is explicitly removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, data accumulation over time in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "- [Conclusion](#Conclusion)\n",
+ "- [Learn more](#Learn-more)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e429b61e",
+ "metadata": {},
+ "source": [
+ "If your cluster is secure, you'll need to provide authorization
information on each request. You can automate it by using the Requests
`session` feature. Although this tutorial assumes no authorization, the
configuration below defines a session as an example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cfa75fc5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "session = requests.Session()"
+ ]
+ },
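If your cluster does require authorization, one possible sketch configures the session once so every later request carries credentials. This uses the standard Requests `Session.auth` attribute with hypothetical basic-auth credentials; it is an illustration, not part of the tutorial environment:

```python
import requests

# A hedged sketch, not from the tutorial: attach hypothetical basic-auth
# credentials to the session so every subsequent request in this notebook
# sends them automatically. Replace with your cluster's real credentials.
session = requests.Session()
session.auth = ("admin", "password1")
```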
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before proceeding with the tutorial, use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns the value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = session.get(endpoint)\n",
+ "response.text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. To start, ingest the quickstart
Wikipedia data and partition it by hour to create multiple segments.\n",
+ "\n",
+ "First, set the endpoint to the `sql/task` endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aa1e227f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, use the multi-stage query (MSQ) task engine and its `sql/task`
endpoint to perform SQL-based ingestion and create a `wikipedia_hour`
datasource with hour segmentation. \n",
+ "\n",
+ "To learn more about SQL-based ingestion, see [SQL-based
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html).
For information about the endpoint specifically, see [SQL-based ingestion and
multi-stage query task
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1208f3ac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR,
\"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR,
\"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\"
VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\"
VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR,
\"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"time\") AS \"__time\",\n",
+ " \"channel\",\n",
+ " \"cityName\",\n",
+ " \"comment\",\n",
+ " \"countryIsoCode\",\n",
+ " \"countryName\",\n",
+ " \"isAnonymous\",\n",
+ " \"isMinor\",\n",
+ " \"isNew\",\n",
+ " \"isRobot\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"metroCode\",\n",
+ " \"namespace\",\n",
+ " \"page\",\n",
+ " \"regionIsoCode\",\n",
+ " \"regionName\",\n",
+ " \"user\",\n",
+ " \"delta\",\n",
+ " \"added\",\n",
+ " \"deleted\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY HOUR\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "The following cell builds up a Python map that represents the Druid `SqlRequest` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " 'query': sql\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8312c9d4",
+ "metadata": {},
+ "source": [
+ "With the SQL request ready, use the `json` parameter of the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. The result is a Requests `Response`, which is saved in a variable.\n",
+ "\n",
+ "Now, run the next cell to start the ingestion. You will see an asterisk
`[*]` in the left margin while the task runs. It takes a while for Druid to
load the resulting segments. Wait for the table to become ready."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9540926f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.post(endpoint, json=sql_request)\n",
+ "response.status_code"
+ ]
+ },
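If you would rather wait programmatically than watch the asterisk, one possible sketch polls the Overlord task status endpoint (`/druid/indexer/v1/task/{taskId}/status`). The `wait_for_task` helper below is illustrative and not part of the tutorial; `session` and `druid_host` come from the earlier cells, and the task ID is available from the `sql/task` response as `response.json()["taskId"]`:

```python
import time

# Illustrative helper: poll a Druid ingestion task until it reaches a
# terminal state. The status endpoint is the Overlord task API; the
# nested "status" field holds values such as RUNNING, SUCCESS, FAILED.
def wait_for_task(session, druid_host, task_id, poll_seconds=5):
    status_url = f"{druid_host}/druid/indexer/v1/task/{task_id}/status"
    while True:
        status = session.get(status_url).json()["status"]["status"]
        if status in ("SUCCESS", "FAILED"):
            return status
        time.sleep(poll_seconds)
```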
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid is populated with segments for
each segment interval that contains data. You should see 24 segments associated
with `wikipedia_hour`. \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the
`wikipedia_hour` datasource before any deletion is made. Run the following cell
to set the endpoint to `/druid/v2/sql`. For more information on this endpoint,
see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information.
The following cell sends a SQL query to retrieve `segment_id` information for
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to
`objectLines`. This helps format the response with newlines and makes it easier
to parse the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
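Because `objectLines` returns one JSON object per line, you can also parse the raw text into Python values with the standard library. The `parse_object_lines` helper below is a hypothetical convenience, not part of the tutorial:

```python
import json

# Hypothetical helper: each non-empty line of an "objectLines" response
# is a standalone JSON object, so parse the text line by line.
def parse_object_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Usage with the response above:
# segment_ids = [row["segment_id"] for row in parse_object_lines(response.text)]
```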
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` records, each containing the datasource name `wikipedia_hour` along with the start and end of its hour interval. The tail end of the ID is the version timestamp assigned when the segment was created. \n",
+ "\n",
+ "For this tutorial, we are concerned with observing the start and end
interval for each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+
"`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}`
indicates this segment contains data from `2015-09-12T00:00:00.000` to
`2015-09-12T01:00:00.000Z`."
+ ]
+ },
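The ID structure described above (datasource, start, end, version) can be split mechanically. A sketch, assuming the three trailing timestamps never contain underscores; the helper name is made up for illustration:

```python
# Hypothetical helper: split a segment ID of the form
# <datasource>_<start>_<end>_<version> into its parts. Splitting from the
# right keeps any underscores inside the datasource name intact.
def parse_segment_id(segment_id):
    datasource, start, end, version = segment_id.rsplit("_", 3)
    return {"datasource": datasource, "start": start, "end": end, "version": version}
```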
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Druid has two steps:\n",
+ "\n",
+ "1. Mark a segment as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Note that marking a segment as \"unused\" is a soft delete: the segment is no longer available for querying, but its files remain in deep storage and its records remain in the metadata store. \n",
+ "2. Send a kill task to permanently remove \"unused\" segments. This
deletes the segment file from deep storage and removes its record from the
metadata store. This is a hard delete: the data is unrecoverable unless you
have a backup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9bc7f00",
+ "metadata": {},
+ "source": [
+ "## Delete by time interval"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1040bdaf",
+ "metadata": {},
+ "source": [
+ "Segments can be deleted within a specified time interval. This begins with marking all segments in the interval as \"unused\", then sending a kill request to delete them permanently from deep storage.\n",
+ "\n",
+ "First, set the endpoint variable to the Coordinator API endpoint
`/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the
datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9db8786d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "863576a9",
+ "metadata": {},
+ "source": [
+ "The following cell constructs a JSON payload with the interval of segments to be deleted. This marks the segments from `18:00:00.000` (inclusive) up to `20:00:00.000` (exclusive) as \"unused.\" This payload is sent to the endpoint in a `POST` request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "79387e72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n",
+ "}\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "response.text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "89e2fcb4",
+ "metadata": {},
+ "source": [
+ "The response from the above cell should return a JSON object with the
property `\"numChangedSegments\"` and the value `2`. This refers to the
following segments:\n",
+ "\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z\"}`\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z\"}`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e61cae23",
+ "metadata": {},
+ "source": [
+ "Next, verify that the segments have been soft deleted. The following cell
sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request
querying for the existing `segment_id`s. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ea7c0d26",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "747bd12c",
+ "metadata": {},
+ "source": [
+ "Observe the response above. There should now be only 22 segments; the \"unused\" segments have been soft deleted. \n",
+ "\n",
+ "However, because you've only soft deleted the segments, they remain in deep storage.\n",
+ "\n",
+ "Before permanently deleting the segments, you can verify that they've only been soft deleted by inspecting your deep storage. The soft deleted segments are still there. This step is optional; you can move on to the next set of cells without completing it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "943b36cc",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid outside of the Docker Compose environment, follow these instructions to view segments in deep storage:\n",
+ " \n",
+ "* Navigate to the distribution directory for Druid. This is the same place where you run `./bin/start-druid` to start up Druid.\n",
+ "* Run this command: `ls -l1 var/druid/segments/wikipedia_hour`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ecedcaa",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid within the Docker Compose
environment, follow these instructions to view the segments in deep storage:\n",
+ "\n",
+ "* Navigate to your Docker terminal.\n",
+ "* Run this command: `docker exec -it historical ls
/opt/shared/segments/wikipedia_hour`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12737023",
+ "metadata": {},
+ "source": [
+ "The output should look similar to this:\n",
+ "\n",
+ "```bash\n",
+ "$ ls -l1 var/druid/segments/wikipedia_hour/\n",
+ "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n",
+ "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n",
+ "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n",
+ "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n",
+ "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n",
+ "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n",
+ "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n",
+ "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n",
+ "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n",
+ "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n",
+ "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n",
+ "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n",
+ "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n",
+ "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n",
+ "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n",
+ "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n",
+ "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n",
+ "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n",
+ "2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z\n",
+ "2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z\n",
+ "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n",
+ "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n",
+ "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n",
+ "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38cca397",
+ "metadata": {},
+ "source": [
+ "Now, you can move on to sending a kill task to permanently delete the segments from deep storage. This can be done with the `/druid/coordinator/v1/datasources/:dataSource/intervals/:interval` endpoint.\n",
+ "\n",
+ "The following cell uses the endpoint, setting the `dataSource` path parameter to `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. \n",
+ "\n",
+ "Notice that the interval `2015-09-12_2015-09-13` covers the entirety of the 22 segments. Druid only permanently deletes the \"unused\" segments within this interval. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "672ad739",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "751380d5",
+ "metadata": {},
+ "source": [
+ "Run the next cell to send the `DELETE` request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2a6bdc6c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.delete(endpoint)\n",
+ "print(response.status_code)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "69d6e89a",
+ "metadata": {},
+ "source": [
+ "Last, observe that the segments have been deleted from deep storage in
the following sample output. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84ee47fd",
+ "metadata": {},
+ "source": [
+ "```bash\n",
+ "$ ls -l1 var/druid/segments/wikipedia_hour/\n",
+ "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n",
+ "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n",
+ "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n",
+ "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n",
+ "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n",
+ "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n",
+ "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n",
+ "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n",
+ "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n",
+ "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n",
+ "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n",
+ "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n",
+ "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n",
+ "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n",
+ "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n",
+ "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n",
+ "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n",
+ "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n",
+ "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n",
+ "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n",
+ "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n",
+ "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0d0578c",
+ "metadata": {},
+ "source": [
+ "## Delete entire table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8d0260a",
+ "metadata": {},
+ "source": [
+ "You can delete entire tables the same way you can delete parts of a
table, using intervals.\n",
+ "\n",
+ "Run the following cell to reset the endpoint to
`/druid/coordinator/v1/datasources/:dataSource/markUnused`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd354886",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed3cf7d3",
+ "metadata": {},
+ "source": [
+ "Next, send a `POST` with the payload `{\"interval\":
\"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of
the table as \"unused.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "25639752",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbbba823",
+ "metadata": {},
+ "source": [
+ "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eac1db4c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "deae8727",
+ "metadata": {},
+ "source": [
+ "Run the next cell to view the response. You should see that `response.text` returns nothing, but `response.status_code` returns a 200. \n",
+ "\n",
+ "The response should return the remaining segments, but since the table was deleted, there are no segments to return."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12c11291",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(response.text)\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e9a74da",
+ "metadata": {},
+ "source": [
+ "So far, you've soft deleted the table. Run the following cells to
permanently delete the table from deep storage:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c3d7ec9",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "98834167",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.delete(endpoint)\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b2d59c8",
+ "metadata": {},
+ "source": [
+ "## Delete by segment ID"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4e8453e",
+ "metadata": {},
+ "source": [
+ "In addition to deleting by interval, you can delete segments by using
`segment_id`. Let's load in some new data to work with.\n",
Review Comment:
I grouped deleting by interval and table together since they both use the
same base method, passing a time interval.