demo-kratia commented on code in PR #14781:
URL: https://github.com/apache/druid/pull/14781#discussion_r1294026663
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,975 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains copies of the existing data segments in deep storage and on Historical processes. As new data is added to Druid, deep storage grows over time unless data is explicitly removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, data accumulation over time in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "- [Conclusion](#Conclusion)\n",
+ "- [Learn more](#Learn-more)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages you'll need and defines a variable for the Druid host, where the Router service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://router:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e429b61e",
+ "metadata": {},
+ "source": [
+ "If your cluster is secure, you'll need to provide authorization
information on each request. You can automate it by using the Requests
`session` feature. Although this tutorial assumes no authorization, the
configuration below defines a session as an example."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cfa75fc5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "session = requests.Session()"
+ ]
+ },
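If your cluster does require authorization, one possible sketch configures the session once so every later request carries credentials. This uses the standard Requests `Session.auth` attribute with hypothetical basic-auth credentials; it is an illustration, not part of the tutorial environment:

```python
import requests

# A hedged sketch, not from the tutorial: attach hypothetical basic-auth
# credentials to the session so every subsequent request in this notebook
# sends them automatically. Replace with your cluster's real credentials.
session = requests.Session()
session.auth = ("admin", "password1")
```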
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before proceeding with the tutorial, use the `/status/health` endpoint to verify that the cluster is up and running. This endpoint returns the value `true` if the Druid cluster has finished starting up and is running. Do not move on from this point if the following call does not return `true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = session.get(endpoint)\n",
+ "response.text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. To start, ingest the quickstart
Wikipedia data and partition it by hour to create multiple segments.\n",
+ "\n",
+ "First, set the endpoint to the `sql/task` endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aa1e227f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, use the multi-stage query (MSQ) task engine and its `sql/task`
endpoint to perform SQL-based ingestion and create a `wikipedia_hour`
datasource with hour segmentation. \n",
+ "\n",
+ "To learn more about SQL-based ingestion, see [SQL-based
ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html).
For information about the endpoint specifically, see [SQL-based ingestion and
multi-stage query task
API](https://druid.apache.org/docs/latest/multi-stage-query/api.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1208f3ac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql = '''\n",
+ "REPLACE INTO \"wikipedia_hour\" OVERWRITE ALL\n",
+ "WITH \"ext\" AS (SELECT *\n",
+ "FROM TABLE(\n",
+ " EXTERN(\n",
+ "
'{\"type\":\"local\",\"filter\":\"wikiticker-2015-09-12-sampled.json.gz\",\"baseDir\":\"quickstart/tutorial/\"}',\n",
+ " '{\"type\":\"json\"}'\n",
+ " )\n",
+ ") EXTEND (\"time\" VARCHAR, \"channel\" VARCHAR, \"cityName\" VARCHAR,
\"comment\" VARCHAR, \"countryIsoCode\" VARCHAR, \"countryName\" VARCHAR,
\"isAnonymous\" VARCHAR, \"isMinor\" VARCHAR, \"isNew\" VARCHAR, \"isRobot\"
VARCHAR, \"isUnpatrolled\" VARCHAR, \"metroCode\" BIGINT, \"namespace\"
VARCHAR, \"page\" VARCHAR, \"regionIsoCode\" VARCHAR, \"regionName\" VARCHAR,
\"user\" VARCHAR, \"delta\" BIGINT, \"added\" BIGINT, \"deleted\" BIGINT))\n",
+ "SELECT\n",
+ " TIME_PARSE(\"time\") AS \"__time\",\n",
+ " \"channel\",\n",
+ " \"cityName\",\n",
+ " \"comment\",\n",
+ " \"countryIsoCode\",\n",
+ " \"countryName\",\n",
+ " \"isAnonymous\",\n",
+ " \"isMinor\",\n",
+ " \"isNew\",\n",
+ " \"isRobot\",\n",
+ " \"isUnpatrolled\",\n",
+ " \"metroCode\",\n",
+ " \"namespace\",\n",
+ " \"page\",\n",
+ " \"regionIsoCode\",\n",
+ " \"regionName\",\n",
+ " \"user\",\n",
+ " \"delta\",\n",
+ " \"added\",\n",
+ " \"deleted\"\n",
+ "FROM \"ext\"\n",
+ "PARTITIONED BY HOUR\n",
+ "'''"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "The following cell builds up a Python map that represents the Druid `SqlRequest` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " 'query': sql\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8312c9d4",
+ "metadata": {},
+ "source": [
+ "With the SQL request ready, use the `json` parameter of the `Session` `post` method to send a `POST` request with the `sql_request` object as the payload. The result is a Requests `Response`, which is saved in a variable.\n",
+ "\n",
+ "Now, run the next cell to start the ingestion. You will see an asterisk
`[*]` in the left margin while the task runs. It takes a while for Druid to
load the resulting segments. Wait for the table to become ready."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9540926f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.post(endpoint, json=sql_request)\n",
+ "response.status_code"
+ ]
+ },
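If you would rather wait programmatically than watch the asterisk, one possible sketch polls the Overlord task status endpoint (`/druid/indexer/v1/task/{taskId}/status`). The `wait_for_task` helper below is illustrative and not part of the tutorial; `session` and `druid_host` come from the earlier cells, and the task ID is available from the `sql/task` response as `response.json()["taskId"]`:

```python
import time

# Illustrative helper: poll a Druid ingestion task until it reaches a
# terminal state. The status endpoint is the Overlord task API; the
# nested "status" field holds values such as RUNNING, SUCCESS, FAILED.
def wait_for_task(session, druid_host, task_id, poll_seconds=5):
    status_url = f"{druid_host}/druid/indexer/v1/task/{task_id}/status"
    while True:
        status = session.get(status_url).json()["status"]["status"]
        if status in ("SUCCESS", "FAILED"):
            return status
        time.sleep(poll_seconds)
```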
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid is populated with segments for
each segment interval that contains data. You should see 24 segments associated
with `wikipedia_hour`. \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the
`wikipedia_hour` datasource before any deletion is made. Run the following cell
to set the endpoint to `/druid/v2/sql`. For more information on this endpoint,
see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information.
The following cell sends a SQL query to retrieve `segment_id` information for
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to
`objectLines`. This helps format the response with newlines and makes it easier
to parse the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
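Because `objectLines` returns one JSON object per line, you can also parse the raw text into Python values with the standard library. The `parse_object_lines` helper below is a hypothetical convenience, not part of the tutorial:

```python
import json

# Hypothetical helper: each non-empty line of an "objectLines" response
# is a standalone JSON object, so parse the text line by line.
def parse_object_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Usage with the response above:
# segment_ids = [row["segment_id"] for row in parse_object_lines(response.text)]
```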
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there are 24 `segment_id` records, each containing the datasource name `wikipedia_hour` along with the start and end of its hour interval. The tail end of the ID is the version timestamp assigned when the segment was created. \n",
+ "\n",
+ "For this tutorial, we are concerned with observing the start and end
interval for each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+
"`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}`
indicates this segment contains data from `2015-09-12T00:00:00.000` to
`2015-09-12T01:00:00.000Z`."
+ ]
+ },
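The ID structure described above (datasource, start, end, version) can be split mechanically. A sketch, assuming the three trailing timestamps never contain underscores; the helper name is made up for illustration:

```python
# Hypothetical helper: split a segment ID of the form
# <datasource>_<start>_<end>_<version> into its parts. Splitting from the
# right keeps any underscores inside the datasource name intact.
def parse_segment_id(segment_id):
    datasource, start, end, version = segment_id.rsplit("_", 3)
    return {"datasource": datasource, "start": start, "end": end, "version": version}
```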
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Druid has two steps:\n",
+ "\n",
+ "1. Mark a segment as \"unused.\" This step occurs when a segment is dropped by a [drop rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules) or manually marked as \"unused\" through the Coordinator API or web console. Note that marking a segment as \"unused\" is a soft delete: the segment is no longer available for querying, but its files remain in deep storage and its records remain in the metadata store. \n",
+ "2. Send a kill task to permanently remove \"unused\" segments. This
deletes the segment file from deep storage and removes its record from the
metadata store. This is a hard delete: the data is unrecoverable unless you
have a backup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9bc7f00",
+ "metadata": {},
+ "source": [
+ "## Delete by time interval"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1040bdaf",
+ "metadata": {},
+ "source": [
+ "Segments can be deleted within a specified time interval. This begins with marking all segments in the interval as \"unused\", then sending a kill request to delete them permanently from deep storage.\n",
+ "\n",
+ "First, set the endpoint variable to the Coordinator API endpoint
`/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the
datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9db8786d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "863576a9",
+ "metadata": {},
+ "source": [
+ "The following cell constructs a JSON payload with the interval of segments to be deleted. This marks the segments from `18:00:00.000` (inclusive) up to `20:00:00.000` (exclusive) as \"unused.\" This payload is sent to the endpoint in a `POST` request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "79387e72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n",
+ "}\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "response.text"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "89e2fcb4",
+ "metadata": {},
+ "source": [
+ "The response from the above cell should return a JSON object with the
property `\"numChangedSegments\"` and the value `2`. This refers to the
following segments:\n",
+ "\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z\"}`\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z\"}`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e61cae23",
+ "metadata": {},
+ "source": [
+ "Next, verify that the segments have been soft deleted. The following cell
sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request
querying for the existing `segment_id`s. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ea7c0d26",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "747bd12c",
+ "metadata": {},
+ "source": [
+ "Observe the response above. There should now be only 22 segments; the \"unused\" segments have been soft deleted. \n",
+ "\n",
+ "However, because you've only soft deleted the segments, they remain in deep storage.\n",
+ "\n",
+ "Before permanently deleting the segments, you can verify that they've only been soft deleted by inspecting your deep storage. The soft deleted segments are still there. This step is optional; you can move on to the next set of cells without completing it."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "943b36cc",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid outside of the Docker Compose environment, follow these instructions to view segments in deep storage:\n",
+ " \n",
+ "* Navigate to the distribution directory for Druid. This is the same place where you run `./bin/start-druid` to start up Druid.\n",
+ "* Run this command: `ls -l1 var/druid/segments/wikipedia_hour`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8ecedcaa",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid within the Docker Compose
environment, follow these instructions to view the segments in deep storage:\n",
+ "\n",
+ "* Navigate to your Docker terminal.\n",
+ "* Run this command: `docker exec -it historical ls
/opt/shared/segments/wikipedia_hour`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "12737023",
+ "metadata": {},
+ "source": [
+ "The output should look similar to this:\n",
+ "\n",
+ "```bash\n",
+ "$ ls -l1 var/druid/segments/wikipedia_hour/\n",
+ "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n",
+ "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n",
+ "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n",
+ "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n",
+ "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n",
+ "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n",
+ "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n",
+ "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n",
+ "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n",
+ "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n",
+ "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n",
+ "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n",
+ "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n",
+ "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n",
+ "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n",
+ "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n",
+ "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n",
+ "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n",
+ "2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z\n",
+ "2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z\n",
+ "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n",
+ "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n",
+ "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n",
+ "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38cca397",
+ "metadata": {},
+ "source": [
+ "Now, you can move on to sending a kill task to permanently delete the segments from deep storage. This can be done with the `/druid/coordinator/v1/datasources/:dataSource/intervals/:interval` endpoint.\n",
+ "\n",
+ "The following cell uses the endpoint, setting the `dataSource` path parameter to `wikipedia_hour` with the interval `2015-09-12_2015-09-13`. \n",
+ "\n",
+ "Notice that the interval `2015-09-12_2015-09-13` covers the entirety of the 22 segments. Druid only permanently deletes the \"unused\" segments within this interval. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "672ad739",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "751380d5",
+ "metadata": {},
+ "source": [
+ "Run the next cell to send the `DELETE` request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2a6bdc6c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.delete(endpoint)\n",
+ "print(response.status_code)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "69d6e89a",
+ "metadata": {},
+ "source": [
+ "Last, observe that the segments have been deleted from deep storage in
the following sample output. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "84ee47fd",
+ "metadata": {},
+ "source": [
+ "```bash\n",
+ "$ ls -l1 var/druid/segments/wikipedia_hour/\n",
+ "2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z\n",
+ "2015-09-12T01:00:00.000Z_2015-09-12T02:00:00.000Z\n",
+ "2015-09-12T02:00:00.000Z_2015-09-12T03:00:00.000Z\n",
+ "2015-09-12T03:00:00.000Z_2015-09-12T04:00:00.000Z\n",
+ "2015-09-12T04:00:00.000Z_2015-09-12T05:00:00.000Z\n",
+ "2015-09-12T05:00:00.000Z_2015-09-12T06:00:00.000Z\n",
+ "2015-09-12T06:00:00.000Z_2015-09-12T07:00:00.000Z\n",
+ "2015-09-12T07:00:00.000Z_2015-09-12T08:00:00.000Z\n",
+ "2015-09-12T08:00:00.000Z_2015-09-12T09:00:00.000Z\n",
+ "2015-09-12T09:00:00.000Z_2015-09-12T10:00:00.000Z\n",
+ "2015-09-12T10:00:00.000Z_2015-09-12T11:00:00.000Z\n",
+ "2015-09-12T11:00:00.000Z_2015-09-12T12:00:00.000Z\n",
+ "2015-09-12T12:00:00.000Z_2015-09-12T13:00:00.000Z\n",
+ "2015-09-12T13:00:00.000Z_2015-09-12T14:00:00.000Z\n",
+ "2015-09-12T14:00:00.000Z_2015-09-12T15:00:00.000Z\n",
+ "2015-09-12T15:00:00.000Z_2015-09-12T16:00:00.000Z\n",
+ "2015-09-12T16:00:00.000Z_2015-09-12T17:00:00.000Z\n",
+ "2015-09-12T17:00:00.000Z_2015-09-12T18:00:00.000Z\n",
+ "2015-09-12T20:00:00.000Z_2015-09-12T21:00:00.000Z\n",
+ "2015-09-12T21:00:00.000Z_2015-09-12T22:00:00.000Z\n",
+ "2015-09-12T22:00:00.000Z_2015-09-12T23:00:00.000Z\n",
+ "2015-09-12T23:00:00.000Z_2015-09-13T00:00:00.000Z\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0d0578c",
+ "metadata": {},
+ "source": [
+ "## Delete entire table"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b8d0260a",
+ "metadata": {},
+ "source": [
+ "You can delete entire tables the same way you can delete parts of a
table, using intervals.\n",
+ "\n",
+ "Run the following cell to reset the endpoint to
`/druid/coordinator/v1/datasources/:dataSource/markUnused`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd354886",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ed3cf7d3",
+ "metadata": {},
+ "source": [
+ "Next, send a `POST` with the payload `{\"interval\":
\"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"}` to mark the entirety of
the table as \"unused.\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "25639752",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sql_request = {\n",
+ " \"interval\": \"2015-09-12T00:00:00.000Z/2015-09-13T01:00:00.000Z\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)\n",
+ "\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bbbba823",
+ "metadata": {},
+ "source": [
+ "To verify the segment changes, the following cell sets the endpoint to `/druid/v2/sql` and sends a SQL-based request. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "eac1db4c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "sql_request = {\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "}\n",
+ "\n",
+ "response = session.post(endpoint, json=sql_request)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "deae8727",
+ "metadata": {},
+ "source": [
+ "Run the next cell to view the response. You should see that `response.text` returns nothing, but `response.status_code` returns a 200. \n",
+ "\n",
+ "The response should return the remaining segments, but since the table was deleted, there are no segments to return."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "12c11291",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(response.text)\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2e9a74da",
+ "metadata": {},
+ "source": [
+ "So far, you've soft deleted the table. Run the following cells to
permanently delete the table from deep storage:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6c3d7ec9",
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/intervals/2015-09-12_2015-09-13'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "98834167",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = session.delete(endpoint)\n",
+ "response.status_code"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b2d59c8",
+ "metadata": {},
+ "source": [
+ "## Delete by segment ID"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4e8453e",
+ "metadata": {},
+ "source": [
+ "In addition to deleting by interval, you can delete segments by using
`segment_id`. Let's load in some new data to work with.\n",
Review Comment:
I grouped deleting by interval and table together since they both use the
same base method, passing a time interval.