317brian commented on code in PR #14781:
URL: https://github.com/apache/druid/pull/14781#discussion_r1287768762
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
Review Comment:
Is there a reason to say "Python value `true`" instead of just saying that it returns true? The latter is more natural/human.
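
If we want the check to be stricter than eyeballing the printed text, the cell could also parse the response and assert on it. A minimal sketch along the lines of the existing cell (reusing the notebook's `druid_host`; the localhost value here is just an example):

```python
import requests

druid_host = "http://localhost:8888"  # as defined earlier in the notebook

# GET /status/health returns the JSON literal `true` once the cluster is up.
response = requests.get(druid_host + "/status/health")

# response.json() parses that literal into Python's True, so the check can be explicit.
if response.status_code == 200 and response.json() is True:
    print("Druid is up and running")
else:
    print("Druid is not ready yet:", response.text)
```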
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
Review Comment:
```suggestion
"Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. To start, we will ingest the
quickstart Wikipedia data and partition it by hour to create multiple segments
.\n",
```
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion specs to create a
`wikipedia_hour` datasource with hour segmentation. There are many different
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
to ingest data, this tutorial uses [native batch
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html)
and the `/druid/indexer/v1/task` endpoint. For more information on construction
an ingestion spec, see [ingestion spec
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
Review Comment:
Native batch is the legacy way to ingest batch data. Use SQL-based ingestion/MSQ
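
For reference, the SQL-based equivalent would submit a `REPLACE ... PARTITIONED BY HOUR` statement to the `/druid/v2/sql/task` endpoint. Roughly something like this (a sketch, not tested against this notebook; the column list is trimmed for illustration):

```python
import requests

druid_host = "http://localhost:8888"  # adjust for your deployment

# SQL-based ingestion (MSQ): REPLACE with PARTITIONED BY HOUR produces hourly
# segments, matching what the native-batch spec in this notebook does.
query = """
REPLACE INTO "wikipedia_hour" OVERWRITE ALL
SELECT
  TIME_PARSE("time") AS "__time",
  "channel", "page", "user", "added", "deleted"
FROM TABLE(EXTERN(
  '{"type":"local","baseDir":"quickstart/tutorial/","filter":"wikiticker-2015-09-12-sampled.json.gz"}',
  '{"type":"json"}',
  '[{"name":"time","type":"string"},{"name":"channel","type":"string"},{"name":"page","type":"string"},{"name":"user","type":"string"},{"name":"added","type":"long"},{"name":"deleted","type":"long"}]'
))
PARTITIONED BY HOUR
"""

# MSQ tasks go to the SQL task endpoint, not /druid/indexer/v1/task.
response = requests.post(druid_host + "/druid/v2/sql/task", json={"query": query})
print(response.json())  # includes the taskId to poll for completion
```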
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion specs to create a
`wikipedia_hour` datasource with hour segmentation. There are many different
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
to ingest data, this tutorial uses [native batch
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html)
and the `/druid/indexer/v1/task` endpoint. For more information on construction
an ingestion spec, see [ingestion spec
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ff9d098",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"spec\": {\n",
+ " \"dataSchema\": {\n",
+ " \"dataSource\": \"wikipedia_hour\",\n",
+ " \"timestampSpec\": {\n",
+ " \"column\": \"time\",\n",
+ " \"format\": \"iso\"\n",
+ " },\n",
+ " \"dimensionsSpec\": {\n",
+ " \"useSchemaDiscovery\": True\n",
+ " },\n",
+ " \"metricsSpec\": [],\n",
+ " \"granularitySpec\": {\n",
+ " \"type\": \"uniform\",\n",
+ " \"segmentGranularity\": \"hour\",\n",
+ " \"queryGranularity\": \"none\",\n",
+ " \"intervals\": [\n",
+ " \"2015-09-12/2015-09-13\"\n",
+ " ],\n",
+ " \"rollup\": False\n",
+ " }\n",
+ " },\n",
+ " \"ioConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"inputSource\": {\n",
+ " \"type\": \"local\",\n",
+ " \"baseDir\": \"quickstart/tutorial/\",\n",
+ " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n",
+ " },\n",
+ " \"inputFormat\": {\n",
+ " \"type\": \"json\"\n",
+ " },\n",
+ " \"appendToExisting\": False\n",
+ " },\n",
+ " \"tuningConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"maxRowsPerSegment\": 5000000,\n",
+ " \"maxRowsInMemory\": 25000\n",
+ " }\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "With the payload and headers ready, run the next cell to send a `POST`
request to the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ " \n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid will be populated with segments
for each segment interval that contains data. Since the `wikipedia_hour` was
ingested with `HOUR` granularity, there will be 24 segments associated with
`wikipedia_hour`. \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the
`wikipedia_hour` datasource before any deletion is made. Run the following cell
to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint,
see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information.
The following cell sends a SQL query to retrieve `segment_id` information for
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to
`objectLines`. This helps format the response with newlines and makes it easier
to parse the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ " \n",
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there
are 24 `segment_id`, each containing the datasource name `wikipedia_hour`,
along with the start and end hour interval. The tail end of the ID also
contains the timestamp of when the request was made. \n",
+ "\n",
+ "For this tutorial, we are concerned with observing the start and end
interval for each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+ "`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}` indicates this segment contains data from `2015-09-12T00:00:00.000Z` to `2015-09-12T01:00:00.000Z`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Apache Druid has two steps:\n",
+ "\n",
+ "1. A segment is marked as \"unused.\" This step occurs when a segment is
dropped by a [drop
rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules)
or manually marked as \"unused\" through the Coordinator API or web console.
Note that marking a segment as \"unused\" is a soft delete, it is no longer
available for querying but the segment files remain in deep storage and segment
records remain in the metadata store. \n",
+ "2. A kill task is sent to permanently remove \"unused\" segments. This
deletes the segment file from deep storage and removes its record from the
metadata store. This is a hard delete: the data is unrecoverable unless you
have a backup."
Review Comment:
Convert these to active voice from passive
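
While rewording, it may also help to show the two steps as concrete API calls. Roughly (a sketch reusing the notebook's `druid_host`; endpoint paths are as I recall them from the Coordinator and Overlord API docs, so worth double-checking against the targeted Druid version):

```python
import requests

druid_host = "http://localhost:8888"
datasource = "wikipedia_hour"

# Step 1 (soft delete): the Coordinator marks segments in an interval as unused.
# The segments stop serving queries but stay in deep storage and the metadata store.
requests.post(
    druid_host + f"/druid/coordinator/v1/datasources/{datasource}/markUnused",
    json={"interval": "2015-09-12T00:00:00.000Z/2015-09-12T06:00:00.000Z"},
)

# Step 2 (hard delete): a kill task permanently removes unused segments from
# deep storage and deletes their records from the metadata store.
kill_task = {"type": "kill", "dataSource": datasource, "interval": "2015-09-12/2015-09-13"}
response = requests.post(druid_host + "/druid/indexer/v1/task", json=kill_task)
print(response.json())  # the kill task's taskId
```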
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion specs to create a
`wikipedia_hour` datasource with hour segmentation. There are many different
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
to ingest data, this tutorial uses [native batch
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html)
and the `/druid/indexer/v1/task` endpoint. For more information on construction
an ingestion spec, see [ingestion spec
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ff9d098",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"spec\": {\n",
+ " \"dataSchema\": {\n",
+ " \"dataSource\": \"wikipedia_hour\",\n",
+ " \"timestampSpec\": {\n",
+ " \"column\": \"time\",\n",
+ " \"format\": \"iso\"\n",
+ " },\n",
+ " \"dimensionsSpec\": {\n",
+ " \"useSchemaDiscovery\": True\n",
+ " },\n",
+ " \"metricsSpec\": [],\n",
+ " \"granularitySpec\": {\n",
+ " \"type\": \"uniform\",\n",
+ " \"segmentGranularity\": \"hour\",\n",
+ " \"queryGranularity\": \"none\",\n",
+ " \"intervals\": [\n",
+ " \"2015-09-12/2015-09-13\"\n",
+ " ],\n",
+ " \"rollup\": False\n",
+ " }\n",
+ " },\n",
+ " \"ioConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"inputSource\": {\n",
+ " \"type\": \"local\",\n",
+ " \"baseDir\": \"quickstart/tutorial/\",\n",
+ " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n",
+ " },\n",
+ " \"inputFormat\": {\n",
+ " \"type\": \"json\"\n",
+ " },\n",
+ " \"appendToExisting\": False\n",
+ " },\n",
+ " \"tuningConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"maxRowsPerSegment\": 5000000,\n",
+ " \"maxRowsInMemory\": 25000\n",
+ " }\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "With the payload and headers ready, run the next cell to send a `POST`
request to the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ " \n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid will be populated with segments
for each segment interval that contains data. Since the `wikipedia_hour` was
ingested with `HOUR` granularity, there will be 24 segments associated with
`wikipedia_hour`. \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the
`wikipedia_hour` datasource before any deletion is made. Run the following cell
to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint,
see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information.
The following cell sends a SQL query to retrieve `segment_id` information for
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to
`objectLines`. This helps format the response with newlines and makes it easier
to parse the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ " \n",
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there
are 24 `segment_id`, each containing the datasource name `wikipedia_hour`,
along with the start and end hour interval. The tail end of the ID also
contains the timestamp of when the request was made. \n",
+ "\n",
+ "For this tutorial, we are concerned with observing the start and end
interval for each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+ "`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}` indicates this segment contains data from `2015-09-12T00:00:00.000Z` to `2015-09-12T01:00:00.000Z`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Apache Druid has two steps:\n",
Review Comment:
```suggestion
"Permanent deletion of a segment in Druid has two steps:\n",
```
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion specs to create a
`wikipedia_hour` datasource with hour segmentation. There are many different
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
to ingest data, this tutorial uses [native batch
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html)
and the `/druid/indexer/v1/task` endpoint. For more information on construction
an ingestion spec, see [ingestion spec
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ff9d098",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"spec\": {\n",
+ " \"dataSchema\": {\n",
+ " \"dataSource\": \"wikipedia_hour\",\n",
+ " \"timestampSpec\": {\n",
+ " \"column\": \"time\",\n",
+ " \"format\": \"iso\"\n",
+ " },\n",
+ " \"dimensionsSpec\": {\n",
+ " \"useSchemaDiscovery\": True\n",
+ " },\n",
+ " \"metricsSpec\": [],\n",
+ " \"granularitySpec\": {\n",
+ " \"type\": \"uniform\",\n",
+ " \"segmentGranularity\": \"hour\",\n",
+ " \"queryGranularity\": \"none\",\n",
+ " \"intervals\": [\n",
+ " \"2015-09-12/2015-09-13\"\n",
+ " ],\n",
+ " \"rollup\": False\n",
+ " }\n",
+ " },\n",
+ " \"ioConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"inputSource\": {\n",
+ " \"type\": \"local\",\n",
+ " \"baseDir\": \"quickstart/tutorial/\",\n",
+ " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n",
+ " },\n",
+ " \"inputFormat\": {\n",
+ " \"type\": \"json\"\n",
+ " },\n",
+ " \"appendToExisting\": False\n",
+ " },\n",
+ " \"tuningConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"maxRowsPerSegment\": 5000000,\n",
+ " \"maxRowsInMemory\": 25000\n",
+ " }\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "With the payload and headers ready, run the next cell to send a `POST`
request to the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ " \n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid will be populated with segments
for each segment interval that contains data. Since the `wikipedia_hour` was
ingested with `HOUR` granularity, there will be 24 segments associated with
`wikipedia_hour`. \n",
Review Comment:
This isn't necessarily always true. If I set the max rows per segment to 1,
you'd get way more than 24 total segments.
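
With this spec the sample data happens to produce one segment per hour, but the text shouldn't promise exactly 24. One option is to have readers count what was actually produced, e.g. with the notebook's SQL endpoint (a small sketch; `druid_host` value is illustrative):

```python
import requests

druid_host = "http://localhost:8888"

# Count the segments actually produced instead of assuming exactly 24.
payload = {
    "query": "SELECT COUNT(*) AS num_segments FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
    "resultFormat": "objectLines",
}
response = requests.post(druid_host + "/druid/v2/sql", json=payload)
print(response.text)  # e.g. {"num_segments":24}
```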
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "71bdcc40",
+ "metadata": {},
+ "source": [
+ "# Learn to delete data with Druid API\n",
+ "\n",
+ "<!--\n",
+ " ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+ " ~ or more contributor license agreements. See the NOTICE file\n",
+ " ~ distributed with this work for additional information\n",
+ " ~ regarding copyright ownership. The ASF licenses this file\n",
+ " ~ to you under the Apache License, Version 2.0 (the\n",
+ " ~ \"License\"); you may not use this file except in compliance\n",
+ " ~ with the License. You may obtain a copy of the License at\n",
+ " ~\n",
+ " ~ http://www.apache.org/licenses/LICENSE-2.0\n",
+ " ~\n",
+ " ~ Unless required by applicable law or agreed to in writing,\n",
+ " ~ software distributed under the License is distributed on an\n",
+ " ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+ " ~ KIND, either express or implied. See the License for the\n",
+ " ~ specific language governing permissions and limitations\n",
+ " ~ under the License.\n",
+ " -->\n",
+ "\n",
+ "In working with data, Druid retains a copies of the existing data
segments in deep storage and Historical processes. As new data is added into
Druid, deep storage grows and becomes larger over time unless explicitly
removed.\n",
+ "\n",
+ "While deep storage is an important part of Druid's elastic,
fault-tolerant design, over time, data accumulation in deep storage can lead to
increased storage costs. Periodically deleting data can reclaim storage space
and promote optimal resource allocation.\n",
+ "\n",
+ "This notebook provides a tutorial on deleting existing data in Druid
using the Coordinator API endpoints. \n",
+ "\n",
+ "## Table of contents\n",
+ "\n",
+ "- [Prerequisites](#Prerequisites)\n",
+ "- [Ingest data](#Ingest-data)\n",
+ "- [Deletion steps](#Deletion-steps)\n",
+ "- [Delete by time interval](#Delete-by-time-interval)\n",
+ "- [Delete entire table](#Delete-entire-table)\n",
+ "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+ "\n",
+ "For the best experience, use JupyterLab so that you can always access the
table of contents."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6fc260fc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "## Prerequisites\n",
+ "\n",
+ "This tutorial works with Druid 26.0.0 or later.\n",
+ "\n",
+ "\n",
+ "Launch this tutorial and all prerequisites using the `druid-jupyter`,
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter
Notebook
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+ "\n",
+ "If you do not use the Docker Compose environment, you need the
following:\n",
+ "\n",
+ "* A running Druid instance.<br>\n",
+ " Update the `druid_host` variable to point to your Router endpoint.
For example:\n",
+ " ```\n",
+ " druid_host = \"http://localhost:8888\"\n",
+ " ```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b8a7510",
+ "metadata": {},
+ "source": [
+ "To start this tutorial, run the next cell. It imports the Python packages
you'll need and defines a variable for the the Druid host, where the Router
service listens.\n",
+ "\n",
+ "`druid_host` is the hostname and port for your Druid deployment. In a
distributed environment, you can point to other Druid services. In this
tutorial, you'll use the Router service as the `druid_host`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ed52d809",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "import json\n",
+ "\n",
+ "# druid_host is the hostname and port for your Druid deployment. \n",
+ "# In the Docker Compose tutorial environment, this is the Router\n",
+ "# service running at \"http://router:8888\".\n",
+ "# If you are not using the Docker Compose environment, edit the
`druid_host`.\n",
+ "\n",
+ "druid_host = \"http://host.docker.internal:8888\"\n",
+ "druid_host"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f3c9a92",
+ "metadata": {},
+ "source": [
+ "Before we proceed with the tutorial, let's use the `/status/health`
endpoint to verify that the cluster if up and running. This endpoint returns
the Python value `true` if the Druid cluster has finished starting up and is
running. Do not move on from this point if the following call does not return
`true`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "18a8a495",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/status/health'\n",
+ "response = requests.request(\"GET\", endpoint)\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "19144be9",
+ "metadata": {},
+ "source": [
+ "In the rest of this tutorial, the `endpoint` and other variables are
updated in code cells to call a different Druid endpoint to accomplish a task."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a281144",
+ "metadata": {},
+ "source": [
+ "## Ingest data\n",
+ "\n",
+ "Apache Druid stores data partitioned by time chunks into segments and
supports deleting data by dropping segments. Before dropping data, we will use
the quickstart Wikipedia data ingested with an indexing spec that creates
hourly segments.\n",
+ "\n",
+ "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "051655c9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02e4f551",
+ "metadata": {},
+ "source": [
+ "Next, construct a JSON payload with the ingestion specs to create a
`wikipedia_hour` datasource with hour segmentation. There are many different
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
to ingest data, this tutorial uses [native batch
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html)
and the `/druid/indexer/v1/task` endpoint. For more information on construction
an ingestion spec, see [ingestion spec
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9ff9d098",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"spec\": {\n",
+ " \"dataSchema\": {\n",
+ " \"dataSource\": \"wikipedia_hour\",\n",
+ " \"timestampSpec\": {\n",
+ " \"column\": \"time\",\n",
+ " \"format\": \"iso\"\n",
+ " },\n",
+ " \"dimensionsSpec\": {\n",
+ " \"useSchemaDiscovery\": True\n",
+ " },\n",
+ " \"metricsSpec\": [],\n",
+ " \"granularitySpec\": {\n",
+ " \"type\": \"uniform\",\n",
+ " \"segmentGranularity\": \"hour\",\n",
+ " \"queryGranularity\": \"none\",\n",
+ " \"intervals\": [\n",
+ " \"2015-09-12/2015-09-13\"\n",
+ " ],\n",
+ " \"rollup\": False\n",
+ " }\n",
+ " },\n",
+ " \"ioConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"inputSource\": {\n",
+ " \"type\": \"local\",\n",
+ " \"baseDir\": \"quickstart/tutorial/\",\n",
+ " \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n",
+ " },\n",
+ " \"inputFormat\": {\n",
+ " \"type\": \"json\"\n",
+ " },\n",
+ " \"appendToExisting\": False\n",
+ " },\n",
+ " \"tuningConfig\": {\n",
+ " \"type\": \"index_parallel\",\n",
+ " \"maxRowsPerSegment\": 5000000,\n",
+ " \"maxRowsInMemory\": 25000\n",
+ " }\n",
+ " }\n",
+ "})\n",
+ "\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cf78bb7",
+ "metadata": {},
+ "source": [
+ "With the payload and headers ready, run the next cell to send a `POST`
request to the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "543b03ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ " \n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cab33e7e",
+ "metadata": {},
+ "source": [
+ "Once the data has been ingested, Druid will be populated with segments
for each segment interval that contains data. Since the `wikipedia_hour` was
ingested with `HOUR` granularity, there will be 24 segments associated with
`wikipedia_hour`. \n",
+ "\n",
+ "For demonstration, let's view the segments generated for the
`wikipedia_hour` datasource before any deletion is made. Run the following cell
to set the endpoint to `/druid/v2/sql/`. For more information on this endpoint,
see [Druid SQL
API](https://druid.apache.org/docs/latest/querying/sql-api.html).\n",
+ "\n",
+ "Using this endpoint, you can query the `sys` [metadata
table](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html#system-schema)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "956abeee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "701550dd",
+ "metadata": {},
+ "source": [
+ "Now, you can query the metadata table to retrieve segment information.
The following cell sends a SQL query to retrieve `segment_id` information for
the `wikipedia_hour` datasource. This tutorial sets the `resultFormat` to
`objectLines`. This helps format the response with newlines and makes it easier
to parse the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bb54a6b7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ " \n",
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there
are 24 `segment_id`, each containing the datasource name `wikipedia_hour`,
along with the start and end hour interval. The tail end of the ID also
contains the timestamp of when the request was made. \n",
Review Comment:
```suggestion
"Observe the response retrieved from the previous cell. In total, there
are 24 `segment_id` records, each containing the datasource name
`wikipedia_hour`, along with the start and end hour interval. The tail end of
the ID also contains the timestamp of when the request was made. \n",
```
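As an aside for anyone verifying the count programmatically: a minimal sketch
(not part of the notebook) that parses the `objectLines` response and counts
the returned IDs, assuming the tutorial's Router endpoint and that
`wikipedia_hour` has already been ingested:

```python
import json
import requests

druid_host = "http://localhost:8888"  # Router endpoint, as in the tutorial

# Query sys.segments for wikipedia_hour, the same query the notebook sends.
payload = json.dumps({
    "query": "SELECT segment_id FROM sys.segments WHERE \"datasource\" = 'wikipedia_hour'",
    "resultFormat": "objectLines"
})
headers = {"Content-Type": "application/json"}
response = requests.post(druid_host + "/druid/v2/sql", headers=headers, data=payload)

# objectLines returns one JSON object per line; skip the trailing blank line.
segment_ids = [json.loads(line)["segment_id"]
               for line in response.text.splitlines() if line.strip()]

print(len(segment_ids))  # expect 24: one segment per hour of 2015-09-12
print(segment_ids[0])    # e.g. wikipedia_hour_2015-09-12T00:00:00.000Z_...
```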
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+ {
+ "cell_type": "markdown",
+ "id": "f06e24e5",
+ "metadata": {},
+ "source": [
+ "Observe the response retrieved from the previous cell. In total, there
are 24 `segment_id`, each containing the datasource name `wikipedia_hour`,
along with the start and end hour interval. The tail end of the ID also
contains the timestamp of when the request was made. \n",
+ "\n",
+ "For this tutorial, we are concerned with observing the start and end
interval for each `segment_id`. \n",
+ "\n",
+ "For example: \n",
+
"`{\"segment_id\":\"wikipedia_hour_2015-09-12T00:00:00.000Z_2015-09-12T01:00:00.000Z_2023-08-07T21:36:29.244Z\"}`
indicates this segment contains data from `2015-09-12T00:00:00.000Z` to
`2015-09-12T01:00:00.000Z`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ca79f5f9",
+ "metadata": {},
+ "source": [
+ "## Deletion steps"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b6cd1c8c",
+ "metadata": {},
+ "source": [
+ "Permanent deletion of a segment in Apache Druid has two steps:\n",
+ "\n",
+ "1. A segment is marked as \"unused.\" This step occurs when a segment is
dropped by a [drop
rule](https://druid.apache.org/docs/latest/operations/rule-configuration.html#set-retention-rules)
or manually marked as \"unused\" through the Coordinator API or web console.
Note that marking a segment as \"unused\" is a soft delete, it is no longer
available for querying but the segment files remain in deep storage and segment
records remain in the metadata store. \n",
+ "2. A kill task is sent to permanently remove \"unused\" segments. This
deletes the segment file from deep storage and removes its record from the
metadata store. This is a hard delete: the data is unrecoverable unless you
have a backup."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9bc7f00",
+ "metadata": {},
+ "source": [
+ "## Delete by time interval"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1040bdaf",
+ "metadata": {},
+ "source": [
+ "Segments can be deleted in a specified time interval. This begins with
marking all segments in the interval as \"unused\", then sending a kill request
to delete it permanently from deep storage.\n",
+ "\n",
+ "First, set the endpoint variable to the Coordinator API endpoint
`/druid/coordinator/v1/datasources/:dataSource/markUnused`. Since the
datasource ingested is `wikipedia_hour`, let's specify that in the endpoint."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9db8786d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host +
'/druid/coordinator/v1/datasources/wikipedia_hour/markUnused'\n",
+ "endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "863576a9",
+ "metadata": {},
+ "source": [
+ "The following cell constructs a JSON payload with the interval of
segments to be deleted. This will mark the intervals from `18:00:00.000` to
`20:00:00.000` non-inclusive as \"unused.\" This payload is sent to the
endpoint in a `POST` request."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "79387e72",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "payload = json.dumps({\n",
+ " \"interval\": \"2015-09-12T18:00:00.000Z/2015-09-12T20:00:00.000Z\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "89e2fcb4",
+ "metadata": {},
+ "source": [
+ "The response from the above cell should return a JSON object with the
property `\"numChangedSegments\"` and the value `2`. This refers to the
following segments:\n",
+ "\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T18:00:00.000Z_2015-09-12T19:00:00.000Z_2023-08-07T21:36:29.244Z\"}`\n",
+ "*
`{\"segment_id\":\"wikipedia_hour_2015-09-12T19:00:00.000Z_2015-09-12T20:00:00.000Z_2023-08-07T21:36:29.244Z\"}`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e61cae23",
+ "metadata": {},
+ "source": [
+ "Next, verify that the segments have been soft deleted. The following cell
sets the endpoint variable to `/druid/v2/sql` and sends a `POST` request
querying for the existing `segment_id`s. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ea7c0d26",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "endpoint = druid_host + '/druid/v2/sql'\n",
+ "payload = json.dumps({\n",
+ " \"query\": \"SELECT segment_id FROM sys.segments WHERE
\\\"datasource\\\" = 'wikipedia_hour'\",\n",
+ " \"resultFormat\": \"objectLines\"\n",
+ "})\n",
+ "headers = {\n",
+ " 'Content-Type': 'application/json'\n",
+ "}\n",
+ "\n",
+ "response = requests.request(\"POST\", endpoint, headers=headers,
data=payload)\n",
+ "\n",
+ "print(response.text)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "747bd12c",
+ "metadata": {},
+ "source": [
+ "Observe the response above. There should now be only 22 segments, and the
\"unused\" segments have been soft deleted. \n",
+ "\n",
+ "However, as you've only soft deleted the segments, it remains in deep
storage.\n",
+ "\n",
+ "Before permanently deleting the segments, let's observe how this can
change in deep storage. This step is optional, you can move onto the next set
of cells without completing this step."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "943b36cc",
+ "metadata": {},
+ "source": [
+ "[OPTIONAL] If you are running Druid externally from the Docker Compose
environment, follow these instructions to retrieve segments from deep
storage:\n",
Review Comment:
We're not really retrieving segments from deep storage here. We're just
ls'ing the filesystem where the segments are stored.
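For readers who want to see that for themselves, a rough sketch of the
filesystem listing described above. The base path is an assumption taken from
the quickstart default `druid.storage.storageDirectory=var/druid/segments`
(relative to the Druid installation) and only applies to local deep storage;
adjust it for your deployment:

```python
import os

# Walk the local deep storage directory and print the segment files for the
# wikipedia_hour datasource. Segments marked "unused" still appear here;
# only a kill task removes the files.
deep_storage = "var/druid/segments/wikipedia_hour"  # assumed quickstart path

for root, _dirs, files in os.walk(deep_storage):
    for name in files:
        print(os.path.join(root, name))
```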
##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+ {
+ "cell_type": "markdown",
+ "id": "747bd12c",
+ "metadata": {},
+ "source": [
+ "Observe the response above. There should now be only 22 segments, and the
\"unused\" segments have been soft deleted. \n",
+ "\n",
+ "However, as you've only soft deleted the segments, it remains in deep
storage.\n",
+ "\n",
+ "Before permanently deleting the segments, let's observe how this can
change in deep storage. This step is optional, you can move onto the next set
of cells without completing this step."
Review Comment:
```suggestion
"Before permanently deleting the segments, you can verify that they've
only been soft deleted by inspecting your deep storage. The soft deleted
segments are still there. This step is optional, you can move onto the next set
of cells without completing this step."
```
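For completeness, the hard-delete step that follows the soft delete: a sketch
of the kill call against the Coordinator API, routed through the same
`druid_host` as the tutorial. Per the API docs, the interval's `/` separator
is replaced with `_` in the URL path:

```python
import requests

druid_host = "http://localhost:8888"  # Router endpoint, as in the tutorial

# Permanently delete (kill) the "unused" segments in the interval marked
# unused above. This is a hard delete: the segment files are removed from
# deep storage and their records from the metadata store.
interval = "2015-09-12T18:00:00.000Z_2015-09-12T20:00:00.000Z"
endpoint = (druid_host +
            "/druid/coordinator/v1/datasources/wikipedia_hour/intervals/" +
            interval)

response = requests.delete(endpoint)
print(response.status_code)  # 200 means the kill task was submitted
```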
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]