Re: [PR] Jupyter notebook tutorial - Delete API (druid)

via GitHub Tue, 08 Aug 2023 16:20:14 -0700


317brian commented on code in PR #14781:
URL: https://github.com/apache/druid/pull/14781#discussion_r1287770717



##########
examples/quickstart/jupyter-notebooks/notebooks/04-api/01-delete-api-tutorial.ipynb:
##########
@@ -0,0 +1,938 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "71bdcc40",
+   "metadata": {},
+   "source": [
+    "# Learn to delete data with Druid API\n",
+    "\n",
+    "<!--\n",
+    "  ~ Licensed to the Apache Software Foundation (ASF) under one\n",
+    "  ~ or more contributor license agreements.  See the NOTICE file\n",
+    "  ~ distributed with this work for additional information\n",
+    "  ~ regarding copyright ownership.  The ASF licenses this file\n",
+    "  ~ to you under the Apache License, Version 2.0 (the\n",
+    "  ~ \"License\"); you may not use this file except in compliance\n",
+    "  ~ with the License.  You may obtain a copy of the License at\n",
+    "  ~\n",
+    "  ~   http://www.apache.org/licenses/LICENSE-2.0\n";,
+    "  ~\n",
+    "  ~ Unless required by applicable law or agreed to in writing,\n",
+    "  ~ software distributed under the License is distributed on an\n",
+    "  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
+    "  ~ KIND, either express or implied.  See the License for the\n",
+    "  ~ specific language governing permissions and limitations\n",
+    "  ~ under the License.\n",
+    "  -->\n",
+    "\n",
+    "In working with data, Druid retains a copies of the existing data 
segments in deep storage and Historical processes. As new data is added into 
Druid, deep storage grows and becomes larger over time unless explicitly 
removed.\n",
+    "\n",
+    "While deep storage is an important part of Druid's elastic, 
fault-tolerant design, over time, data accumulation in deep storage can lead to 
increased storage costs. Periodically deleting data can reclaim storage space 
and promote optimal resource allocation.\n",
+    "\n",
+    "This notebook provides a tutorial on deleting existing data in Druid 
using the Coordinator API endpoints. \n",
+    "\n",
+    "## Table of contents\n",
+    "\n",
+    "- [Prerequisites](#Prerequisites)\n",
+    "- [Ingest data](#Ingest-data)\n",
+    "- [Deletion steps](#Deletion-steps)\n",
+    "- [Delete by time interval](#Delete-by-time-interval)\n",
+    "- [Delete entire table](#Delete-entire-table)\n",
+    "- [Delete by segment ID](#Delete-by-segment-ID)\n",
+    "\n",
+    "For the best experience, use JupyterLab so that you can always access the 
table of contents."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6fc260fc",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "This tutorial works with Druid 26.0.0 or later.\n",
+    "\n",
+    "\n",
+    "Launch this tutorial and all prerequisites using the `druid-jupyter`, 
`kafka-jupyter`, or `all-services` profiles of the Docker Compose file for 
Jupyter-based Druid tutorials. For more information, see [Docker for Jupyter 
Notebook 
tutorials](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-docker.html).\n",
+    "\n",
+    "If you do not use the Docker Compose environment, you need the 
following:\n",
+    "\n",
+    "* A running Druid instance.<br>\n",
+    "     Update the `druid_host` variable to point to your Router endpoint. 
For example:\n",
+    "     ```\n",
+    "     druid_host = \"http://localhost:8888\"\n";,
+    "     ```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b8a7510",
+   "metadata": {},
+   "source": [
+    "To start this tutorial, run the next cell. It imports the Python packages 
you'll need and defines a variable for the the Druid host, where the Router 
service listens.\n",
+    "\n",
+    "`druid_host` is the hostname and port for your Druid deployment. In a 
distributed environment, you can point to other Druid services. In this 
tutorial, you'll use the Router service as the `druid_host`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ed52d809",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import json\n",
+    "\n",
+    "# druid_host is the hostname and port for your Druid deployment. \n",
+    "# In the Docker Compose tutorial environment, this is the Router\n",
+    "# service running at \"http://router:8888\".\n";,
+    "# If you are not using the Docker Compose environment, edit the 
`druid_host`.\n",
+    "\n",
+    "druid_host = \"http://host.docker.internal:8888\"\n";,
+    "druid_host"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f3c9a92",
+   "metadata": {},
+   "source": [
+    "Before we proceed with the tutorial, let's use the `/status/health` 
endpoint to verify that the cluster if up and running. This endpoint returns 
the Python value `true` if the Druid cluster has finished starting up and is 
running. Do not move on from this point if the following call does not return 
`true`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "18a8a495",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + '/status/health'\n",
+    "response = requests.request(\"GET\", endpoint)\n",
+    "print(response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19144be9",
+   "metadata": {},
+   "source": [
+    "In the rest of this tutorial, the `endpoint` and other variables are 
updated in code cells to call a different Druid endpoint to accomplish a task."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a281144",
+   "metadata": {},
+   "source": [
+    "## Ingest data\n",
+    "\n",
+    "Apache Druid stores data partitioned by time chunks into segments and 
supports deleting data by dropping segments. Before dropping data, we will use 
the quickstart Wikipedia data ingested with an indexing spec that creates 
hourly segments.\n",
+    "\n",
+    "The following cell sets `endpoint` to `/druid/indexer/v1/task`. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "051655c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "endpoint = druid_host + '/druid/indexer/v1/task'\n",
+    "endpoint"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02e4f551",
+   "metadata": {},
+   "source": [
+    "Next, construct a JSON payload with the ingestion specs to create a 
`wikipedia_hour` datasource with hour segmentation. There are many different 
[methods](https://druid.apache.org/docs/latest/ingestion/index.html#ingestion-methods)
 to ingest data, this tutorial uses [native batch 
ingestion](https://druid.apache.org/docs/latest/ingestion/native-batch.html) 
and the `/druid/indexer/v1/task` endpoint. For more information on construction 
an ingestion spec, see [ingestion spec 
reference](https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9ff9d098",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "payload = json.dumps({\n",
+    "  \"type\": \"index_parallel\",\n",
+    "  \"spec\": {\n",
+    "    \"dataSchema\": {\n",
+    "      \"dataSource\": \"wikipedia_hour\",\n",
+    "      \"timestampSpec\": {\n",
+    "        \"column\": \"time\",\n",
+    "        \"format\": \"iso\"\n",
+    "      },\n",
+    "      \"dimensionsSpec\": {\n",
+    "        \"useSchemaDiscovery\": True\n",
+    "      },\n",
+    "      \"metricsSpec\": [],\n",
+    "      \"granularitySpec\": {\n",
+    "        \"type\": \"uniform\",\n",
+    "        \"segmentGranularity\": \"hour\",\n",
+    "        \"queryGranularity\": \"none\",\n",
+    "        \"intervals\": [\n",
+    "          \"2015-09-12/2015-09-13\"\n",
+    "        ],\n",
+    "        \"rollup\": False\n",
+    "      }\n",
+    "    },\n",
+    "    \"ioConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"inputSource\": {\n",
+    "        \"type\": \"local\",\n",
+    "        \"baseDir\": \"quickstart/tutorial/\",\n",
+    "        \"filter\": \"wikiticker-2015-09-12-sampled.json.gz\"\n",
+    "      },\n",
+    "      \"inputFormat\": {\n",
+    "        \"type\": \"json\"\n",
+    "      },\n",
+    "      \"appendToExisting\": False\n",
+    "    },\n",
+    "    \"tuningConfig\": {\n",
+    "      \"type\": \"index_parallel\",\n",
+    "      \"maxRowsPerSegment\": 5000000,\n",
+    "      \"maxRowsInMemory\": 25000\n",
+    "    }\n",
+    "  }\n",
+    "})\n",
+    "\n",
+    "headers = {\n",
+    "  'Content-Type': 'application/json'\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1cf78bb7",
+   "metadata": {},
+   "source": [
+    "With the payload and headers ready, run the next cell to send a `POST` 
request to the endpoint."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "543b03ee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response = requests.request(\"POST\", endpoint, headers=headers, 
data=payload)\n",
+    "                            \n",
+    "print(response.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cab33e7e",
+   "metadata": {},
+   "source": [
+    "Once the data has been ingested, Druid will be populated with segments 
for each segment interval that contains data. Since the `wikipedia_hour` was 
ingested with `HOUR` granularity, there will be 24 segments associated with 
`wikipedia_hour`. \n",

Review Comment:
   This isn't necessarily always true. If I set the max rows per segment to 1, 
you'd get way more than 24 total segments. So the total number of segmetns is 
related to the granularity but also the max rows per segment.
   
   So just say something about how they should see 24 segments. 
   
   (The fact that there are 24 segments and 24 hours a day and we partitioned 
by 24 is a coincidence)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Jupyter notebook tutorial - Delete API (druid)

Reply via email to