ektravel commented on code in PR #16081:
URL: https://github.com/apache/druid/pull/16081#discussion_r1518721079
##########
docs/tutorials/tutorial-update-data.md:
##########
@@ -23,161 +23,236 @@ sidebar_label: Update existing data
~ under the License.
-->
+:::info
-This tutorial shows you how to update data in a datasource by overwriting
existing data and adding new data to the datasource.
+Apache Druid supports two query languages: [Druid SQL](../querying/sql.md) and [native queries](../querying/querying.md).
+This document describes the SQL language.
+
+:::
+
+Apache Druid stores data and indexes in [segment files](../design/segments.md) partitioned by time.
+Once segments are created, they cannot be modified.
+To update existing data, you must rebuild and republish the segments.
+Druid updates existing data by time range rather than by primary key. Data outside the replacement time range is not modified.
+
+This tutorial shows you how to use the Druid SQL [REPLACE](../multi-stage-query/reference.md#replace) statement with the OVERWRITE clause to update existing data.
+
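As a sketch of the general shape only (`my_table` and the interval bounds are placeholder names, not part of the tutorial), a time-range replacement looks like this:

```sql
-- Replace only the data for 2024-01-01; segments outside the
-- OVERWRITE WHERE interval are left untouched.
REPLACE INTO "my_table"
OVERWRITE WHERE "__time" >= TIMESTAMP '2024-01-01' AND "__time" < TIMESTAMP '2024-01-02'
SELECT ...
FROM ...
PARTITIONED BY DAY
```

`OVERWRITE ALL`, used in the first two examples below, is the special case that replaces every time range in the datasource.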
+The tutorial walks you through the following use cases:
+
+* [Overwrite all data](#overwrite-all-data)
+* [Overwrite a specific row](#overwrite-a-specific-row)
+* [Update a row using partial segment overshadowing](#update-a-row-using-partial-segment-overshadowing)
+
+All examples use the [multi-stage query (MSQ)](../multi-stage-query/index.md) task engine to execute SQL statements.
## Prerequisites
-Before starting this tutorial, download and run Apache Druid on your local
machine as described in
-the [single-machine quickstart](index.md).
+Before you follow the steps in this tutorial, download Druid as described in [Quickstart (local)](index.md) and have it running on your local machine. You don't need to load any data into the Druid cluster.
-You should also be familiar with the material in the following tutorials:
-* [Load a file](../tutorials/tutorial-batch.md)
-* [Query data](../tutorials/tutorial-query.md)
-* [Rollup](../tutorials/tutorial-rollup.md)
+You should be familiar with querying data in Druid. If you haven't already, go through the [Query data](../tutorials/tutorial-query.md) tutorial first.
## Load initial data
-Load an initial data set to which you will overwrite and append data.
+Load an initial dataset using the [REPLACE](../multi-stage-query/reference.md#replace) statement and the [EXTERN](../multi-stage-query/reference.md#extern-function) function.
+In Druid SQL, the REPLACE statement can create a new [datasource](../design/storage.md) or update an existing datasource.
+
+In the [web console](../operations/web-console.md), go to the **Query** view and run the following query:
+
+```sql
+REPLACE INTO "update_tutorial" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\":\"2024-01-01T07:01:35Z\",\"animal\":\"octopus\",\"number\":115}\n{\"timestamp\":\"2024-01-01T05:01:35Z\",\"animal\":\"mongoose\",\"number\":737}\n{\"timestamp\":\"2024-01-01T06:01:35Z\",\"animal\":\"snake\",\"number\":1234}\n{\"timestamp\":\"2024-01-01T01:01:35Z\",\"animal\":\"lion\",\"number\":300}\n{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"seahorse\",\"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"skunk\",\"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"iguana\",\"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"opossum\",\"number\":300}"}',
+ '{"type":"json"}'
+ )
+ ) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT)
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "animal",
+ "number"
+FROM "ext"
+PARTITIONED BY DAY
-The ingestion spec is located at
`quickstart/tutorial/updates-init-index.json`. This spec creates a datasource
called `updates-tutorial` and ingests data from
`quickstart/tutorial/updates-data.json`.
+```
-Submit the ingestion task:
+In the resulting `update_tutorial` datasource, individual rows are uniquely identified by `__time`, `animal`, and `number`.
+To view the results, open a new tab and run the following query:
-```bash
-bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url
http://localhost:8081
+```sql
+SELECT * FROM "update_tutorial"
```
-Start the SQL command-line client:
-```bash
-bin/dsql
+<details>
+<summary>View the results</summary>
+
+| `__time` | `animal` | `number`|
+| -- | -- | -- |
+| `2024-01-01T01:01:35.000Z`| `lion`| 300 |
+| `2024-01-01T05:01:35.000Z`| `mongoose`| 737 |
+| `2024-01-01T06:01:35.000Z`| `snake`| 1234 |
+| `2024-01-01T07:01:35.000Z`| `octopus`| 115 |
+| `2024-01-02T01:01:35.000Z`| `opossum`| 300 |
+| `2024-01-02T05:01:35.000Z`| `skunk`| 737 |
+| `2024-01-02T06:01:35.000Z`| `iguana`| 1234 |
+| `2024-01-02T07:01:35.000Z`| `seahorse`| 115 |
+
+</details>
+
+The results contain records for eight animals over two days.
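If you want to confirm how the rows are distributed over time, one quick check (a sketch, not part of the tutorial, run against the `update_tutorial` datasource created above) is to count rows per day:

```sql
-- TIME_FLOOR truncates each timestamp to the start of its day,
-- so this groups the eight rows into daily buckets.
SELECT
  TIME_FLOOR("__time", 'P1D') AS "day",
  COUNT(*) AS "rows"
FROM "update_tutorial"
GROUP BY 1
```

For the initial dataset, this should return two rows, one per day, each with a count of 4.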
+
+## Overwrite all data
+
+You can use the REPLACE statement with OVERWRITE ALL to replace the entire datasource with new data and drop the old data.
+
+In the web console, open a new tab and run the following query to overwrite timestamp data for the entire `update_tutorial` datasource:
+
+```sql
+REPLACE INTO "update_tutorial" OVERWRITE ALL
+WITH "ext" AS (SELECT *
+FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"octopus\",\"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"mongoose\",\"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"snake\",\"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"lion\",\"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\",\"number\":115}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"skunk\",\"number\":737}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\",\"number\":1234}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"opossum\",\"number\":300}"}',
+ '{"type":"json"}'
+ )
+) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT))
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "animal",
+ "number"
+FROM "ext"
+PARTITIONED BY DAY
```
-Run the following SQL query to retrieve data from `updates-tutorial`:
-
-```bash
-dsql> SELECT * FROM "updates-tutorial";
-┌──────────────────────────┬──────────┬───────┬────────┐
-│ __time │ animal │ count │ number │
-├──────────────────────────┼──────────┼───────┼────────┤
-│ 2018-01-01T01:01:00.000Z │ tiger │ 1 │ 100 │
-│ 2018-01-01T03:01:00.000Z │ aardvark │ 1 │ 42 │
-│ 2018-01-01T03:01:00.000Z │ giraffe │ 1 │ 14124 │
-└──────────────────────────┴──────────┴───────┴────────┘
-Retrieved 3 rows in 1.42s.
+<details>
+<summary>View the results</summary>
+
+| `__time` | `animal` | `number`|
+| -- | -- | -- |
+| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
+| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
+| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
+| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
+| `2024-01-03T01:01:35.000Z`| `opossum`| 300 |
+| `2024-01-03T05:01:35.000Z`| `skunk`| 737 |
+| `2024-01-03T06:01:35.000Z`| `iguana`| 1234 |
+| `2024-01-03T07:01:35.000Z`| `seahorse`| 115 |
+
+</details>
+
+Note that the values in the `__time` column have shifted one day later.
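One way to verify the shift without scanning every row (a sketch, not part of the tutorial) is to inspect the time bounds of the datasource:

```sql
-- After OVERWRITE ALL, the earliest and latest timestamps
-- should both have moved one day later.
SELECT
  MIN("__time") AS "earliest",
  MAX("__time") AS "latest"
FROM "update_tutorial"
```

After the overwrite, `earliest` should fall on 2024-01-02 and `latest` on 2024-01-03.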
+
+## Overwrite a specific row
Review Comment:
Maybe "Overwrite records for a specific time range"?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]