ektravel commented on code in PR #16081:
URL: https://github.com/apache/druid/pull/16081#discussion_r1518720607


##########
docs/tutorials/tutorial-update-data.md:
##########
@@ -23,161 +23,236 @@ sidebar_label: Update existing data
   ~ under the License.
   -->
 
+:::info
 
-This tutorial shows you how to update data in a datasource by overwriting 
existing data and adding new data to the datasource.
+Apache Druid supports two query languages: [Druid SQL](../querying/sql.md) and 
[native queries](../querying/querying.md).
+This document describes the SQL language.
+
+:::
+
+Apache Druid stores data and indexes in [segment files](../design/segments.md) 
partitioned by time.
+Once segments are created, they cannot be modified.
+To update existing data, you must rebuild and republish the segments.
+Druid updates existing data using time ranges, not primary key. Data outside 
the replacement time range is not touched.
+
+This tutorial shows you how to use the Druid SQL 
[REPLACE](../multi-stage-query/reference.md#replace) function with the 
OVERWRITE clause to update existing data.
+
+The tutorial walks you through the following use cases:
+
+* [Overwrite all data](#overwrite-all-data)
+* [Overwrite a specific row](#overwrite-a-specific-row)
+* [Update a row using partial segment 
overshadowing](#update-a-row-using-partial-segment-overshadowing)
+
+All examples use the [multi-stage query (MSQ)](../multi-stage-query/index.md) 
task engine to executes SQL statements.
 
 ## Prerequisites
 
-Before starting this tutorial, download and run Apache Druid on your local 
machine as described in
-the [single-machine quickstart](index.md).
+Before you follow the steps in this tutorial, download Druid as described in 
[Quickstart (local)](index.md) and have it running on your local machine. You 
don't need to load any data into the Druid cluster.
 
-You should also be familiar with the material in the following tutorials:
-* [Load a file](../tutorials/tutorial-batch.md)
-* [Query data](../tutorials/tutorial-query.md)
-* [Rollup](../tutorials/tutorial-rollup.md)
+You should be familiar with data querying in Druid. If you haven't already, go 
through the [Query data](../tutorials/tutorial-query.md) tutorial first.
 
 ## Load initial data
 
-Load an initial data set to which you will overwrite and append data.
+Load an initial dataset using 
[REPLACE](../multi-stage-query/reference.md#replace) and 
[EXTERN](../multi-stage-query/reference.md#extern-function) functions.
+In Druid SQL, the REPLACE function can create a new 
[datasource](../design/storage.md) or update an existing datasource.
+
+In the [web console](../operations/web-console.md), go to the **Query** view 
and run the following query:
+
+```sql
+REPLACE INTO "update_tutorial" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+     
'{"type":"inline","data":"{\"timestamp\":\"2024-01-01T07:01:35Z\",\"animal\":\"octopus\",
 
\"number\":115}\n{\"timestamp\":\"2024-01-01T05:01:35Z\",\"animal\":\"mongoose\",
 \"number\":737}\n{\"timestamp\":\"2024-01-01T06:01:35Z\",\"animal\":\"snake\", 
\"number\":1234}\n{\"timestamp\":\"2024-01-01T01:01:35Z\",\"animal\":\"lion\", 
\"number\":300}\n{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"seahorse\",
 \"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"skunk\", 
\"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"iguana\", 
\"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"opossum\",
 \"number\":300}"}',
+     '{"type":"json"}'
+    )
+  ) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT)
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "animal",
+  "number"
+FROM "ext"
+PARTITIONED BY DAY
 
-The ingestion spec is located at 
`quickstart/tutorial/updates-init-index.json`. This spec creates a datasource 
called `updates-tutorial` and ingests data from 
`quickstart/tutorial/updates-data.json`.
+```
 
-Submit the ingestion task:
+In the resulting `update_tutorial` datasource, individual rows are uniquely 
identified by `__time`, `animal`, and `number`.
+To view the results, open a new tab and run the following query:
 
-```bash
-bin/post-index-task --file quickstart/tutorial/updates-init-index.json --url 
http://localhost:8081
+```sql
+SELECT * FROM "update_tutorial"
 ```
 
-Start the SQL command-line client:
-```bash
-bin/dsql
+<details>
+<summary> View the results</summary>
+
+| `__time` | `animal` | `number`|
+| -- | -- | -- |
+| `2024-01-01T01:01:35.000Z`| `lion`| 300 |
+| `2024-01-01T05:01:35.000Z`| `mongoose`| 737 |
+| `2024-01-01T06:01:35.000Z`| `snake`| 1234 |
+| `2024-01-01T07:01:35.000Z`| `octopus`| 115 |
+| `2024-01-02T01:01:35.000Z`| `opossum`| 300 |
+| `2024-01-02T05:01:35.000Z`| `skunk`| 737 |
+| `2024-01-02T06:01:35.000Z`| `iguana`| 1234 |
+| `2024-01-02T07:01:35.000Z`| `seahorse`| 115 |
+
+</details>
+
+The results contain records for eight animals over two days.
+
+## Overwrite all data
+
+You can use the REPLACE function with OVERWRITE ALL to replace the entire 
datasource with new data while dropping the old data.
+
+In the web console, open a new tab and run the following query to overwrite 
timestamp data for the entire `update_tutorial` datasource:
+
+```sql
+REPLACE INTO "update_tutorial" OVERWRITE ALL
+WITH "ext" AS (SELECT *
+FROM TABLE(
+  EXTERN(
+    
'{"type":"inline","data":"{\"timestamp\":\"2024-01-02T07:01:35Z\",\"animal\":\"octopus\",
 
\"number\":115}\n{\"timestamp\":\"2024-01-02T05:01:35Z\",\"animal\":\"mongoose\",
 \"number\":737}\n{\"timestamp\":\"2024-01-02T06:01:35Z\",\"animal\":\"snake\", 
\"number\":1234}\n{\"timestamp\":\"2024-01-02T01:01:35Z\",\"animal\":\"lion\", 
\"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\",
 \"number\":115}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"skunk\", 
\"number\":737}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", 
\"number\":1234}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"opossum\",
 \"number\":300}"}',
+    '{"type":"json"}'
+  )
+) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT))
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "animal",
+  "number"
+FROM "ext"
+PARTITIONED BY DAY
 ```
 
-Run the following SQL query to retrieve data from `updates-tutorial`:
-
-```bash
-dsql> SELECT * FROM "updates-tutorial";
-┌──────────────────────────┬──────────┬───────┬────────┐
-│ __time                   │ animal   │ count │ number │
-├──────────────────────────┼──────────┼───────┼────────┤
-│ 2018-01-01T01:01:00.000Z │ tiger    │     1 │    100 │
-│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │     42 │
-│ 2018-01-01T03:01:00.000Z │ giraffe  │     1 │  14124 │
-└──────────────────────────┴──────────┴───────┴────────┘
-Retrieved 3 rows in 1.42s.
+<details>
+<summary> View the results</summary>
+
+| `__time` | `animal` | `number`|
+| -- | -- | -- |
+| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
+| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
+| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
+| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
+| `2024-01-03T01:01:35.000Z`| `opossum`| 300 |
+| `2024-01-03T05:01:35.000Z`| `skunk`| 737 |
+| `2024-01-03T06:01:35.000Z`| `iguana`| 1234 |
+| `2024-01-03T07:01:35.000Z`| `seahorse`| 115 |
+
+</details>
+
+Note that the values in the `__time` column have changed to one day later.
+
+## Overwrite a specific row
+
+You can use the REPLACE function to overwrite a specific time range of a 
datasource. When you overwrite a specific time range, that time range must 
align with the granularity specified in the PARTITIONED BY clause.
+
+In the web console, open a new tab and run the following query to insert a new 
row and update specific rows. Note that the OVERWRITE WHERE clause tells the 
query to only update records for the date 2024-01-03.
+
+```sql
+REPLACE INTO "update_tutorial" 
+  OVERWRITE WHERE "__time" >= TIMESTAMP'2024-01-03 00:00:00' AND "__time" < 
TIMESTAMP'2024-01-04 00:00:00'
+WITH "ext" AS (SELECT *
+FROM TABLE(
+  EXTERN(
+    
'{"type":"inline","data":"{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"tiger\",
 
\"number\":300}\n{\"timestamp\":\"2024-01-03T07:01:35Z\",\"animal\":\"seahorse\",
 
\"number\":500}\n{\"timestamp\":\"2024-01-03T05:01:35Z\",\"animal\":\"polecat\",
 
\"number\":626}\n{\"timestamp\":\"2024-01-03T06:01:35Z\",\"animal\":\"iguana\", 
\"number\":300}\n{\"timestamp\":\"2024-01-03T01:01:35Z\",\"animal\":\"flamingo\",
 \"number\":999}"}',
+    '{"type":"json"}'
+  )
+) EXTEND ("timestamp" VARCHAR, "animal" VARCHAR, "number" BIGINT))
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "animal",
+  "number"
+FROM "ext"
+PARTITIONED BY DAY
 ```
 
-The datasource contains three rows of data with an `animal` dimension and a 
`number` metric.
+<details>
+<summary> View the results</summary>
 
-## Overwrite data
+| `__time` | `animal` | `number`|
+| -- | -- | -- |
+| `2024-01-02T01:01:35.000Z`| `lion`| 300 |
+| `2024-01-02T05:01:35.000Z`| `mongoose`| 737 |
+| `2024-01-02T06:01:35.000Z`| `snake`| 1234 |
+| `2024-01-02T07:01:35.000Z`| `octopus`| 115 |
+| `2024-01-03T01:01:35.000Z`| `flamingo`| 999 |
+| `2024-01-03T01:01:35.000Z`| `tiger`| 300 |
+| `2024-01-03T05:01:35.000Z`| `polecat`| 626 |
+| `2024-01-03T06:01:35.000Z`| `iguana`| 300 |
+| `2024-01-03T07:01:35.000Z`| `seahorse`| 500 |
 
-To overwrite the data, submit another task for the same interval but with 
different input data.
+</details>
 
-The `quickstart/tutorial/updates-overwrite-index.json` spec performs an 
overwrite on the `updates-tutorial` datasource.
+Note the changes in the resulting datasource:
 
-In the overwrite ingestion spec, notice the following:
-* The `intervals` field remains the same: `"intervals" : 
["2018-01-01/2018-01-03"]`
-* New data is loaded from the local file, 
`quickstart/tutorial/updates-data2.json`
-* `appendToExisting` is set to `false`, indicating an overwrite task
+* There is now a new row called `flamingo`.
+* The `opossum` row has the value `tiger`.
+* The `skunk` row has the value `polecat`.
+* The `iguana` and `seahorse` rows have different numbers.
 
-Submit the ingestion task to overwrite the data:
+## Update a row using partial segment overshadowing
 
-```bash
-bin/post-index-task --file quickstart/tutorial/updates-overwrite-index.json 
--url http://localhost:8081
-```
+In Druid, you can overlay older data with newer data for the entire segment or 
portions of the segment within a particular partition.
+This capability is commonly referred to as 
[overshadowing](../ingestion/tasks.md#overshadowing-between-segments).
 
-When Druid finishes loading the new segment from this overwrite task, run the 
SELECT query again.
-In the new results, the `tiger` row now has the value `lion`, the `aardvark` 
row has a different number, and the `giraffe` row has been replaced with a 
`bear` row.
-
-```bash
-dsql> SELECT * FROM "updates-tutorial";
-┌──────────────────────────┬──────────┬───────┬────────┐
-│ __time                   │ animal   │ count │ number │
-├──────────────────────────┼──────────┼───────┼────────┤
-│ 2018-01-01T01:01:00.000Z │ lion     │     1 │    100 │
-│ 2018-01-01T03:01:00.000Z │ aardvark │     1 │   9999 │
-│ 2018-01-01T04:01:00.000Z │ bear     │     1 │    111 │
-└──────────────────────────┴──────────┴───────┴────────┘
-Retrieved 3 rows in 0.02s.
-```
+:::info
 
-## Combine existing data with new data and overwrite
+The overshadow relation holds only for the same time chunk and the same 
datasource.

Review Comment:
   I'll remove it since [Overshadowing between 
segments](https://druid.apache.org/docs/latest/ingestion/tasks#overshadowing-between-segments)
 covers it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to