gianm commented on issue #11929:
URL: https://github.com/apache/druid/issues/11929#issuecomment-1023654380


   PR #12163 is adding the PARTITON BY / CLUSTER BY syntax. That's great. I'd 
like to propose some additional syntax for controlling replace vs append, like 
what we do with `intervals` and `appendToExisting` in ingestion specs today.
   
   The idea is based on Spark's `INSERT OVERWRITE`, which is a similar concept: 
https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-overwrite-table.html.
 However, I suggest a few changes for our situation:
   
   1) REPLACE is already a keyword in our dialect, so use that instead of 
OVERWRITE to avoid adding a new keyword.
   2) We only support time partitioning, so include some nice nods to that in 
the syntax.
   3) We don't have "partition by" metadata on the table, so it'd have to be 
provided in the query.
   
   For example, this query would overwrite the entire datasource `tbl`:
   
   ```
   INSERT INTO tbl
   REPLACE ALL
     SELECT …
     PARTITION BY DAY
   ```
   
   This query would overwrite the day 2020-01-01:
   
   ```
   INSERT INTO tbl
   REPLACE PARTITION TIMESTAMP '2020-01-01 00:00:00' -- SQL literal timestamp
     SELECT …
     PARTITION BY DAY
   ```
   
   This query would overwrite the month 2020-01. Note that in this example, the 
partition interval spans multiple partitions. I think this is OK and will be 
useful for jobs that want to replace a bunch of time chunks all at once. I wish 
it would say PARTITIONS instead of PARTITION, but PARTITIONS isn't a keyword 
currently and I'd like to avoid adding a new one.
   
   ```
   INSERT INTO tbl
   REPLACE PARTITION '2020-01/P1M' -- ISO8601 interval
     SELECT …
     PARTITION BY DAY
   ```
   
   Validation conditions:
   
   1. With `REPLACE PARTITION TIMESTAMP '<sql timestamp>'`, the timestamp must 
be aligned to the provided `PARTITION BY` (it must be the starting timestamp of 
a time chunk).
   2. With `REPLACE PARTITION '<iso8601 interval>'`, the interval must be 
aligned to the provided `PARTITION BY`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to