[GitHub] [druid] cryptoe commented on a diff in pull request #12983: msq: add multi-stage-query docs

GitBox Mon, 29 Aug 2022 11:00:12 -0700


cryptoe commented on code in PR #12983:
URL: https://github.com/apache/druid/pull/12983#discussion_r957611479



##########
docs/multi-stage-query/index.md:
##########
@@ -0,0 +1,324 @@
+---
+id: index
+title: SQL-based ingestion overview and syntax
+sidebar_label: Overview and syntax
+description: Introduces multi-stage query architecture and its features
+---
+
+> SQL-based ingestion and the multi-stage query task engine are experimental 
features available starting in Druid 24.0. You can use it in place of the 
existing native batch and Hadoop based ingestion systems. As an experimental 
feature, functionality documented on this page is subject to change or removal 
in future releases. Review the release notes and this page to stay up to date 
on changes.
+
+SQL-based ingestion for Apache Druid uses a distributed multi-stage query 
architecture, which includes a query engine called the multi-stage query task 
engine (MSQ task engine). The MSQ task engine extends Druid's query 
capabilities, so you can write queries that reference [external 
data](#read-external-data) as well as perform ingestion with SQL 
[INSERT](#insert-data) and [REPLACE](#replace-data). Essentially, you can 
perform SQL-based ingestion instead of using JSON ingestion specs that Druid's 
native ingestion uses.
+
+The MSQ task engine excels at executing queries that can get bottlenecked at 
the Broker when using Druid's native SQL engine. When you submit queries, the 
MSQ task engine splits them into stages and automatically exchanges data 
between stages. Each stage is parallelized to run across multiple data servers 
at once, simplifying performance.
+
+
+## MSQ task engine features
+
+In its current state, the MSQ task engine enables you to do the following:
+
+- Read external data at query time using EXTERN.
+- Execute batch ingestion jobs by writing SQL queries using INSERT and 
REPLACE. You no longer need to generate a JSON-based ingestion spec.
+- Transform and rewrite existing tables using SQL.
+- Perform multi-dimension range partitioning reliably, which leads to more 
evenly distributed segment sizes and better performance.
+
+The MSQ task engine has additional features that can be used as part of a 
proof of concept or demo, but don't use or rely on the following features for 
any meaningful use cases, especially production ones:
+
+- Execute heavy-weight queries and return large numbers of rows.
+- Execute queries that exchange large amounts of data between servers, like 
exact count distinct of high-cardinality fields.
+
+
+## Load the extension
+
+For new clusters that use 24.0 or later, the multi-stage query extension is 
loaded by default. If you want to add the extension to an existing cluster, add 
the extension `druid-multi-stage-query` to `druid.extensions.loadlist` in your 
`common.runtime.properties` file.
+
+For more information about how to load an extension, see [Loading 
extensions](../development/extensions.md#loading-extensions).
+
+To use EXTERN, you need READ permission on the resource named "EXTERNAL" of 
the resource type "EXTERNAL". If you encounter a 403 error when trying to use 
EXTERN, verify that you have the correct permissions.
+
+## MSQ task engine query syntax
+
+You can submit queries to the MSQ task engine through the **Query** view in 
the Druid console or through the API. The Druid console is a good place to 
start because you can preview a query before you run it. You can also 
experiment with many of the [context 
parameters](./msq-reference.md#context-parameters) through the UI. Once you're 
comfortable with submitting queries through the Druid console, [explore using 
the API to submit a query](./msq-api.md#submit-a-query).
+
+If you encounter an issue after you submit a query, you can learn more about 
what an error means from the [limits](./msq-concepts.md#limits) and 
[errors](./msq-concepts.md#error-codes). 
+
+Queries for the MSQ task engine involve three primary functions:
+
+- EXTERN to query external data
+- INSERT INTO ... SELECT to insert data, such as data from an external source
+- REPLACE to replace existing datasources, partially or fully, with query 
results
+
+For information about the syntax for queries, see [SQL 
syntax](./msq-reference.md#sql-syntax).
+
+### Read external data
+
+Query tasks can access external data through the EXTERN function. When using 
EXTERN, keep in mind that  large files do not get split across different worker 
tasks. If you have fewer input files than worker tasks, you can increase query 
parallelism by splitting up your input files such that you have at least one 
input file per worker task.
+
+You can use the EXTERN function anywhere a table is expected in the following 
form: `TABLE(EXTERN(...))`. You can use external data with SELECT, INSERT, and 
REPLACE queries.
+
+The following query reads external data:
+
+```sql
+SELECT
+  *
+FROM TABLE(
+  EXTERN(
+    '{"type": "http", "uris": 
["https://static.imply.io/data/wikipedia.json.gz"]}',
+    '{"type": "json"}',
+    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": 
"string"}, {"name": "user", "type": "string"}]'
+  )
+)
+LIMIT 100
+``` 
+
+For more information about the syntax, see [EXTERN](./msq-reference.md#extern).
+
+### Insert data
+
+With the MSQ task engine, Druid can use the results of a query task to create 
a new datasource or to append to an existing datasource. Syntactically, there 
is no difference between the two. These operations use the INSERT INTO ... 
SELECT syntax.
+
+All SELECT capabilities are available for INSERT queries. However, the MSQ 
task engine does not include all the existing SQL query features of Druid. See 
[Known issues](./msq-known-issues.md) for a list of capabilities that aren't 
available.
+
+The following example query inserts data from an external source into a table 
named `w000` and partitions it by day:
+
+```sql
+INSERT INTO w000
+SELECT
+  TIME_PARSE("timestamp") AS __time,
+  "page",
+  "user"
+FROM TABLE(
+  EXTERN(
+    '{"type": "http", "uris": 
["https://static.imply.io/data/wikipedia.json.gz"]}',
+    '{"type": "json"}',
+    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": 
"string"}, {"name": "user", "type": "string"}]'
+  )
+)
+PARTITIONED BY DAY
+```
+
+For more information about the syntax, see [INSERT](./msq-reference.md#insert).
+
+### Replace data 
+
+The syntax for REPLACE is similar to INSERT. All SELECT functionality is 
available for REPLACE queries.
+Note that the MSQ task engine does not yet implement all native Druid query 
features.
+For details, see [Known issues](./msq-known-issues.md).
+
+When working with REPLACE queries, keep the following in mind:
+
+- The intervals generated as a result of the OVERWRITE WHERE query must align 
with the granularity specified in the PARTITIONED BY clause.
+- OVERWRITE WHERE queries only support the `__time` column.
+
+For more information about the syntax, see 
[REPLACE](./msq-reference.md#replace).
+
+The following examples show how to replace data in a table.
+
+#### REPLACE all data
+
+You can replace all the data in a table by using REPLACE INTO ... OVERWRITE 
ALL SELECT:
+
+```sql
+REPLACE INTO w000
+OVERWRITE ALL
+SELECT
+  TIME_PARSE("timestamp") AS __time,
+  "page",
+  "user"
+FROM TABLE(
+  EXTERN(
+    '{"type": "http", "uris": 
["https://static.imply.io/data/wikipedia.json.gz"]}',
+    '{"type": "json"}',
+    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": 
"string"}, {"name": "user", "type": "string"}]'
+  )
+)
+PARTITIONED BY DAY
+```
+
+#### REPLACE some data
+
+You can replace some of the data in a table by using REPLACE INTO ... 
OVERWRITE WHERE ... SELECT:
+
+```sql
+REPLACE INTO w000
+OVERWRITE WHERE __time >= TIMESTAMP '2019-08-25' AND __time < TIMESTAMP 
'2019-08-28'
+SELECT
+  TIME_PARSE("timestamp") AS __time,
+  "page",
+  "user"
+FROM TABLE(
+  EXTERN(
+    '{"type": "http", "uris": 
["https://static.imply.io/data/wikipedia.json.gz"]}',
+    '{"type": "json"}',
+    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": 
"string"}, {"name": "user", "type": "string"}]'
+  )
+)
+PARTITIONED BY DAY
+```
+
+## Adjust query behavior
+
+In addition to the basic functions, you can further modify your query behavior 
to control how your queries run or what your results look like. You can control 
how your queries behave by changing the following:
+
+### Primary timestamp
+
+Druid tables always include a primary timestamp named `__time`, so your 
ingestion query should generally include a column named `__time`. 
+
+The following formats are supported for `__time` in the source data:
+- ISO 8601 with 'T' separator, such as "2000-01-01T01:02:03.456"
+- Milliseconds since Unix epoch (00:00:00 UTC on January 1, 1970)
+
+The `__time` column is used for time-based partitioning, such as `PARTITIONED 
BY DAY`.
+
+If you use `PARTITIONED BY ALL` or `PARTITIONED BY ALL TIME`, time-based
+partitioning is disabled. In these cases, your ingestion query doesn't need
+to include a `__time` column. However, Druid still creates a `__time` column 
+in your Druid table and sets all timestamps to 1970-01-01 00:00:00.
+
+For more information, see [Primary 
timestamp](../ingestion/data-model.md#primary-timestamp).
+
+### PARTITIONED BY
+
+INSERT and REPLACE queries require the PARTITIONED BY clause, which determines 
how time-based partitioning is done. In Druid, data is split into segments, one 
or more per time chunk defined by the PARTITIONED BY granularity. A good 
general rule is to adjust the granularity so that each segment contains about 
five million rows. Choose a granularity based on your ingestion rate. For 
example, if you ingest a million rows per day, PARTITION BY DAY is good. If you 
ingest a million rows an hour, choose PARTITION BY HOUR instead.

Review Comment:
   Should we make 5 million bold ?



##########
docs/multi-stage-query/msq-known-issues.md:
##########
@@ -0,0 +1,102 @@
+---
+id: known-issues
+title: SQL-based ingestion known issues
+sidebar_label: Known issues
+---
+
+> SQL-based ingestion and the multi-stage query task engine are experimental 
features available starting in Druid 24.0. You can use it in place of the 
existing native batch and Hadoop based ingestion systems. As an experimental 
feature, functionality documented on this page is subject to change or removal 
in future releases. Review the release notes and this page to stay up to date 
on changes.
+
+## General query execution
+
+- There's no fault tolerance. If any task fails, the entire query fails. 
+
+- In case of a worker crash, stage outputs on S3 are not deleted 
automatically. You may need to delete
+  the file using an external process or create an S3 lifecycle policy to 
remove the objects
+  under `druid.msq.intermediate.storage.prefix`. A good start would be to 
delete the objects after 3 days if they are
+  not automatically deleted.
+
+- Only one local file system per server is used for stage output data during 
multi-stage query
+  execution. If your servers have multiple local file systems, this causes 
queries to exhaust
+  available disk space earlier than expected. As a workaround, you can use 
[durable storage for shuffle meshes](./msq-durable-storage.md).
+
+- When `msqMaxNumTasks` is higher than the total
+  capacity of the cluster, more tasks may be launched than can run at once. 
This leads to a
+  [TaskStartTimeout](./msq-reference.md#context-parameters) error code, as 
there is never enough capacity to run the query.
+  To avoid this, set `msqMaxNumTasks` to a number of tasks that can run 
simultaneously on your cluster.
+
+- When `msqTaskAssignment` is set to `auto`, the system generates one task per 
input file for certain splittable
+  input sources where file sizes are not known ahead of time. This includes 
the `http` input source, where the system
+  generates one task per URI.
+
+## Memory usage
+
+- INSERT queries can consume excessive memory when using complex types due to 
inaccurate footprint
+  estimation. This can appear as an OutOfMemoryError during the 
SegmentGenerator stage when using
+  sketches. If you run into this issue, try manually lowering the value of the
+  [`msqRowsInMemory`](./msq-reference.md#context-parameters) parameter.
+
+- EXTERN loads an entire row group into memory at once when reading from 
Parquet files. Row groups
+  can be up to 1 GB in size, which can lead to excessive heap usage when 
reading many files in
+  parallel. This can appear as an OutOfMemoryError during stages that read 
Parquet input files. If
+  you run into this issue, try using a smaller number of worker tasks or you 
can increase the heap
+  size of your Indexers or of your Middle Manager-launched indexing tasks.
+
+- Ingesting a very long row may consume excessive memory and result in an 
OutOfMemoryError. If a row is read 
+  which requires more memory than is available, the service might throw 
OutOfMemoryError. If you run into this
+  issue, allocate enough memory to be able to store the largest row to the 
indexer. 
+
+## SELECT queries
+
+- SELECT query results do not include real-time data until it has been 
published.
+
+- TIMESTAMP types are formatted as numbers rather than ISO8601 timestamp
+  strings, which differs from Druid's standard result format. 
+
+- BOOLEAN types are formatted as numbers like `1` and `0` rather
+  than `true` or `false`, which differs from Druid's standard result
+  format. 
+
+- TopN is not implemented. The context parameter
+  `useApproximateTopN` is ignored and always treated as if it
+  were `false`. Therefore, topN-shaped queries will
+  always run using the groupBy engine. There is no loss of
+  functionality, but there may be a performance impact, since
+  these queries will run using an exact algorithm instead of an
+  approximate one.
+- GROUPING SETS is not implemented. Queries that use GROUPING SETS
+  will fail.
+- The numeric flavors of the EARLIEST and LATEST aggregators do not work 
properly. Attempting to use the numeric flavors of these aggregators will lead 
to an error like `java.lang.ClassCastException: class java.lang.Double cannot 
be cast to class org.apache.druid.collections.SerializablePair`. The string 
flavors, however, do work properly.
+
+##  INSERT queries
+
+- The [schemaless 
dimensions](../ingestion/ingestion-spec.md#inclusions-and-exclusions)
+feature is not available. All columns and their types must be specified 
explicitly.
+
+- [Segment metadata queries](../querying/segmentmetadataquery.md)
+  on datasources ingested with the Multi-Stage Query Engine will return values 
for`timestampSpec` that are not usable
+  for introspection.
+
+- When INSERT with GROUP BY does the match the criteria mentioned in [GROUP 
BY](./index.md#group-by),  the multi-stage engine generates segments that 
Druid's compaction
+  functionality is not able to further roll up. This applies to automatic 
compaction as well as manually
+  issued `compact` tasks. Individual queries executed with the multi-stage 
engine always guarantee
+  perfect rollup for their output, so this only matters if you are performing 
a sequence of INSERT
+  queries that each append data to the same time chunk. If necessary, you can 
compact such data
+  using another SQL query instead of a `compact` task.
+
+- When using INSERT with GROUP BY, splitting of large partitions is not 
currently
+  implemented. If a single partition key appears in a
+  very large number of rows, an oversized segment will be created.
+  You can mitigate this by adding additional columns to your
+  partition key. Note that partition splitting _does_ work properly
+  when performing INSERT without GROUP BY.
+
+- INSERT with column lists, like
+  `INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented.
+
+## EXTERN queries
+
+- EXTERN does not accept `druid` input sources.
+
+## Missing guardrails
+

Review Comment:
   Maximum number of input files. No guardrail today means the controller can 
potentially run out of memory tracking them all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] cryptoe commented on a diff in pull request #12983: msq: add multi-stage-query docs

Reply via email to