[GitHub] [druid] vogievetsky commented on a diff in pull request #12983: msq: add multi-stage-query docs

GitBox Tue, 30 Aug 2022 14:47:56 -0700


vogievetsky commented on code in PR #12983:
URL: https://github.com/apache/druid/pull/12983#discussion_r958957881



##########
docs/multi-stage-query/msq-reference.md:
##########
@@ -0,0 +1,146 @@
+---
+id: reference
+title: SQL-based ingestion reference
+sidebar_label: Reference
+---
+
+> SQL-based ingestion using the multi-stage query task engine is our 
recommended solution starting in Druid 24.0. Alternative ingestion solutions, 
such as native batch and Hadoop-based ingestion systems, will still be 
supported. We recommend you read all [known issues](./msq-known-issues.md) and 
test the feature in a development environment before rolling it out in 
production. Using the multi-stage query task engine with `SELECT` statements 
that do not write to a datasource is experimental.
+
+This topic is a reference guide for the multi-stage query architecture in 
Apache Druid.
+
+## Context parameters
+
+In addition to the Druid SQL [context 
parameters](../querying/sql-query-context.md), the multi-stage query task 
engine accepts certain context parameters that are specific to it. 
+
+Use context parameters alongside your queries to customize the behavior of the 
query. If you're using the API, include the context parameters in the query 
context when you submit a query:
+
+```json
+{
+  "query": "SELECT 1 + 1",
+  "context": {
+    "<key>": "<value>",
+    "maxNumTasks": 3
+  }
+}
+```
+
+If you're using the Druid console, you can specify the context parameters 
through various UI options.
+
+The following table lists the context parameters for the MSQ task engine:
+
+|Parameter|Description|Default value|
+|---------|-----------|-------------|
+| maxNumTasks | SELECT, INSERT, REPLACE<br /><br />The maximum total number of 
tasks to launch, including the controller task. The lowest possible value for 
this setting is 2: one controller and one worker. All tasks must be able to 
launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` 
error code after approximately 10 minutes.<br /><br />May also be provided as 
`numTasks`. If both are present, `maxNumTasks` takes priority.| 2 |
+| taskAssignment | SELECT, INSERT, REPLACE<br /><br />Determines how many 
tasks to use. Possible values include: <ul><li>`max`: Use as many tasks as 
possible, up to the maximum `maxNumTasks`.</li><li>`auto`: Use as few tasks as 
possible without exceeding 10 GiB or 10,000 files per task. Review the 
[limitations](./msq-known-issues.md#general-query-execution) of `auto` mode 
before using it.</li></ui>| `max` |
+| finalizeAggregations | SELECT, INSERT, REPLACE<br /><br />Determines the 
type of aggregation to return. If true, Druid finalizes the results of complex 
aggregations that directly appear in query results. If false, Druid returns the 
aggregation's intermediate type rather than finalized type. This parameter is 
useful during ingestion, where it enables storing sketches directly in Druid 
tables. For more information about aggregations, see [SQL aggregation 
functions](../querying/sql-aggregations.md). | true |
+| rowsInMemory | INSERT or REPLACE<br /><br />Maximum number of rows to store 
in memory at once before flushing to disk during the segment generation 
process. Ignored for non-INSERT queries. In most cases, use the default value. 
You may need to override the default if you run into one of the [known issues 
around memory usage](./msq-known-issues.md#memory-usage)</a>. | 100,000 |
+| segmentSortOrder | INSERT or REPLACE<br /><br />Normally, Druid sorts rows 
in individual segments using `__time` first, followed by the [CLUSTERED 
BY](./index.md#clustered-by) clause. When you set `segmentSortOrder`, Druid 
sorts rows in segments using this column list first, followed by the CLUSTERED 
BY order.<br /><br />You provide the column list as comma-separated values or 
as a JSON array in string form. If your query includes `__time`, then this list 
must begin with `__time`. For example, consider an INSERT query that uses 
`CLUSTERED BY country` and has `segmentSortOrder` set to `__time,city`. Within 
each time chunk, Druid assigns rows to segments based on `country`, and then 
within each of those segments, Druid sorts those rows by `__time` first, then 
`city`, then `country`. | empty list |
+| maxParseExceptions| SELECT, INSERT, REPLACE<br /><br />Maximum number of 
parse exceptions that are ignored while executing the query before it stops 
with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value 
to -1.| 0 |
+| rowsPerSegment | INSERT or REPLACE<br /><br />The number of rows per segment 
to target. The actual number of rows per segment may be somewhat higher or 
lower than this number. In most cases, use the default. For general information 
about sizing rows per segment, see [Segment Size 
Optimization](../operations/segment-optimization.md). | 3,000,000 |
+| sqlTimeZone | Sets the time zone for this connection, which affects how time 
functions and timestamp literals behave. Use a time zone name like 
"America/Los_Angeles" or offset like "-08:00".| `druid.sql.planner.sqlTimeZone` 
on the Broker (default: UTC)|
+| useApproximateCountDistinct | Whether to use an approximate cardinality 
algorithm for `COUNT(DISTINCT foo)`.| 
`druid.sql.planner.useApproximateCountDistinct` on the Broker (default: true)|
+
+## Error codes
+
+Error codes have corresponding human-readable messages that explain the error. 
For more information about the error codes, see [Error 
codes](./msq-concepts.md#error-codes).
+
+## SQL syntax
+
+The MSQ task engine has three primary SQL functions: 
+
+- EXTERN
+- INSERT
+- REPLACE
+
+For information about using these functions and their corresponding examples, 
see [MSQ task engine query syntax](./index.md#msq-task-engine-query-syntax). 
For information about adjusting the shape of your data, see [Adjust query 
behavior](./index.md#adjust-query-behavior).
+
+### EXTERN
+
+Use the EXTERN function to read external data.
+
+Function format:
+
+```sql
+SELECT
+ <column>
+FROM TABLE(
+  EXTERN(
+    '<Druid input source>',
+    '<Druid input format>',
+    '<row signature>'
+  )
+)
+```
+
+EXTERN consists of the following parts:
+
+1.  Any [Druid input source](../ingestion/native-batch-input-source.md) as a 
JSON-encoded string.
+2.  Any [Druid input format](../ingestion/data-formats.md) as a JSON-encoded 
string.
+3.  A row signature, as a JSON-encoded array of column descriptors. Each 
column descriptor must have a `name` and a `type`. The type can be `string`, 
`long`, `double`, or `float`. This row signature is used to map the external 
data into the SQL layer.
+
+### INSERT
+

Review Comment:
   I think somewhere here we should call out a note about how the MSQ INSERT 
syntax (REPLACE also, but this has more impact for INSERT) deviates from the 
SQL standard in that the column are mapped by name and not positionally.
   
   Maybe something like:
   
   ```
   Please note that unlike standard SQL the data is inserted according to 
column name and not positionally which means that it is important to get the 
output column names of subsequent inserts to be the same as the table and not 
to simply reply on their positions within the SELECT clause.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] vogievetsky commented on a diff in pull request #12983: msq: add multi-stage-query docs

Reply via email to