Re: [PR] better documentation for the differences between arrays and mvds (druid)

via GitHub Wed, 01 Nov 2023 00:30:30 -0700


clintropolis commented on code in PR #15245:
URL: https://github.com/apache/druid/pull/15245#discussion_r1378473552



##########
docs/multi-stage-query/reference.md:
##########
@@ -232,23 +232,25 @@ If you're using the web console, you can specify the context parameters through
 
 The following table lists the context parameters for the MSQ task engine:
 
-| Parameter | Description                                                      
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                          | Default value |
-|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|
-| `maxNumTasks` | SELECT, INSERT, REPLACE<br /><br />The maximum total number 
of tasks to launch, including the controller task. The lowest possible value 
for this setting is 2: one controller and one worker. All tasks must be able to 
launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` 
error code after approximately 10 minutes.<br /><br />May also be provided as 
`numTasks`. If both are present, `maxNumTasks` takes priority.                  
                                                                                
                                                                                
                                                                                
                                                                      | 2 |
-| `taskAssignment` | SELECT, INSERT, REPLACE<br /><br />Determines how many 
tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as 
possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be 
determined through directory listing (for example: local files, S3, GCS, HDFS) 
uses as few tasks as possible without exceeding 512 MiB or 10,000 files per 
task, unless exceeding these limits is necessary to stay within `maxNumTasks`. 
When calculating the size of files, the weighted size is used, which considers 
the file format and compression format used if any. When file sizes cannot be 
determined through directory listing (for example: http), behaves the same as 
`max`.</li></ul>                                                                
             | `max` |
-| `finalizeAggregations` | SELECT, INSERT, REPLACE<br /><br />Determines the 
type of aggregation to return. If true, Druid finalizes the results of complex 
aggregations that directly appear in query results. If false, Druid returns the 
aggregation's intermediate type rather than finalized type. This parameter is 
useful during ingestion, where it enables storing sketches directly in Druid 
tables. For more information about aggregations, see [SQL aggregation 
functions](../querying/sql-aggregations.md).                                    
                                                                                
                                                                                
                                                                                
         | true |
-| `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE<br /><br />Algorithm to use for 
JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for 
sort-merge join. Affects all JOIN operations in the query. This is a hint to 
the MSQ engine and the actual joins in the query may proceed in a different way 
than specified. See [Joins](#joins) for more details.                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                         | `broadcast` |
-| `rowsInMemory` | INSERT or REPLACE<br /><br />Maximum number of rows to 
store in memory at once before flushing to disk during the segment generation 
process. Ignored for non-INSERT queries. In most cases, use the default value. 
You may need to override the default if you run into one of the [known 
issues](./known-issues.md) around memory usage.                                 
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
| 100,000 |
+| Parameter | Description | Default value |
+|---|---|---|
+| `maxNumTasks` | SELECT, INSERT, REPLACE<br /><br />The maximum total number 
of tasks to launch, including the controller task. The lowest possible value 
for this setting is 2: one controller and one worker. All tasks must be able to 
launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` 
error code after approximately 10 minutes.<br /><br />May also be provided as 
`numTasks`. If both are present, `maxNumTasks` takes priority. | 2 |
+| `taskAssignment` | SELECT, INSERT, REPLACE<br /><br />Determines how many 
tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as 
possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be 
determined through directory listing (for example: local files, S3, GCS, HDFS) 
uses as few tasks as possible without exceeding 512 MiB or 10,000 files per 
task, unless exceeding these limits is necessary to stay within `maxNumTasks`. 
When calculating the size of files, the weighted size is used, which considers 
the file format and compression format used if any. When file sizes cannot be 
determined through directory listing (for example: http), behaves the same as 
`max`.</li></ul> | `max` |
+| `finalizeAggregations` | SELECT, INSERT, REPLACE<br /><br />Determines the 
type of aggregation to return. If true, Druid finalizes the results of complex 
aggregations that directly appear in query results. If false, Druid returns the 
aggregation's intermediate type rather than finalized type. This parameter is 
useful during ingestion, where it enables storing sketches directly in Druid 
tables. For more information about aggregations, see [SQL aggregation 
functions](../querying/sql-aggregations.md). | true |
+| `arrayIngestMode` | INSERT, REPLACE<br /><br /> Controls how ARRAY type 
values are stored in Druid segments. When set to `'array'` (recommended for SQL 
compliance), Druid will store all ARRAY typed values in [ARRAY typed 
columns](../querying/arrays.md), and supports storing both VARCHAR and numeric 
typed arrays. When set to `'mvd'` (the default, for backwards compatibility), 
Druid only supports VARCHAR typed arrays, and will store them as [multi-value 
string columns](../querying/multi-value-dimensions.md). When set to `none`, 
Druid will throw an exception when trying to store any type of arrays, used to 
help migrate operators from `'mvd'` mode to `'array'` mode and force query 
writers to make an explicit choice between ARRAY and multi-value VARCHAR typed 
columns. | `'mvd'` (for backwards compatibility, recommended to use `array` for 
SQL compliance)|

Review Comment:
   I think operators would need to set the default query context with 
`druid.query.default.context.arrayIngestMode`, which could then be overridden 
on a per-query basis.
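
   As a rough illustration of that precedence (a sketch under assumptions, not 
Druid's actual implementation): the snippet below models a cluster-wide default 
that an operator might set with 
`druid.query.default.context.arrayIngestMode=array` in the runtime properties, 
and a per-query context that overrides it. The helper names 
(`effective_context`, `build_msq_payload`) and the table names in the SQL are 
hypothetical; the payload shape assumes the SQL task API's JSON body with 
`query` and `context` fields.

```python
import json

# Assumed cluster-wide default, standing in for a runtime property such as
# druid.query.default.context.arrayIngestMode=array (the value is illustrative).
DEFAULT_CONTEXT = {"arrayIngestMode": "array"}


def effective_context(default_context, query_context):
    """Merge contexts so that per-query values win over cluster defaults.

    This only models the precedence described in the comment; it is not
    Druid's own merge code.
    """
    merged = dict(default_context)
    merged.update(query_context)
    return merged


def build_msq_payload(sql, query_context=None):
    """Build a JSON-serializable body for an MSQ SQL task (hypothetical helper)."""
    return {"query": sql, "context": dict(query_context or {})}


# Hypothetical ingestion query; the table names are placeholders.
sql = 'INSERT INTO "example_table" SELECT * FROM "example_source" PARTITIONED BY ALL'

# A query that sets nothing inherits the operator's default.
inherited = effective_context(DEFAULT_CONTEXT, build_msq_payload(sql)["context"])

# A query that sets arrayIngestMode explicitly overrides the default.
payload = build_msq_payload(sql, {"arrayIngestMode": "mvd"})
overridden = effective_context(DEFAULT_CONTEXT, payload["context"])

print(json.dumps(payload, indent=2))
print(inherited["arrayIngestMode"], overridden["arrayIngestMode"])
```

   The point of the sketch is only the ordering: the operator-level default 
applies unless the query context names the same key.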



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


