gianm commented on code in PR #15245:
URL: https://github.com/apache/druid/pull/15245#discussion_r1372301620
##########
docs/multi-stage-query/concepts.md:
##########
@@ -88,6 +88,9 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in
mind that segments g
with dimension-based pruning but those generated with `INSERT` cannot. For
more information about the requirements
for dimension-based pruning, see [Clustering](#clustering).
+To insert [ARRAY types](../querying/arrays.md), be sure to set context flag
`"arrayIngestMode":"array"` which allows
Review Comment:
Hmm, this seems like the wrong place to put this. It's generic docs about
`INSERT`; we don't want to gunk it up with stuff about specific data that might
be inserted. (Otherwise this would be, like, 10 times longer.)
I suggest cutting it, and relying on the examples and the array docs to
guide people.
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
Review Comment:
"Arrays" is a better title and scope — as people can use arrays even if they
don't have array columns. For example they can use `MV_TO_ARRAY`, `ARRAY_AGG`,
etc.
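For instance, something along these lines works on a plain table with no ARRAY columns at all (a sketch; `mytable` and its multi-value string column `tags` are made-up names):

```sql
-- "mytable" and "tags" are hypothetical; "tags" is a multi-value string
-- column, not an ARRAY column, yet the query still produces arrays.
SELECT
  "label",
  MV_TO_ARRAY("tags") AS "tagsArray"  -- multi-value string converted to an ARRAY
FROM "mytable"
```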
##########
docs/multi-stage-query/concepts.md:
##########
@@ -192,10 +195,13 @@ To perform ingestion with rollup:
2. Set [`finalizeAggregations: false`](reference.md#context-parameters) in
your context. This causes aggregation
functions to write their internal state to the generated segments, instead
of the finalized end result, and enables
further aggregation at query time.
-3. Wrap all multi-value strings in `MV_TO_ARRAY(...)` and set
[`groupByEnableMultiValueUnnesting:
- false`](reference.md#context-parameters) in your context. This ensures that
multi-value strings are left alone and
- remain lists, instead of being [automatically
unnested](../querying/sql-data-types.md#multi-value-strings) by the
- `GROUP BY` operator.
+3. To ingest [Druid multi-value
dimensions](../querying/multi-value-dimensions.md), wrap all multi-value
strings
Review Comment:
This direction has become too complicated for people to understand, so I
think we'll need an example.
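Roughly along these lines, perhaps (a sketch only; the datasource and column names are made up):

```sql
-- Context: {"finalizeAggregations": false, "groupByEnableMultiValueUnnesting": false}
-- "wikipedia_rollup", "wikipedia", and "tags" are hypothetical names.
INSERT INTO "wikipedia_rollup"
SELECT
  TIME_FLOOR("__time", 'PT1H') AS "__time",
  MV_TO_ARRAY("tags") AS "tags",  -- keep the multi-value string intact as a list
  COUNT(*) AS "cnt"
FROM "wikipedia"
GROUP BY 1, 2
PARTITIONED BY DAY
```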
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
Review Comment:
Sort of unclear what the verb "include" refers to. The sentence construction
makes it sound like the arrays themselves must include the context parameter.
But that isn't right. Also, "multi-stage ingestion" isn't a thing 🙂— it's
"SQL-based ingestion" or "multi-stage query".
So, suggestion:
> Arrays can also be inserted with [SQL-based
ingestion](../multi-stage-query/index.md) when you use the context parameter
`"arrayIngestMode": "array"`.
Also link the text `context parameter` to
`docs/multi-stage-query/reference.md#context`.
Also include some text about what will happen if you _don't_ do
`arrayIngestMode: array`. Something like: string arrays will be converted to
multi-value dimensions, and numeric arrays will cause the query to fail with an
error (what error?)
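As a concrete illustration for the suggested wording, the parameter goes in the `context` object of the SQL-based ingestion request body (the query string here is just a placeholder):

```json
{
  "query": "INSERT INTO ... SELECT ... FROM TABLE(EXTERN(...))",
  "context": {
    "arrayIngestMode": "array"
  }
}
```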
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying arrays
+
+### Filtering
+
+All query types, as well as [filtered
aggregators](aggregations.md#filtered-aggregator), can filter on array typed
columns. Filters follow these rules for array types:
+
+- Value filters, such as "equality" and "range", match on entire array values
+- The "null" filter matches rows where the entire array value is null
+- Array-specific functions like `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the
behavior specified by those functions
+- All other filters do not directly support ARRAY types
+
+#### Example: equality
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong = ARRAY[1,2,3]
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+#### Example: null
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong is null
Review Comment:
style nit: capitalize SQL keywords like `IS NULL`
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
Review Comment:
Somewhere around here we should have a section "String arrays vs.
multi-value dimensions" that sets people straight about the differences.
Suggested text:
> Avoid confusing string arrays with multi-value dimensions (link to MVD
docs). Arrays and multi-value dimensions are stored in different column types,
and query behavior is different. You can use the functions `MV_TO_ARRAY` and
`ARRAY_TO_MV` to convert between the two if needed. In general, we recommend
using arrays whenever possible, since they are a newer and more powerful
feature.
>
> Use care during ingestion to ensure you get the type you want.
>
> To get arrays when performing an ingestion using JSON ingestion specs,
such as native batch (link) or streaming ingestion (link), use dimension type
`auto` or enable `useSchemaDiscovery`. When performing a SQL-based ingestion,
write a query that generates arrays and set the context parameter
`arrayIngestMode: array`.
>
> To get multi-value dimensions when performing an ingestion using JSON
ingestion specs, use dimension type `string` and do not enable
`useSchemaDiscovery`. When performing a SQL-based ingestion, wrap arrays in
`ARRAY_TO_MV` (link to examples). Note that multi-value dimensions must be
strings, even though arrays may be strings or numeric.
>
> You can tell which type you have by checking the
`INFORMATION_SCHEMA.COLUMNS` table, using a query like `SELECT COLUMN_NAME,
DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'mytable'`. Arrays
are type `ARRAY`, multi-value strings are type `VARCHAR`.
I suggest including the same exact text in `multi-value-dimensions.md`, or
at least linking to this section prominently.
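To illustrate the `ARRAY_TO_MV` path in the suggested text (a sketch based on the `array_example` table from this doc; the destination table name is hypothetical):

```sql
-- Stores "arrayString" as a multi-value string dimension rather than an
-- ARRAY column; "mvd_example" is a made-up destination name.
INSERT INTO "mvd_example"
SELECT
  "__time",
  "label",
  ARRAY_TO_MV("arrayString") AS "stringMvd"
FROM "array_example"
PARTITIONED BY DAY
```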
##########
docs/querying/sql-data-types.md:
##########
@@ -75,6 +75,17 @@ Casts between two SQL types that have different Druid
runtime types generate a r
If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`,
Druid substitutes [NULL](#null-values).
When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid
instead substitutes a default value, including when NULL values cast to
non-nullable types. For example, if `druid.generic.useDefaultValueForNull =
true`, a null VARCHAR cast to BIGINT is converted to a zero.
+## Arrays
+
+Druid supports [ARRAY types](arrays.md), which behave as standard SQL arrays,
where results are grouped by matching entire arrays. The [`UNNEST`
operator](./sql-array-functions.md#unn) can be used to perform operations on
individual array elements, translating each element into a separate row.
+
+ARRAY typed columns can be stored in segments with classic JSON-based ingestion
using the 'auto' typed dimension schema shared with [schema
auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions)
to detect and ingest arrays as ARRAY typed columns. For [SQL based
ingestion](../multi-stage-query/index.md), the query context parameter
`arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In
Druid 28, the default mode for this parameter is `'mvd'` for backwards
compatibility, which instead can only handle `ARRAY<STRING>` which it stores in
[multi-value string columns](#multi-value-strings).
Review Comment:
No real reason to have the extra single quotes in `'mvd'`. Doing `mvd` is
preferred.
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
Review Comment:
Would be good to include an aggregation function here, since rollup examples
tend to be clearer with one of them. Even `COUNT(*)` is illustrative.
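For instance, the SELECT portion of the rollup example could become (same `ext` input as defined in the doc's example; only the aggregator is new):

```sql
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "label",
  "arrayString",
  "arrayLong",
  "arrayDouble",
  COUNT(*) AS "cnt"  -- aggregation function makes the rollup behavior visible
FROM "ext"
GROUP BY 1, 2, 3, 4, 5
PARTITIONED BY DAY
```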
##########
docs/multi-stage-query/reference.md:
##########
@@ -232,23 +232,25 @@ If you're using the web console, you can specify the
context parameters through
The following table lists the context parameters for the MSQ task engine:
-| Parameter | Description
| Default value |
-|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---|
-| `maxNumTasks` | SELECT, INSERT, REPLACE<br /><br />The maximum total number
of tasks to launch, including the controller task. The lowest possible value
for this setting is 2: one controller and one worker. All tasks must be able to
launch simultaneously. If they cannot, the query returns a `TaskStartTimeout`
error code after approximately 10 minutes.<br /><br />May also be provided as
`numTasks`. If both are present, `maxNumTasks` takes priority.
| 2 |
-| `taskAssignment` | SELECT, INSERT, REPLACE<br /><br />Determines how many
tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as
possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be
determined through directory listing (for example: local files, S3, GCS, HDFS)
uses as few tasks as possible without exceeding 512 MiB or 10,000 files per
task, unless exceeding these limits is necessary to stay within `maxNumTasks`.
When calculating the size of files, the weighted size is used, which considers
the file format and compression format used if any. When file sizes cannot be
determined through directory listing (for example: http), behaves the same as
`max`.</li></ul>
| `max` |
-| `finalizeAggregations` | SELECT, INSERT, REPLACE<br /><br />Determines the
type of aggregation to return. If true, Druid finalizes the results of complex
aggregations that directly appear in query results. If false, Druid returns the
aggregation's intermediate type rather than finalized type. This parameter is
useful during ingestion, where it enables storing sketches directly in Druid
tables. For more information about aggregations, see [SQL aggregation
functions](../querying/sql-aggregations.md).
| true |
-| `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE<br /><br />Algorithm to use for
JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for
sort-merge join. Affects all JOIN operations in the query. This is a hint to
the MSQ engine and the actual joins in the query may proceed in a different way
than specified. See [Joins](#joins) for more details.
| `broadcast` |
-| `rowsInMemory` | INSERT or REPLACE<br /><br />Maximum number of rows to
store in memory at once before flushing to disk during the segment generation
process. Ignored for non-INSERT queries. In most cases, use the default value.
You may need to override the default if you run into one of the [known
issues](./known-issues.md) around memory usage.
| 100,000 |
+| Parameter | Description | Default value |
+|---|---|---|
+| `maxNumTasks` | SELECT, INSERT, REPLACE<br /><br />The maximum total number
of tasks to launch, including the controller task. The lowest possible value
for this setting is 2: one controller and one worker. All tasks must be able to
launch simultaneously. If they cannot, the query returns a `TaskStartTimeout`
error code after approximately 10 minutes.<br /><br />May also be provided as
`numTasks`. If both are present, `maxNumTasks` takes priority. | 2 |
+| `taskAssignment` | SELECT, INSERT, REPLACE<br /><br />Determines how many
tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as
possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be
determined through directory listing (for example: local files, S3, GCS, HDFS)
uses as few tasks as possible without exceeding 512 MiB or 10,000 files per
task, unless exceeding these limits is necessary to stay within `maxNumTasks`.
When calculating the size of files, the weighted size is used, which considers
the file format and compression format used if any. When file sizes cannot be
determined through directory listing (for example: http), behaves the same as
`max`.</li></ul> | `max` |
+| `finalizeAggregations` | SELECT, INSERT, REPLACE<br /><br />Determines the
type of aggregation to return. If true, Druid finalizes the results of complex
aggregations that directly appear in query results. If false, Druid returns the
aggregation's intermediate type rather than finalized type. This parameter is
useful during ingestion, where it enables storing sketches directly in Druid
tables. For more information about aggregations, see [SQL aggregation
functions](../querying/sql-aggregations.md). | true |
+| `sqlJoinAlgorithm` | SELECT, INSERT, REPLACE<br /><br />Algorithm to use for
JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for
sort-merge join. Affects all JOIN operations in the query. This is a hint to
the MSQ engine and the actual joins in the query may proceed in a different way
than specified. See [Joins](#joins) for more details. | `broadcast` |
+| `rowsInMemory` | INSERT or REPLACE<br /><br />Maximum number of rows to
store in memory at once before flushing to disk during the segment generation
process. Ignored for non-INSERT queries. In most cases, use the default value.
You may need to override the default if you run into one of the [known
issues](./known-issues.md) around memory usage. | 100,000 |
| `segmentSortOrder` | INSERT or REPLACE<br /><br />Normally, Druid sorts rows
in individual segments using `__time` first, followed by the [CLUSTERED
BY](#clustered-by) clause. When you set `segmentSortOrder`, Druid sorts rows in
segments using this column list first, followed by the CLUSTERED BY order.<br
/><br />You provide the column list as comma-separated values or as a JSON
array in string form. If your query includes `__time`, then this list must
begin with `__time`. For example, consider an INSERT query that uses `CLUSTERED
BY country` and has `segmentSortOrder` set to `__time,city`. Within each time
chunk, Druid assigns rows to segments based on `country`, and then within each
of those segments, Druid sorts those rows by `__time` first, then `city`, then
`country`. | empty list |
-| `maxParseExceptions`| SELECT, INSERT, REPLACE<br /><br />Maximum number of
parse exceptions that are ignored while executing the query before it stops
with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value
to -1.
| 0 |
-| `rowsPerSegment` | INSERT or REPLACE<br /><br />The number of rows per
segment to target. The actual number of rows per segment may be somewhat higher
or lower than this number. In most cases, use the default. For general
information about sizing rows per segment, see [Segment Size
Optimization](../operations/segment-optimization.md).
| 3,000,000 |
-| `indexSpec` | INSERT or REPLACE<br /><br />An
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec) to use when generating
segments. May be a JSON string or object. See [Front
coding](../ingestion/ingestion-spec.md#front-coding) for details on configuring
an `indexSpec` with front coding.
| See
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec). |
-| `durableShuffleStorage` | SELECT, INSERT, REPLACE <br /><br />Whether to use
durable storage for shuffle mesh. To use this feature, configure the durable
storage at the server level using
`druid.msq.intermediate.storage.enable=true`). If these properties are not
configured, any query with the context variable `durableShuffleStorage=true`
fails with a configuration error. <br /><br />
| `false` |
-| `faultTolerance` | SELECT, INSERT, REPLACE<br /><br /> Whether to turn on
fault tolerance mode or not. Failed workers are retried based on
[Limits](#limits). Cannot be used when `durableShuffleStorage` is explicitly
set to false.
| `false` |
-| `selectDestination` | SELECT<br /><br /> Controls where the final result of
the select query is written. <br />Use `taskReport`(the default) to write
select results to the task report. <b> This is not scalable since task reports
size explodes for large results </b> <br/>Use `durableStorage` to write results
to durable storage location. <b>For large results sets, its recommended to use
`durableStorage` </b>. To configure durable storage see
[`this`](#durable-storage) section.
| `taskReport` |
-| `waitUntilSegmentsLoad` | INSERT, REPLACE<br /><br /> If set, the ingest
query waits for the generated segment to be loaded before exiting, else the
ingest query exits without waiting. The task and live reports contain the
information about the status of loading segments if this flag is set. This will
ensure that any future queries made after the ingestion exits will include
results from the ingestion. The drawback is that the controller task will stall
until the segments are loaded.
| `false` |
-| `includeSegmentSource` | SELECT, INSERT, REPLACE<br /><br /> Controls the
sources, which will be queried for results in addition to the segments present
on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only
non-realtime (published and used) segments will be downloaded from deep
storage. If this value is `REALTIME`, results will also be included from
realtime tasks.
| `NONE` |
-| `rowsPerPage` | SELECT<br /><br />The number of rows per page to target. The
actual number of rows per page may be somewhat higher or lower than this
number. In most cases, use the default.<br /> This property comes into effect
only when `selectDestination` is set to `durableStorage`
| 100000 |
+| `maxParseExceptions`| SELECT, INSERT, REPLACE<br /><br />Maximum number of
parse exceptions that are ignored while executing the query before it stops
with `TooManyWarningsFault`. To ignore all the parse exceptions, set the value
to -1. | 0 |
+| `rowsPerSegment` | INSERT or REPLACE<br /><br />The number of rows per
segment to target. The actual number of rows per segment may be somewhat higher
or lower than this number. In most cases, use the default. For general
information about sizing rows per segment, see [Segment Size
Optimization](../operations/segment-optimization.md). | 3,000,000 |
+| `indexSpec` | INSERT or REPLACE<br /><br />An
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec) to use when generating
segments. May be a JSON string or object. See [Front
coding](../ingestion/ingestion-spec.md#front-coding) for details on configuring
an `indexSpec` with front coding. | See
[`indexSpec`](../ingestion/ingestion-spec.md#indexspec). |
+| `durableShuffleStorage` | SELECT, INSERT, REPLACE<br /><br />Whether to use durable storage for shuffle mesh. To use this feature, configure durable storage at the server level by setting `druid.msq.intermediate.storage.enable=true`. If this property is not configured, any query with the context variable `durableShuffleStorage=true` fails with a configuration error. | `false` |
+| `faultTolerance` | SELECT, INSERT, REPLACE<br /><br />Whether to turn on fault tolerance mode. Failed workers are retried based on [Limits](#limits). Cannot be used when `durableShuffleStorage` is explicitly set to false. | `false` |
+| `selectDestination` | SELECT<br /><br />Controls where the final result of the SELECT query is written. <br />Use `taskReport` (the default) to write select results to the task report. <b>This is not scalable, since the task report size grows very large for large result sets.</b> <br/>Use `durableStorage` to write results to a durable storage location. <b>For large result sets, it's recommended to use `durableStorage`.</b> To configure durable storage, see the [durable storage](#durable-storage) section. | `taskReport` |
+| `waitUntilSegmentsLoad` | INSERT, REPLACE<br /><br />If set, the ingest query waits for the generated segments to be loaded before exiting; otherwise, it exits without waiting. When this flag is set, the task and live reports contain information about the status of segment loading. This ensures that any queries made after the ingestion exits include results from the ingestion. The drawback is that the controller task stalls until the segments are loaded. | `false` |
+| `includeSegmentSource` | SELECT, INSERT, REPLACE<br /><br />Controls the sources that are queried for results, in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If `NONE`, only non-realtime (published and used) segments are downloaded from deep storage. If `REALTIME`, results are also included from realtime tasks. | `NONE` |
+| `rowsPerPage` | SELECT<br /><br />The number of rows per page to target. The actual number of rows per page may be somewhat higher or lower than this number. In most cases, use the default.<br />This property takes effect only when `selectDestination` is set to `durableStorage`. | 100000 |
+| `arrayIngestMode` | INSERT, REPLACE<br /><br />Controls how ARRAY type values are stored in Druid segments. When set to `'array'` (recommended for SQL compliance), Druid stores all ARRAY typed values in [ARRAY typed columns](../querying/arrays.md) and supports both VARCHAR and numeric typed arrays. When set to `'mvd'` (the default, for backwards compatibility), Druid supports only VARCHAR typed arrays and stores them as [multi-value string columns](../querying/multi-value-dimensions.md). When set to `'none'`, Druid throws an exception when attempting to store any type of array. The `'none'` mode is useful for migrating from `'mvd'` to `'array'`, since it forces query writers to make an explicit choice between ARRAY and multi-value VARCHAR typed columns. | `'mvd'` (for backwards compatibility; `'array'` is recommended for SQL compliance) |
Review Comment:
These are sorted vaguely in order of how commonly people would need to
change them (more common near the top). This one is probably going to be pretty
common, since it's needed for correct SQL behavior with arrays. I suggest
putting it after `finalizeAggregations`. Both of those are pretty big
influences on the meaning of an insert statement.
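To illustrate the comment above, a hedged sketch of how this context parameter might appear alongside `finalizeAggregations` in an MSQ SQL task payload (the table name and query are placeholders, not from the PR):

```json
{
  "query": "INSERT INTO \"example_table\" SELECT ... PARTITIONED BY DAY",
  "context": {
    "finalizeAggregations": false,
    "arrayIngestMode": "array"
  }
}
```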
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior from standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
Review Comment:
Native batch ingestion, or streaming ingestion.
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior from standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
Review Comment:
"Ingesting arrays" rather than "Overview"?
In the ToC it will show up right above "Querying arrays" so there's a nice
symmetry there
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior from standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiter using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type; JSON data does not require any special `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```json
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
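As a companion to the `listDelimiter` note above, a hedged sketch of a TSV `inputFormat` that would parse pipe-delimited arrays (the delimiter and header setting are illustrative assumptions, not from the PR):

```json
{
  "type": "tsv",
  "findColumnsFromHeader": true,
  "listDelimiter": "|"
}
```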
+
+Arrays can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md), but the query must include the context parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying arrays
+
+### Filtering
+
+All query types, as well as [filtered
aggregators](aggregations.md#filtered-aggregator), can filter on array typed
columns. Filters follow these rules for array types:
+
+- Value filters, such as "equality" and "range", match on the entire array value
+- The "null" filter matches rows where the entire array value is null
+- Array-specific functions such as `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the behavior specified by those functions
+- All other filters do not directly support ARRAY types
+
+#### Example: equality
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong = ARRAY[1,2,3]
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+#### Example: null
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong is null
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row2","arrayString":"[null,\"b\"]","arrayLong":null,"arrayDouble":"[999.0,null,5.5]"}
+```
+
+#### Example: range
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayString >= ARRAY['a','b']
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+#### Example: ARRAY_CONTAINS
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_CONTAINS(arrayString, 'a')
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+### Grouping
+
+When grouping on an array with SQL or a native [groupBy query](groupbyquery.md), grouping follows standard SQL behavior and groups on the entire array as a single value. The [`UNNEST`](sql.md#unnest) function allows grouping on the individual array elements.
+
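The `UNNEST` behavior described above can be sketched as follows (a hypothetical query against the example datasource; results depend on the rows loaded):

```sql
SELECT "strings", COUNT(*) AS "element_count"
FROM "array_example" CROSS JOIN UNNEST("arrayString") AS u("strings")
GROUP BY 1
```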
+#### Example: SQL grouping query with no filtering
+```sql
+SELECT label, arrayString
+FROM "array_example"
+GROUP BY 1,2
+```
+results in:
+```json lines
+{"label":"row1","arrayString":"[\"a\",\"b\"]"}
+{"label":"row2","arrayString":"[null,\"b\"]"}
+{"label":"row3","arrayString":"[]"}
+{"label":"row4","arrayString":"[\"a\",\"b\"]"}
+{"label":"row5","arrayString":null}
+```
+
+#### Example: SQL grouping query with a filter
+```sql
+SELECT label, arrayString
+FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings)
Review Comment:
is this `CROSS JOIN` meant to be here? it doesn't seem to be doing much if
anything
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior from standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
Review Comment:
Use separate `###` sections for "Ingestion specs" and "SQL"
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other, more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimensions](multi-value-dimensions.md), which have significantly different behavior from standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiter using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as an array type; JSON data does not require any special `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```json
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md), but the query must include the context parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying arrays
+
+### Filtering
+
+All query types, as well as [filtered
aggregators](aggregations.md#filtered-aggregator), can filter on array typed
columns. Filters follow these rules for array types:
+
+- Value filters, such as "equality" and "range", match on the entire array value
+- The "null" filter matches rows where the entire array value is null
+- Array-specific functions such as `ARRAY_CONTAINS` and `ARRAY_OVERLAP` follow the behavior specified by those functions
+- All other filters do not directly support ARRAY types
Review Comment:
What does "do not directly support" mean? Error? Treated as null? Undefined
behavior? Should be specific.
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
Review Comment:
Odd to be mixing SQL terms with native type names here. Possibly switch to
`VARCHAR`, `BIGINT`, and `DOUBLE` for making more sense with SQL. Or mention
both the SQL name and the native name?
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying ARRAYS
Review Comment:
"Querying arrays"?
Don't see a reason to capitalize it here
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
Review Comment:
Sort of unclear what the verb "include" refers to. The sentence construction
makes it sound like the arrays themselves must include the context parameter.
But that isn't right. Also, "multi-stage ingestion" isn't a thing 🙂— it's
"SQL-based ingestion" or "multi-stage query".
So, suggestion:
> Arrays can also be inserted with [SQL-based
ingestion](../multi-stage-query/index.md) when you use the context parameter
`"arrayIngestMode": "array"`.
Also link the text `context parameter` to
`docs/multi-stage-query/reference.md#context`.
Also include some text about what will happen if you _don't_ do
`arrayIngestMode: array`. Something like: string arrays will be converted to
multi-value dimensions, and numeric arrays will cause the query to fail with an
error (what error?)
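For instance, the reference could show where the context parameter actually goes. A sketch of a SQL-based ingestion payload, assuming the `/druid/v2/sql/task` endpoint and abbreviating the query itself:

```json
{
  "query": "INSERT INTO \"array_example\" SELECT ... FROM TABLE(EXTERN(...)) PARTITIONED BY DAY",
  "context": {
    "arrayIngestMode": "array"
  }
}
```

The same `"arrayIngestMode": "array"` entry would go in the query context of whichever client is submitting the statement.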
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying ARRAYS
+
+### Filtering
+
+All query types, as well as [filtered
aggregators](aggregations.md#filtered-aggregator), can filter on array typed
columns. Filters follow these rules for array types:
+
+- Value filters, like "equality", "range" match on entire array values
Review Comment:
Weird that the text here refers to native query concepts, but the examples
are SQL. Seems like that should be better aligned. One way to do it is to
reword things a bit here, like:
> Native value filters like `equality` and `range` match on entire array
values, as do SQL constructs that plan into these native filters.
Also `equality` should link to `docs/querying/filters.md#equality-filter`,
etc.
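A concrete native-filter snippet might make the SQL-to-native mapping clearer. A sketch, assuming the `equality` filter's `matchValueType`/`matchValue` fields, of roughly what `arrayLong = ARRAY[1,2,3]` plans into:

```json
{
  "type": "equality",
  "column": "arrayLong",
  "matchValueType": "ARRAY<LONG>",
  "matchValue": [1, 2, 3]
}
```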
##########
docs/querying/sql-data-types.md:
##########
@@ -75,6 +75,17 @@ Casts between two SQL types that have different Druid
runtime types generate a r
If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`,
Druid substitutes [NULL](#null-values).
When `druid.generic.useDefaultValueForNull = true` (legacy mode), Druid
instead substitutes a default value, including when NULL values cast to
non-nullable types. For example, if `druid.generic.useDefaultValueForNull =
true`, a null VARCHAR cast to BIGINT is converted to a zero.
+## Arrays
+
+Druid supports [ARRAY types](arrays.md), which behave as standard SQL arrays,
where results are grouped by matching entire arrays. The [`UNNEST`
operator](./sql-array-functions.md#unn) can be used to perform operations on
individual array elements, translating each element into a separate row.
+
+ARRAY typed columns can be stored in segments with classic JSON-based ingestion
using the 'auto' typed dimension schema shared with [schema
auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions)
to detect and ingest arrays as ARRAY typed columns. For [SQL based
ingestion](../multi-stage-query/index.md), the query context parameter
`arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In
Druid 28, the default mode for this parameter is `'mvd'` for backwards
compatibility, which instead can only handle `ARRAY<STRING>` which it stores in
[multi-value string columns](#multi-value-strings).
+
+You can convert multi-value dimensions to standard SQL arrays explicitly with
`MV_TO_ARRAY` or implicitly using [array functions](./sql-array-functions.md).
You can also use the array functions to construct arrays from multiple columns.
+
+Druid serializes `ARRAY` results as a JSON string of the array by default,
which can be controlled by the context parameter
+`sqlStringifyArrays`. When set to `false`, arrays are returned as regular JSON arrays rather than in stringified form.
Review Comment:
Surely this is only true for certain result formats? I mean, in `csv`,
everything must be stringified somehow.
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\",
\"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1,
null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"],
\"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"],
\"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\":
\"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null,
\"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"arrayString",
"type":"ARRAY<STRING>"},{"name":"arrayLong",
"type":"ARRAY<LONG>"},{"name":"arrayDouble", "type":"ARRAY<DOUBLE>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+
+## Querying ARRAYS
+
+### Filtering
+
+All query types, as well as [filtered
aggregators](aggregations.md#filtered-aggregator), can filter on array typed
columns. Filters follow these rules for array types:
+
+- Value filters, like "equality", "range" match on entire array values
+- The "null" filter will match rows where the entire array value is null
+- Array specific functions like ARRAY_CONTAINS and ARRAY_OVERLAP follow the
behavior specified by those functions
+- All other filters do not directly support ARRAY types
+
+#### Example: equality
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong = ARRAY[1,2,3]
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row3","arrayString":"[]","arrayLong":"[1,2,3]","arrayDouble":"[null,2.2,1.1]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+#### Example: null
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayLong is null
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row2","arrayString":"[null,\"b\"]","arrayLong":null,"arrayDouble":"[999.0,null,5.5]"}
+```
+
+#### Example: range
+```sql
+SELECT *
+FROM "array_example"
+WHERE arrayString >= ARRAY['a','b']
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+#### Example: ARRAY_CONTAINS
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_CONTAINS(arrayString, 'a')
+```
+
+```json lines
+{"__time":"2023-01-01T00:00:00.000Z","label":"row1","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,null,3]","arrayDouble":"[1.1,2.2,null]"}
+{"__time":"2023-01-01T00:00:00.000Z","label":"row4","arrayString":"[\"a\",\"b\"]","arrayLong":"[1,2,3]","arrayDouble":"[]"}
+```
+
+### Grouping
+
+When grouping on an array with SQL or a native [groupBy
queries](groupbyquery.md), grouping follows standard SQL behavior and groups on
the entire array as a single value. The [`UNNEST`](sql.md#unnest) function
allows grouping on the individual array elements.
Review Comment:
grammar: "groupBy query"
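An element-level grouping example would also help here. A sketch against the page's example data, using Druid's `UNNEST` syntax (the `t`/`el` aliases are illustrative):

```sql
SELECT "t"."el", COUNT(*) AS "cnt"
FROM "array_example" CROSS JOIN UNNEST("arrayString") AS "t"("el")
GROUP BY 1
```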
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimensions](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
+
+```json lines
+{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a",
"b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null,
"b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],
"arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a",
"b"], "arrayLong":[1, 2, 3], "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,
"arrayLong":[], "arrayDouble":null}
+```
+
+## Overview
+
+When using [native ingestion](../ingestion/native-batch.md), arrays can be
ingested using the [`"auto"`](../ingestion/ingestion-spec.md#dimension-objects)
type dimension schema which is shared with [type-aware schema
discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
+
+When ingesting from TSV or CSV data, you can specify the array delimiters
using the `listDelimiter` field in the `inputFormat`. JSON data must be
formatted as a JSON array to be ingested as an array type. JSON data does not
require `inputFormat` configuration.
+
+The following shows an example `dimensionsSpec` for native ingestion of the
data used in this document:
+
+```
+"dimensions": [
+ {
+ "type": "auto",
+ "name": "label"
+ },
+ {
+ "type": "auto",
+ "name": "arrayString"
+ },
+ {
+ "type": "auto",
+ "name": "arrayLong"
+ },
+ {
+ "type": "auto",
+ "name": "arrayDouble"
+ }
+],
+```
+
+Arrays can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md), but must include a query context
parameter `"arrayIngestMode":"array"`.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "array_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY&lt;STRING&gt;"},{"name":"arrayLong", "type":"ARRAY&lt;LONG&gt;"},{"name":"arrayDouble", "type":"ARRAY&lt;DOUBLE&gt;"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+PARTITIONED BY DAY
+```
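+
+When submitting such a query through the SQL task API, the context parameter goes in the `context` object of the request body, for example (a sketch; the query string is elided):
+
+```json
+{
+  "query": "REPLACE INTO ...",
+  "context": {
+    "arrayIngestMode": "array"
+  }
+}
+```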
+
+These input arrays can also be grouped for rollup:
+
+```sql
+REPLACE INTO "array_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row1\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, null,3], \"arrayDouble\":[1.1, 2.2, null]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row2\", \"arrayString\": [null, \"b\"], \"arrayLong\":null, \"arrayDouble\":[999, null, 5.5]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row3\", \"arrayString\": [], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[null, 2.2, 1.1]} \n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row4\", \"arrayString\": [\"a\", \"b\"], \"arrayLong\":[1, 2, 3], \"arrayDouble\":[]}\n{\"timestamp\": \"2023-01-01T00:00:00\", \"label\": \"row5\", \"arrayString\": null, \"arrayLong\":[], \"arrayDouble\":null}"}',
+ '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"arrayString", "type":"ARRAY&lt;STRING&gt;"},{"name":"arrayLong", "type":"ARRAY&lt;LONG&gt;"},{"name":"arrayDouble", "type":"ARRAY&lt;DOUBLE&gt;"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ "arrayString",
+ "arrayLong",
+ "arrayDouble"
+FROM "ext"
+GROUP BY 1,2,3,4,5
+PARTITIONED BY DAY
+```
+
+## Querying arrays
+
+### Filtering
+
+All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on array typed columns. Filters follow these rules for array types:
+
+- Value filters, such as "equality" and "range", match on entire array values.
+- The "null" filter matches rows where the entire array value is null.
+- Array specific functions, such as [`ARRAY_CONTAINS`](sql-functions.md#array_contains) and [`ARRAY_OVERLAP`](sql-functions.md#array_overlap), follow the behavior specified by those functions.
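+
+For instance, against the example table above, an entire-array equality filter and an element-level filter could be written as follows (a sketch; results depend on your data):
+
+```sql
+-- Match rows whose arrayString value is exactly ARRAY['a', 'b']
+SELECT * FROM "array_example" WHERE "arrayString" = ARRAY['a', 'b']
+
+-- Match rows whose arrayString value contains the element 'a'
+SELECT * FROM "array_example" WHERE ARRAY_CONTAINS("arrayString", 'a')
+```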
Review Comment:
backticks around function names, & link them to the SQL function docs