317brian commented on code in PR #15245:
URL: https://github.com/apache/druid/pull/15245#discussion_r1373839768
##########
docs/querying/multi-value-dimensions.md:
##########
@@ -30,21 +30,36 @@ array of values instead of a single value, such as the
`tags` values in the foll
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
```
-This document describes filtering and grouping behavior for multi-value
dimensions. For information about the internal representation of multi-value
dimensions, see
+It is important to be aware that multi-value dimensions are distinct from
[array types](arrays.md), which behave like standard SQL arrays. This document
describes the behavior of multi-value dimensions, and some additional details
can be found in the [SQL data type
documentation](sql-data-types.md#multi-value-strings-behavior).
Review Comment:
Slight change to emphasize that they're different:
```suggestion
It is important to be aware that multi-value dimensions are distinct from
[array types](arrays.md). While array types behave like standard SQL arrays,
multi-value dimensions do not. This document describes the behavior of
multi-value dimensions, and some additional details can be found in the [SQL
data type documentation](sql-data-types.md#multi-value-strings-behavior).
```
##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions.
This behavior is contr
See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for
information on configuring multi-value handling.
+Multi-value dimensions can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md). The multi-stage query engine does
not have direct handling of class Druid multi-value dimensions. A special pair
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable
handling these types. Multi-value handling is not available when using the
multi-stage query engine to insert data.
Review Comment:
```suggestion
Multi-value dimensions can also be inserted with [SQL-based ingestion using
the multi-stage query (MSQ) task engine](../multi-stage-query/index.md). The
MSQ task engine does not have direct handling of class Druid multi-value
dimensions. A special pair of functions, `MV_TO_ARRAY` which converts
multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them
back into `VARCHAR` exist to enable handling these types. Multi-value handling
is not available when using the multi-stage query task engine to insert data.
```
##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions.
This behavior is contr
See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for
information on configuring multi-value handling.
+Multi-value dimensions can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md). The multi-stage query engine does
not have direct handling of class Druid multi-value dimensions. A special pair
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable
handling these types. Multi-value handling is not available when using the
multi-stage query engine to insert data.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "mvd_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\",
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\":
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\":
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\",
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\":
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ ARRAY_TO_MV("tags") AS "tags"
+FROM "ext"
+PARTITIONED BY DAY
+```
-## Querying multi-value dimensions
-
-The following sections describe filtering and grouping behavior based on the
following example data, which includes a multi-value dimension, `tags`.
-
+These input arrays can also be grouped prior to converting into a multi-value
dimension:
+```sql
+REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\",
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\":
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\":
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\",
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\":
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ ARRAY_TO_MV("tags") AS "tags",
+ COUNT(*) AS "count"
+FROM "ext"
+GROUP BY 1, 2, "tags"
+PARTITIONED BY DAY
```
-{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
-{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
+
+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we
only wish to coerce the type _after_ grouping.
Review Comment:
```suggestion
Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause since we
only wish to coerce the type _after_ grouping.
```
##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+ ~ Licensed to the Apache Software Foundation (ASF) under one
+ ~ or more contributor license agreements. See the NOTICE file
+ ~ distributed with this work for additional information
+ ~ regarding copyright ownership. The ASF licenses this file
+ ~ to you under the Apache License, Version 2.0 (the
+ ~ "License"); you may not use this file except in compliance
+ ~ with the License. You may obtain a copy of the License at
+ ~
+ ~ http://www.apache.org/licenses/LICENSE-2.0
+ ~
+ ~ Unless required by applicable law or agreed to in writing,
+ ~ software distributed under the License is distributed on an
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ ~ KIND, either express or implied. See the License for the
+ ~ specific language governing permissions and limitations
+ ~ under the License.
+ -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`,
and `DOUBLE` types. Other more complicated ARRAY types must be stored in
[nested columns](nested-columns.md). Druid ARRAY types are distinct from
[multi-value dimension](multi-value-dimensions.md), which have significantly
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns.
Review Comment:
```suggestion
The following sections describe inserting, filtering, and grouping behavior
based on the following example data, which includes 3 array typed columns:
```
##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions.
This behavior is contr
See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for
information on configuring multi-value handling.
+Multi-value dimensions can also be inserted with [multi-stage
ingestion](../multi-stage-query/index.md). The multi-stage query engine does
not have direct handling of class Druid multi-value dimensions. A special pair
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable
handling these types. Multi-value handling is not available when using the
multi-stage query engine to insert data.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "mvd_example" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\",
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\":
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\":
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\",
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\":
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ ARRAY_TO_MV("tags") AS "tags"
+FROM "ext"
+PARTITIONED BY DAY
+```
-## Querying multi-value dimensions
-
-The following sections describe filtering and grouping behavior based on the
following example data, which includes a multi-value dimension, `tags`.
-
+These input arrays can also be grouped prior to converting into a multi-value
dimension:
+```sql
+REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+ SELECT *
+ FROM TABLE(
+ EXTERN(
+ '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\",
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\":
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\":
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\",
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\":
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+ '{"type":"json"}',
+ '[{"name":"timestamp", "type":"STRING"},{"name":"label",
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+ )
+ )
+)
+SELECT
+ TIME_PARSE("timestamp") AS "__time",
+ "label",
+ ARRAY_TO_MV("tags") AS "tags",
+ COUNT(*) AS "count"
+FROM "ext"
+GROUP BY 1, 2, "tags"
+PARTITIONED BY DAY
```
-{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
-{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]} #row2
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
+
+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we
only wish to coerce the type _after_ grouping.
+
+
+The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`,
which is also how a query on a Druid table containing a multi-value dimension
would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY`
must be used since the multi-stage engine only supports grouping on multi-value
dimensions as arrays, and so they must be coerced first. These arrays then must
be coerced back into `VARCHAR` in the `SELECT` part of the statement with
`ARRAY_TO_MV`.
Review Comment:
```suggestion
The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`,
which is also how a query on a Druid table containing a multi-value dimension
would specify the type of the `tags` column. If this is the case, you must use
`MV_TO_ARRAY` since the MSQ task engine only supports grouping on multi-value
dimensions as arrays. So, they must be coerced first. These arrays must then be
coerced back into `VARCHAR` in the `SELECT` part of the statement with
`ARRAY_TO_MV`.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]