Re: [PR] better documentation for the differences between arrays and mvds (druid)

via GitHub Thu, 26 Oct 2023 15:34:29 -0700


317brian commented on code in PR #15245:
URL: https://github.com/apache/druid/pull/15245#discussion_r1373839768



##########
docs/querying/multi-value-dimensions.md:
##########
@@ -30,21 +30,36 @@ array of values instead of a single value, such as the 
`tags` values in the foll
 {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} 
 ```
 
-This document describes filtering and grouping behavior for multi-value 
dimensions. For information about the internal representation of multi-value 
dimensions, see
+It is important to be aware that multi-value dimensions are distinct from 
[array types](arrays.md), which behave like standard SQL arrays. This document 
describes the behavior of multi-value dimensions, and some additional details 
can be found in the [SQL data type 
documentation](sql-data-types.md#multi-value-strings-behavior).

Review Comment:
   Slight change to emphasize that they're different:
   
   ```suggestion
   It is important to be aware that multi-value dimensions are distinct from 
[array types](arrays.md). While array types behave like standard SQL arrays, 
multi-value dimensions do not. This document describes the behavior of 
multi-value dimensions, and some additional details can be found in the [SQL 
data type documentation](sql-data-types.md#multi-value-strings-behavior).
   ```



##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions. 
This behavior is contr
 
 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for 
information on configuring multi-value handling.
 
+Multi-value dimensions can also be inserted with [multi-stage 
ingestion](../multi-stage-query/index.md). The multi-stage query engine does 
not have direct handling of class Druid multi-value dimensions. A special pair 
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR 
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable 
handling these types. Multi-value handling is not available when using the 
multi-stage query engine to insert data.

Review Comment:
   ```suggestion
   Multi-value dimensions can also be inserted with [SQL-based ingestion using 
the multi-stage query (MSQ) task engine](../multi-stage-query/index.md). The 
MSQ task engine does not have direct handling of class Druid multi-value 
dimensions. A special pair of functions, `MV_TO_ARRAY` which converts 
multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them 
back into `VARCHAR` exist to enable handling these types. Multi-value handling 
is not available when using the multi-stage query task engine to insert data.
   ```



##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions. 
This behavior is contr
 
 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for 
information on configuring multi-value handling.
 
+Multi-value dimensions can also be inserted with [multi-stage 
ingestion](../multi-stage-query/index.md). The multi-stage query engine does 
not have direct handling of class Druid multi-value dimensions. A special pair 
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR 
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable 
handling these types. Multi-value handling is not available when using the 
multi-stage query engine to insert data.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "mvd_example" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", 
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": 
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": 
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", 
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": 
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+      '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", 
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+    )
+  )
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "label",
+  ARRAY_TO_MV("tags") AS "tags"
+FROM "ext"
+PARTITIONED BY DAY
+```
 
-## Querying multi-value dimensions
-
-The following sections describe filtering and grouping behavior based on the 
following example data, which includes a multi-value dimension, `tags`.
-
+These input arrays can also be grouped prior to converting into a multi-value 
dimension:
+```sql
+REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", 
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": 
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": 
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", 
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": 
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+      '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", 
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+    )
+  )
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "label",
+  ARRAY_TO_MV("tags") AS "tags",
+  COUNT(*) AS "count"
+FROM "ext"
+GROUP BY 1, 2, "tags"
+PARTITIONED BY DAY
 ```
-{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
-{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
+
+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we 
only wish to coerce the type _after_ grouping.

Review Comment:
   ```suggestion
   Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause since we 
only wish to coerce the type _after_ grouping.
   ```



##########
docs/querying/arrays.md:
##########
@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, 
and `DOUBLE` types. Other more complicated ARRAY types must be stored in 
[nested columns](nested-columns.md). Druid ARRAY types are distinct from 
[multi-value dimension](multi-value-dimensions.md), which have significantly 
different behavior than standard arrays.
+
+This document describes inserting, filtering, and grouping behavior for 
`ARRAY` typed columns.
+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and 
[SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior 
based on the following example data, which includes 3 array typed columns.

Review Comment:
   ```suggestion
   The following sections describe inserting, filtering, and grouping behavior 
based on the following example data, which includes 3 array typed columns:
   ```



##########
docs/querying/multi-value-dimensions.md:
##########
@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions. 
This behavior is contr
 
 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for 
information on configuring multi-value handling.
 
+Multi-value dimensions can also be inserted with [multi-stage 
ingestion](../multi-stage-query/index.md). The multi-stage query engine does 
not have direct handling of class Druid multi-value dimensions. A special pair 
of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR 
ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable 
handling these types. Multi-value handling is not available when using the 
multi-stage query engine to insert data.
+
+For example, to insert the data used in this document:
+```sql
+REPLACE INTO "mvd_example" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", 
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": 
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": 
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", 
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": 
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+      '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", 
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+    )
+  )
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "label",
+  ARRAY_TO_MV("tags") AS "tags"
+FROM "ext"
+PARTITIONED BY DAY
+```
 
-## Querying multi-value dimensions
-
-The following sections describe filtering and grouping behavior based on the 
following example data, which includes a multi-value dimension, `tags`.
-
+These input arrays can also be grouped prior to converting into a multi-value 
dimension:
+```sql
+REPLACE INTO "mvd_example_rollup" OVERWRITE ALL
+WITH "ext" AS (
+  SELECT *
+  FROM TABLE(
+    EXTERN(
+      '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", 
\"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": 
\"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": 
[\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", 
\"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": 
\"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}',
+      '{"type":"json"}',
+      '[{"name":"timestamp", "type":"STRING"},{"name":"label", 
"type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]'
+    )
+  )
+)
+SELECT
+  TIME_PARSE("timestamp") AS "__time",
+  "label",
+  ARRAY_TO_MV("tags") AS "tags",
+  COUNT(*) AS "count"
+FROM "ext"
+GROUP BY 1, 2, "tags"
+PARTITIONED BY DAY
 ```
-{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
-{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
+
+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we 
only wish to coerce the type _after_ grouping.
+
+
+The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, 
which is also how a query on a Druid table containing a multi-value dimension 
would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY` 
must be used since the multi-stage engine only supports grouping on multi-value 
dimensions as arrays, and so they must be coerced first. These arrays then must 
be coerced back into `VARCHAR` in the `SELECT` part of the statement with 
`ARRAY_TO_MV`.

Review Comment:
   ```suggestion
   The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, 
which is also how a query on a Druid table containing a multi-value dimension 
would specify the type of the `tags` column. If this is the case, you must use 
`MV_TO_ARRAY` since the MSQ task engine only supports grouping on multi-value 
dimensions as arrays. So, they must be coerced first. These arrays must then be 
coerced back into `VARCHAR` in the `SELECT` part of the statement with 
`ARRAY_TO_MV`.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] better documentation for the differences between arrays and mvds (druid)

Reply via email to