(druid) branch master updated: docs: Migration guide for MVDs to arrays (#16516)

victoria Thu, 13 Jun 2024 13:08:19 -0700

This is an automated email from the ASF dual-hosted git repository.

victoria pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new 836cdb48a5b docs: Migration guide for MVDs to arrays (#16516)
836cdb48a5b is described below

commit 836cdb48a5b9eaaa5ac299a320fc00981f66379b
Author: Victoria Lim <[email protected]>
AuthorDate: Thu Jun 13 13:05:58 2024 -0700

    docs: Migration guide for MVDs to arrays (#16516)
    
    Co-authored-by: Clint Wylie <[email protected]>
    Co-authored-by: Katya Macedo  <[email protected]>
    Co-authored-by: Benedict Jin <[email protected]>
    Co-authored-by: Katya Macedo <[email protected]>
---
 docs/querying/sql-functions.md       |   8 ++
 docs/querying/sql.md                 |   7 +-
 docs/release-info/migr-mvd-array.md  | 246 +++++++++++++++++++++++++++++++++++
 docs/release-info/migration-guide.md |   9 +-
 website/sidebars.json                |   3 +-
 5 files changed, 264 insertions(+), 9 deletions(-)

diff --git a/docs/querying/sql-functions.md b/docs/querying/sql-functions.md
index 883f3b209ac..7151c23b918 100644
--- a/docs/querying/sql-functions.md
+++ b/docs/querying/sql-functions.md
@@ -1143,6 +1143,14 @@ Adds the expression to the beginning of the array.
 
 Returns a slice of the array from the zero-based start and end indexes.
 
+## MV_TO_ARRAY
+
+`MV_TO_ARRAY(str)`
+
+**Function type:** [Multi-value string](sql-multivalue-string-functions.md)
+
+Converts a multi-value string from a `VARCHAR` to a `VARCHAR ARRAY`.
+
 ## MV_TO_STRING
 
 `MV_TO_STRING(arr, str)`
diff --git a/docs/querying/sql.md b/docs/querying/sql.md
index aa228cf15e6..25de2adec42 100644
--- a/docs/querying/sql.md
+++ b/docs/querying/sql.md
@@ -209,7 +209,10 @@ The UNNEST clause unnests ARRAY typed values. The source 
for UNNEST can be an ar
 The following is the general syntax for UNNEST, specifically a query that 
returns the column that gets unnested:
 
 ```sql
-SELECT column_alias_name FROM datasource CROSS JOIN UNNEST(source_expression1) 
AS table_alias_name1(column_alias_name1) CROSS JOIN UNNEST(source_expression2) 
AS table_alias_name2(column_alias_name2) ...
+SELECT column_alias_name
+FROM datasource
+CROSS JOIN UNNEST(source_expression1) AS table_alias_name1(column_alias_name1)
+CROSS JOIN UNNEST(source_expression2) AS table_alias_name2(column_alias_name2) 
...
 ```
 
 * The `datasource` for UNNEST can be any Druid datasource, such as the 
following:
@@ -405,4 +408,4 @@ To solve this issue, explicitly provide the type of the 
dynamic parameter using
 
 ```
 SELECT * FROM druid.foo WHERE dim1 like CONCAT('%', CAST (? AS VARCHAR), '%')
-```
\ No newline at end of file
+```
diff --git a/docs/release-info/migr-mvd-array.md 
b/docs/release-info/migr-mvd-array.md
new file mode 100644
index 00000000000..6ec9e9ff241
--- /dev/null
+++ b/docs/release-info/migr-mvd-array.md
@@ -0,0 +1,246 @@
+---
+id: migr-mvd-array
+title: "Migration guide: MVDs to arrays"
+sidebar_label: MVDs to arrays
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+
+Druid now supports SQL-compliant [arrays](../querying/arrays.md). We recommend 
using arrays over [multi-value 
dimensions](../querying/multi-value-dimensions.md) (MVDs) whenever possible.
+For new projects and complex use cases involving multiple data types, use 
arrays. Use MVDs for specific use cases, such as operating directly on 
individual elements like regular strings. If your operations involve entire 
arrays of values, including the ordering of values within a row, use arrays 
over MVDs.
+
+## Comparison between arrays and MVDs
+
+The following table compares the general behavior between arrays and MVDs.
+For specific query differences between arrays and MVDs, see [Querying arrays 
and MVDs](#querying-arrays-and-mvds).
+
+|  | Array| MVD |
+|---|---|---|
+| Data types | Supports VARCHAR, BIGINT, and DOUBLE types (ARRAY<STRING\>, 
ARRAY<LONG\>, ARRAY<DOUBLE\>) | Only supports arrays of strings (VARCHAR) |
+| SQL compliance | Behaves like standard SQL arrays with SQL-compliant 
behavior | Behaves like SQL VARCHAR rather than standard SQL arrays and 
requires special SQL functions to achieve array-like behavior. See the 
[examples](#examples). |
+| Ingestion | <ul><li>JSON arrays are ingested as Druid arrays</li><li>Managed 
through the query context parameter `arrayIngestMode` in SQL-based ingestion. 
Supported options are `array`, `mvd`, and `none`. Note that if you set this 
mode to `none`, Druid raises an exception if you try to store any type of 
array.</li></ul> | <ul><li>JSON arrays are ingested as MVDs</li><li>Managed 
using functions like [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) in 
SQL-based ingestion</li></ul> |
+| Filtering and grouping | <ul><li>Filters and groupings match the entire 
array value</li><li>Can be used as GROUP BY keys, grouping based on the entire 
array value</li><li>Use the [UNNEST operator](#group-by-array-elements) to 
group based on individual array elements</li></ul> | <ul><li>Filters match any 
value within the array</li><li>Grouping generates a group for each individual 
value, similar to an implicit UNNEST</li></ul> |
+| Conversion | Convert an MVD to an array using 
[MV_TO_ARRAY](../querying/sql-functions.md#mv_to_array) | Convert an array to 
an MVD using [ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv) |
+
+## Querying arrays and MVDs
+
+In SQL queries, Druid operates on arrays differently than MVDs.
+A value in an array column is treated as a single array entity (SQL ARRAY), 
whereas a value in an MVD column is treated as individual strings (SQL VARCHAR).
+This behavior applies even though multiple string values within the same MVD 
are still stored as a single field in the MVD column.
+
+For example, consider the same value, `['a', 'b', 'c']` ingested into an array 
column and an MVD column.
+In your query, you want to filter results by comparing some value with `['a', 
'b', 'c']`.
+
+* For array columns, Druid only returns the row when an equality filter 
matches the entire array.  
+For example: `WHERE "array_column" = ARRAY['a', 'b', 'c']`.
+
+* For MVD columns, Druid returns the row when an equality filter matches any 
value of the MVD.  
+For example, any of the following filters return the row for the query:  
+`WHERE "mvd_column" = 'a'`  
+`WHERE "mvd_column" = 'b'`  
+`WHERE "mvd_column" = 'c'`
+
+Note this difference between arrays and MVDs when you write queries that 
involve filtering or grouping.
+
+When your query applies both filters and grouping, MVDs may return rows that 
don't seem to match the filter,
+since the grouping occurs after Druid applies the filter. For an example, see 
[Filter and group by array elements](#filter-and-group-by-array-elements).
+
+## Examples
+
+The following examples highlight a few analogous queries between arrays and 
MVDs.
+For more information and examples, see [Querying 
arrays](../querying/arrays.md#querying-arrays) and [Querying multi-value 
dimensions](../querying/multi-value-dimensions.md#querying-multi-value-dimensions).
+
+### Filter by an array element
+
+Filter rows that have a certain value in the array or MVD.
+
+#### Array
+
+```sql
+SELECT label, tags
+FROM "array_example"
+WHERE ARRAY_CONTAINS(tags, 't3')
+```
+
+#### MVD
+
+```sql
+SELECT label, tags
+FROM "mvd_example"
+WHERE tags = 't3'
+```
+
+### Filter by one or more elements
+
+Filter rows for which the array or MVD contains one or more elements.
+Notice that [ARRAY_OVERLAP](../querying/sql-functions.md#array_overlap) checks 
for any overlapping elements, whereas 
[ARRAY_CONTAINS](../querying/sql-functions.md#array_contains) in the previous 
example checks that all elements are included.
+
+#### Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE ARRAY_OVERLAP(tags, ARRAY['t1', 't7'])
+```
+
+#### MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE tags = 't1' OR tags = 't7'
+```
+
+### Filter using array equality
+
+Filter rows for which the array or MVD is equivalent to a reference array.
+
+#### Array
+
+```sql
+SELECT *
+FROM "array_example"
+WHERE tags = ARRAY['t1', 't2', 't3']
+```
+
+#### MVD
+
+```sql
+SELECT *
+FROM "mvd_example"
+WHERE MV_TO_ARRAY(tags) = ARRAY['t1', 't2', 't3']
+```
+
+### Group results by array
+
+Group results by the array or MVD.
+
+#### Array
+
+```sql
+SELECT label, tags
+FROM "array_example"
+GROUP BY 1, 2
+```
+
+#### MVD
+
+```sql
+SELECT label, MV_TO_ARRAY(tags)
+FROM "mvd_example"
+GROUP BY 1, 2
+```
+
+### Group by array elements
+
+Group results by individual array or MVD elements.
+
+#### Array
+
+```sql
+SELECT label, strings
+FROM "array_example" CROSS JOIN UNNEST(tags) as u(strings)
+GROUP BY 1, 2
+```
+
+#### MVD
+
+```sql
+SELECT label, tags
+FROM "mvd_example"
+GROUP BY 1, 2
+```
+
+### Filter and group by array elements
+
+Filter rows that have a certain value, then group by elements in the array or 
MVD.
+This example illustrates that while the results of filtering may match between 
arrays and MVDs,
+be aware that MVDs implicitly unnest their values so that results differ when 
you also apply a GROUP BY.
+
+For example, consider the queries from [Filter by an array 
element](#filter-by-an-array-element).
+Both queries return the following rows:
+
+```json
+{"label":"row1","tags":["t1","t2","t3"]}
+{"label":"row2","tags":["t3","t4","t5"]}
+```
+
+However, adding `GROUP BY 1, 2` to both queries changes the output.
+The two queries are now:
+
+```sql
+-- Array
+SELECT label, tags
+FROM "array_example"
+WHERE ARRAY_CONTAINS(tags, 't3')
+GROUP BY 1, 2
+
+-- MVD
+SELECT label, tags
+FROM "mvd_example"
+WHERE tags = 't3'
+GROUP BY 1, 2
+```
+
+The array query returns the following:
+
+```json
+{"label":"row1","tags":["t1","t2","t3"]}
+{"label":"row2","tags":["t3","t4","t5"]}
+```
+
+The MVD query returns the following:
+
+```json
+{"label":"row1","tags":"t1"}
+{"label":"row1","tags":"t2"}
+{"label":"row1","tags":"t3"}
+{"label":"row2","tags":"t3"}
+{"label":"row2","tags":"t4"}
+{"label":"row2","tags":"t5"}
+```
+
+The MVD results appear to show four extra rows for which `tags` does not equal 
`t3`.
+However, the rows match the filter based on how Druid evaluates equalities for 
MVDs.
+
+For the equivalent query on MVDs, use the 
[MV_FILTER_ONLY](../querying/sql-functions.md#mv_filter_only) function:
+
+```sql
+SELECT label, MV_FILTER_ONLY(tags, ARRAY['t3'])
+FROM "mvd_example"
+WHERE tags = 't3'
+GROUP BY 1, 2
+```
+
+
+## How to ingest data as arrays
+
+You can ingest arrays in Druid as follows:
+
+* For native batch and streaming ingestion, configure the dimensions in 
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec).
+Within `dimensionsSpec`, set `"useSchemaDiscovery": true`, and use 
`dimensions` to list the array inputs with type `auto`.  
+For an example, see [Ingesting arrays: Native batch and streaming 
ingestion](../querying/arrays.md#native-batch-and-streaming-ingestion).
+
+* For SQL-based batch ingestion, include the [query context 
parameter](../multi-stage-query/reference.md#context-parameters) 
`"arrayIngestMode": "array"` and reference the relevant array type (`VARCHAR 
ARRAY`, `BIGINT ARRAY`, or `DOUBLE ARRAY`) in the [EXTEND 
clause](../multi-stage-query/reference.md#extern-function) that lists the 
column names and data types.
+For examples, see [Ingesting arrays: SQL-based 
ingestion](../querying/arrays.md#sql-based-ingestion).
+
+   As a best practice, always use the ARRAY data type in your input schema. If 
you want to ingest MVDs, explicitly wrap the string array in 
[ARRAY_TO_MV](../querying/sql-functions.md#array_to_mv). For an example, see 
[Multi-value dimensions: SQL-based 
ingestion](/querying/multi-value-dimensions.md#sql-based-ingestion).
+
diff --git a/docs/release-info/migration-guide.md 
b/docs/release-info/migration-guide.md
index 7b14e1bd296..ca31fce327f 100644
--- a/docs/release-info/migration-guide.md
+++ b/docs/release-info/migration-guide.md
@@ -25,15 +25,12 @@ description: How to migrate from legacy features to get the 
most from Druid upda
 
 In general, when we introduce new features and behaviors into Apache Druid, we 
make every effort to avoid breaking existing features when introducing new 
behaviors. However, sometimes there are either bugs or performance limitations 
with the old behaviors that are not possible to fix in a backward-compatible 
way. In these cases, we must introduce breaking changes for the future 
maintainability of Druid. 
 
-The guides in this section outline breaking changes introduced in Druid 25 and 
later. Each guide provides instructions to migrate to new features.
+The guides in this section outline breaking changes introduced in Druid 25.0.0 
and later. Each guide provides instructions to migrate to new features.
 
-<!--
 
 ## Migrate to arrays from multi-value dimensions
 
-Druid now supports SQL-compliant array types. Whenever possible, you should 
use the array type over multi-value dimensions. See []()>.
-
--->
+Druid now supports SQL-compliant array types. Whenever possible, you should 
use the array type over multi-value dimensions. See [Migration guide: MVDs to 
arrays](migr-mvd-array.md).
 
 ## Migrate to front-coded dictionary encoding
 
@@ -41,4 +38,4 @@ Druid encodes string columns into dictionaries for better 
compression. Front-cod
 
 ## Migrate to `maxSubqueryBytes` from `maxSubqueryRows`
 
-Druid allows you to set a byte-based limit on subquery size to prevent Brokers 
from running out of memory when handling large subqueries. The byte-based 
subquery limit overrides Druid's row-based subquery limit. We recommend that 
you move towards using byte-based limits starting in Druid 30.0. See [Migration 
guide: subquery limit](migr-subquery-limit.md) for more information.
\ No newline at end of file
+Druid allows you to set a byte-based limit on subquery size to prevent Brokers 
from running out of memory when handling large subqueries. The byte-based 
subquery limit overrides Druid's row-based subquery limit. We recommend that 
you move towards using byte-based limits starting in Druid 30.0.0. See 
[Migration guide: subquery limit](migr-subquery-limit.md) for more information.
diff --git a/website/sidebars.json b/website/sidebars.json
index fb32852629b..fff44585331 100644
--- a/website/sidebars.json
+++ b/website/sidebars.json
@@ -375,7 +375,8 @@
       },
       "items": [
         "release-info/migr-front-coded-dict",
-        "release-info/migr-subquery-limit"
+        "release-info/migr-subquery-limit",
+        "release-info/migr-mvd-array"
       ]
     }
   ]},


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(druid) branch master updated: docs: Migration guide for MVDs to arrays (#16516)

Reply via email to