gianm commented on code in PR #15245: URL: https://github.com/apache/druid/pull/15245#discussion_r1372752376
########## docs/querying/arrays.md: ########## @@ -0,0 +1,228 @@ +--- +id: arrays +title: "Array columns" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + + +Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimension](multi-value-dimensions.md), which have significantly different behavior than standard arrays. + +This document describes inserting, filtering, and grouping behavior for `ARRAY` typed columns. +Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details +about the functions available to use with ARRAY columns and types in SQL. + +The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns. + +```json lines +{"timestamp": "2023-01-01T00:00:00", "label": "row1", "arrayString": ["a", "b"], "arrayLong":[1, null,3], "arrayDouble":[1.1, 2.2, null]} +{"timestamp": "2023-01-01T00:00:00", "label": "row2", "arrayString": [null, "b"], "arrayLong":null, "arrayDouble":[999, null, 5.5]} +{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [], "arrayLong":[1, 2, 3], "arrayDouble":[null, 2.2, 1.1]} +{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a", "b"], "arrayLong":[1, 2, 3], "arrayDouble":[]} +{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null, "arrayLong":[], "arrayDouble":null} +``` Review Comment: Somewhere around here we should have a section "String arrays vs. multi-value dimensions" that sets people straight about the differences. Suggested text: > Avoid confusing string arrays with multi-value dimensions (link to MVD docs). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature. > > Use care during ingestion to ensure you get the type you want. > > To get arrays when performing an ingestion using JSON ingestion specs, such as native batch (link) or streaming ingestion (link), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a SQL-based ingestion, write a query that generates arrays and set the context parameter `arrayIngestMode: array`. Arrays may contain strings or numbers. > > To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a SQL-based ingestion, wrap arrays in `ARRAY_TO_MV` (link to examples), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings. > > You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like `SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'mytable'`. Arrays are type `ARRAY`, multi-value strings are type `VARCHAR`. I suggest including the same exact text in `multi-value-dimensions.md`, or at least linking to this section prominently. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
