[druid] branch master updated: Consolidate multi-value dimension doc and highlight configurability (#11428)

shetland Thu, 15 Jul 2021 10:19:40 -0700

This is an automated email from the ASF dual-hosted git repository.

shetland pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new a366753  Consolidate multi-value dimension doc and highlight configurability (#11428)
a366753 is described below

commit a366753ba5803a761f9580434a965b814fd78986
Author: sthetland <[email protected]>
AuthorDate: Thu Jul 15 10:19:10 2021 -0700

    Consolidate multi-value dimension doc and highlight configurability (#11428)
    
    * Clarify options for multi-value dims
    * Add first example
---
 docs/ingestion/data-formats.md          |  5 ---
 docs/querying/multi-value-dimensions.md | 74 +++++++++++++++++++++++----------
 website/i18n/en.json                    |  6 ++-
 3 files changed, 56 insertions(+), 29 deletions(-)

diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index d431fea..002d6d2 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -1408,11 +1408,6 @@ tasks will fail with an exception.
 
 The `columns` field must be included and and ensure that the order of the fields matches the columns of your input data in the same order.
 
-### Multi-value dimensions
-
-Dimensions can have multiple values for TSV and CSV data. To specify the delimiter for a multi-value dimension, set the `listDelimiter` in the `parseSpec`.
-
-JSON data can contain multi-value dimensions as well. The multiple values for a dimension must be formatted as a JSON array in the ingested data. No additional `parseSpec` configuration is needed.
 
 ### Regex ParseSpec
 
diff --git a/docs/querying/multi-value-dimensions.md b/docs/querying/multi-value-dimensions.md
index 09d319b..b5c23ec 100644
--- a/docs/querying/multi-value-dimensions.md
+++ b/docs/querying/multi-value-dimensions.md
@@ -23,20 +23,48 @@ title: "Multi-value dimensions"
   -->
 
 
-Apache Druid supports "multi-value" string dimensions. These are generated when an input field contains an
-array of values instead of a single value (e.g. JSON arrays, or a TSV field containing one or more `listDelimiter`
-characters). By default Druid ingests the values in alphabetical order, see [Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
+Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an
+array of values instead of a single value, such as the `tags` values in the following JSON array example:
 
-This document describes the behavior of groupBy (topN has similar behavior) queries on multi-value dimensions when they
-are used as a dimension being grouped by. See the section on multi-value columns in
-[segments](../design/segments.md#multi-value-columns) for internal representation details. Examples in this document
+```
+{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} 
+```
+
+This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
+[segments documentation](../design/segments.md#multi-value-columns). Examples in this document
 are in the form of [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql.md) for details
 about using multi-value string dimensions in SQL.
 
+## Overview
+
+At ingestion time, Druid can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as multi-value dimensions.
+
+For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `parseSpec` configuration.
+
+The following shows an example multi-value dimension named `tags` in a `dimensionsSpec`:
+
+```
+"dimensions": [
+  {
+    "type": "string",
+    "name": "tags",
+    "multiValueHandling": "SORTED_ARRAY",
+    "createBitmapIndex": true
+  }
+],
+```
+
+By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as:
+
+* `SORTED_SET`: results in the removal of duplicate values
+* `ARRAY`: retains the original order of the values
+
+See [Dimension Objects](../ingestion/index.md#dimension-objects) for information on configuring multi-value handling.
+
+
 ## Querying multi-value dimensions
 
-Suppose, you have a dataSource with a segment that contains the following rows, with a multi-value dimension
-called `tags`.
+The following sections describe filtering and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`.
 
 ```
 {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
@@ -44,6 +72,7 @@ called `tags`.
 {"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
 {"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
 ```
+> Be sure to remove the comments before trying out the sample data. 
 
 ### Filtering
 
@@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value dimensions:
  underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row
   if the underlying filter does not match the row.
 
-For example, this "or" filter would match row1 and row2 of the dataset above, but not row3:
+The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3:
 
 ```
 {
@@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If you
 your filter, you can use a [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
 improve performance.
 
-### Example: GroupBy query with no filtering
+## Example: GroupBy query with no filtering
 
 See [GroupBy querying](groupbyquery.md) for details.
 
@@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details.
 }
 ```
 
-returns following result.
+This query returns the following result:
 
 ```json
 [
@@ -204,9 +233,9 @@ returns following result.
 ]
 ```
 
-notice how original rows are "exploded" into multiple rows and merged.
+Notice that original rows are "exploded" into multiple rows and merged.
 
-### Example: GroupBy query with a selector query filter
+## Example: GroupBy query with a selector query filter
 
 See [query filters](filters.md) for details of selector query filter.
 
@@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector query filter.
 }
 ```
 
-returns following result.
+This query returns the following result:
 
 ```json
 [
@@ -283,17 +312,16 @@ returns following result.
 ]
 ```
 
-You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the results. It happens because query filter is
-applied on the row before explosion. For multi-value dimensions, selector filter for "t3" would match row1 and row2,
-after which exploding is done. For multi-value dimensions, query filter matches a row if any individual value inside
+You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is
+applied on the row before explosion. For multi-value dimensions, a selector filter for "t3" would match row1 and row2,
+after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside
 the multiple values matches the query filter.
 
-### Example: GroupBy query with a selector query filter and additional filter in "dimensions" attributes
+## Example: GroupBy query with selector query and dimension filters
 
-To solve the problem above and to get only rows for "t3" returned, you would have to use a "filtered dimension spec" as
-in the query below.
+To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below.
 
-See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
+See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
 
 ```json
 {
@@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in [dimensionSpecs](dimensionspecs.md#fil
 }
 ```
 
-returns the following result.
+This query returns the following result:
 
 ```json
 [
@@ -345,5 +373,5 @@ returns the following result.
 ```
 
 Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered
-dimensionSpec is much more efficient because that gets applied at the lowest level in the query processing pipeline.
+`dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline.
 Having specs are applied at the outermost level of groupBy query processing.
diff --git a/website/i18n/en.json b/website/i18n/en.json
index e3c8b34..a777e17 100644
--- a/website/i18n/en.json
+++ b/website/i18n/en.json
@@ -220,7 +220,7 @@
         "title": "Amazon Kinesis ingestion",
         "sidebar_label": "Amazon Kinesis"
       },
-      "development/extensions-core/druid-kubernetes": {
+      "development/extensions-core/kubernetes": {
         "title": "Kubernetes"
       },
       "development/extensions-core/lookups-cached-global": {
@@ -327,6 +327,10 @@
       "operations/basic-cluster-tuning": {
         "title": "Basic cluster tuning"
       },
+      "operations/clean-metadata-store": {
+        "title": "Automated cleanup for metadata records",
+        "sidebar_label": "Automated metadata cleanup"
+      },
       "operations/deep-storage-migration": {
         "title": "Deep storage migration"
       },

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
