This is an automated email from the ASF dual-hosted git repository.
shetland pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new a366753 Consolidate multi-value dimension doc and highlight
configurability (#11428)
a366753 is described below
commit a366753ba5803a761f9580434a965b814fd78986
Author: sthetland <[email protected]>
AuthorDate: Thu Jul 15 10:19:10 2021 -0700
Consolidate multi-value dimension doc and highlight configurability
(#11428)
* Clarify options for multi-value dims
* Add first example
---
docs/ingestion/data-formats.md | 5 ---
docs/querying/multi-value-dimensions.md | 74 +++++++++++++++++++++++----------
website/i18n/en.json | 6 ++-
3 files changed, 56 insertions(+), 29 deletions(-)
diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index d431fea..002d6d2 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -1408,11 +1408,6 @@ tasks will fail with an exception.
The `columns` field must be included and and ensure that the order of the
fields matches the columns of your input data in the same order.
-### Multi-value dimensions
-
-Dimensions can have multiple values for TSV and CSV data. To specify the
delimiter for a multi-value dimension, set the `listDelimiter` in the
`parseSpec`.
-
-JSON data can contain multi-value dimensions as well. The multiple values for
a dimension must be formatted as a JSON array in the ingested data. No
additional `parseSpec` configuration is needed.
### Regex ParseSpec
diff --git a/docs/querying/multi-value-dimensions.md
b/docs/querying/multi-value-dimensions.md
index 09d319b..b5c23ec 100644
--- a/docs/querying/multi-value-dimensions.md
+++ b/docs/querying/multi-value-dimensions.md
@@ -23,20 +23,48 @@ title: "Multi-value dimensions"
-->
-Apache Druid supports "multi-value" string dimensions. These are generated
when an input field contains an
-array of values instead of a single value (e.g. JSON arrays, or a TSV field
containing one or more `listDelimiter`
-characters). By default Druid ingests the values in alphabetical order, see
[Dimension Objects](../ingestion/index.md#dimension-objects) for configuration.
+Apache Druid supports "multi-value" string dimensions. Multi-value string
dimensions result from input fields that contain an
+array of values instead of a single value, such as the `tags` values in the
following JSON array example:
-This document describes the behavior of groupBy (topN has similar behavior)
queries on multi-value dimensions when they
-are used as a dimension being grouped by. See the section on multi-value
columns in
-[segments](../design/segments.md#multi-value-columns) for internal
representation details. Examples in this document
+```
+{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}
+```
+
+This document describes filtering and grouping behavior for multi-value
dimensions. For information about the internal representation of multi-value
dimensions, see
+[segments documentation](../design/segments.md#multi-value-columns). Examples
in this document
are in the form of [native Druid queries](querying.md). Refer to the [Druid
SQL documentation](sql.md) for details
about using multi-value string dimensions in SQL.
+## Overview
+
+At ingestion time, Druid can detect multi-value dimensions and configure the
`dimensionsSpec` accordingly. It detects JSON arrays or CSV/TSV fields as
multi-value dimensions.
+
+For TSV or CSV data, you can specify the multi-value delimiters using the
`listDelimiter` field in the `parseSpec`. JSON data must be formatted as a JSON
array to be ingested as a multi-value dimension. JSON data does not require
`parseSpec` configuration.
+
+The following shows an example multi-value dimension named `tags` in a
`dimensionsSpec`:
+
+```
+"dimensions": [
+ {
+ "type": "string",
+ "name": "tags",
+ "multiValueHandling": "SORTED_ARRAY",
+ "createBitmapIndex": true
+ }
+],
+```
+
+By default, Druid sorts values in multi-value dimensions. This behavior is
controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field.
Alternatively, you can specify multi-value handling as:
+
+* `SORTED_SET`: results in the removal of duplicate values
+* `ARRAY`: retains the original order of the values
+
+See [Dimension Objects](../ingestion/index.md#dimension-objects) for
information on configuring multi-value handling.
+
+
## Querying multi-value dimensions
-Suppose, you have a dataSource with a segment that contains the following
rows, with a multi-value dimension
-called `tags`.
+The following sections describe filtering and grouping behavior based on the
following example data, which includes a multi-value dimension, `tags`.
```
{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} #row1
@@ -44,6 +72,7 @@ called `tags`.
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]} #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []} #row4
```
+> Be sure to remove the comments before trying out the sample data.
### Filtering
@@ -58,7 +87,7 @@ dimensions. Filters follow these rules on multi-value
dimensions:
underlying filters match that row; "or" matches a row if any underlying
filters match that row; "not" matches a row
if the underlying filter does not match the row.
-For example, this "or" filter would match row1 and row2 of the dataset above,
but not row3:
+The following example illustrates these rules. This query applies an "or"
filter to match row1 and row2 of the dataset above, but not row3:
```
{
@@ -118,7 +147,7 @@ only row1, and generate a result with three groups: `t1`,
`t2`, and `t3`. If you
your filter, you can use a [filtered
dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also
improve performance.
-### Example: GroupBy query with no filtering
+## Example: GroupBy query with no filtering
See [GroupBy querying](groupbyquery.md) for details.
@@ -148,7 +177,7 @@ See [GroupBy querying](groupbyquery.md) for details.
}
```
-returns following result.
+This query returns the following result:
```json
[
@@ -204,9 +233,9 @@ returns following result.
]
```
-notice how original rows are "exploded" into multiple rows and merged.
+Notice that original rows are "exploded" into multiple rows and merged.
-### Example: GroupBy query with a selector query filter
+## Example: GroupBy query with a selector query filter
See [query filters](filters.md) for details of selector query filter.
@@ -241,7 +270,7 @@ See [query filters](filters.md) for details of selector
query filter.
}
```
-returns following result.
+This query returns the following result:
```json
[
@@ -283,17 +312,16 @@ returns following result.
]
```
-You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the
results. It happens because query filter is
-applied on the row before explosion. For multi-value dimensions, selector
filter for "t3" would match row1 and row2,
-after which exploding is done. For multi-value dimensions, query filter
matches a row if any individual value inside
+You might be surprised to see "t1", "t2", "t4" and "t5" included in the
results. This is because the query filter is
+applied on the row before explosion. For multi-value dimensions, a selector
filter for "t3" would match row1 and row2,
+after which exploding is done. For multi-value dimensions, a query filter
matches a row if any individual value inside
the multiple values matches the query filter.
-### Example: GroupBy query with a selector query filter and additional filter
in "dimensions" attributes
+## Example: GroupBy query with selector query and dimension filters
-To solve the problem above and to get only rows for "t3" returned, you would
have to use a "filtered dimension spec" as
-in the query below.
+To solve the problem above and to get only rows for "t3", use a "filtered
dimension spec", as in the query below.
-See section on filtered dimensionSpecs in
[dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
+See filtered `dimensionSpecs` in
[dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details.
```json
{
@@ -330,7 +358,7 @@ See section on filtered dimensionSpecs in
[dimensionSpecs](dimensionspecs.md#fil
}
```
-returns the following result.
+This query returns the following result:
```json
[
@@ -345,5 +373,5 @@ returns the following result.
```
Note that, for groupBy queries, you could get similar result with a [having
spec](having.md) but using a filtered
-dimensionSpec is much more efficient because that gets applied at the lowest
level in the query processing pipeline.
+`dimensionSpec` is much more efficient because that gets applied at the lowest
level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
diff --git a/website/i18n/en.json b/website/i18n/en.json
index e3c8b34..a777e17 100644
--- a/website/i18n/en.json
+++ b/website/i18n/en.json
@@ -220,7 +220,7 @@
"title": "Amazon Kinesis ingestion",
"sidebar_label": "Amazon Kinesis"
},
- "development/extensions-core/druid-kubernetes": {
+ "development/extensions-core/kubernetes": {
"title": "Kubernetes"
},
"development/extensions-core/lookups-cached-global": {
@@ -327,6 +327,10 @@
"operations/basic-cluster-tuning": {
"title": "Basic cluster tuning"
},
+ "operations/clean-metadata-store": {
+ "title": "Automated cleanup for metadata records",
+ "sidebar_label": "Automated metadata cleanup"
+ },
"operations/deep-storage-migration": {
"title": "Deep storage migration"
},
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]