This is an automated email from the ASF dual-hosted git repository.
jonwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git
The following commit(s) were added to refs/heads/master by this push:
new 3b84246 add SQL docs for multi-value string dimensions (#8011)
3b84246 is described below
commit 3b84246cd6bbc799915c53fda2659b70ea227eb8
Author: Clint Wylie <[email protected]>
AuthorDate: Wed Jul 3 08:22:33 2019 -0700
add SQL docs for multi-value string dimensions (#8011)
* add SQL docs for multi-value string dimensions
* formatting consistency
* fix typo
* adjust
---
docs/content/misc/math-expr.md | 59 ++++++-----
docs/content/querying/multi-value-dimensions.md | 28 ++++--
docs/content/querying/sql.md | 126 +++++++++++++++---------
3 files changed, 126 insertions(+), 87 deletions(-)
diff --git a/docs/content/misc/math-expr.md b/docs/content/misc/math-expr.md
index 2f01069..3fab03a 100644
--- a/docs/content/misc/math-expr.md
+++ b/docs/content/misc/math-expr.md
@@ -54,14 +54,14 @@ begin with a digit. To escape other special characters, you
can quote it with do
For logical operators, a number is true if and only if it is positive (0 or
negative value means false). For string
type, it's the evaluation result of 'Boolean.valueOf(string)'.
-Multi-value string dimensions are supported and may be treated as either
scalar or array typed values. When treated as
-a scalar type, an expression will automatically be transformed to apply the
scalar operation across all values of the
-multi-valued type, to mimic Druid's native behavior. Values that result in
arrays will be coerced back into the native
-Druid string type for aggregation. Druid aggregations on multi-value string
dimensions on the individual values, _not_
-the 'array', behaving similar to the `unnest` operator available in many SQL
dialects. However, by using the
-`array_to_string` function, aggregations may be done on a stringified version
of the complete array, allowing the
-complete row to be preserved. Using `string_to_array` in an expression
post-aggregator, allows transforming the
-stringified dimension back into the true native array type.
+[Multi-value string dimensions](../querying/multi-value-dimensions.html) are
supported and may be treated as either
+scalar or array typed values. When treated as a scalar type, an expression
will automatically be transformed to apply
+the scalar operation across all values of the multi-valued type, to mimic
Druid's native behavior. Values that result in
+arrays will be coerced back into the native Druid string type for aggregation.
Druid aggregations on multi-value string
+dimensions operate on the individual values, _not_ the 'array', behaving similarly to
the `UNNEST` operator available in many SQL
+dialects. However, by using the `array_to_string` function, aggregations may
be done on a stringified version of the
+complete array, allowing the complete row to be preserved. Using
`string_to_array` in an expression post-aggregator,
+allows transforming the stringified dimension back into the true native array
type.
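The aggregation behavior described above can be sketched in Python (an illustrative model of the documented semantics, not Druid code; the helper names are hypothetical):

```python
# Sketch of multi-value aggregation semantics: aggregations see the individual
# values (UNNEST-like), unless the array is first stringified with
# array_to_string; string_to_array can recover the array afterwards.

def unnest(rows, dim):
    """Explode each row's multi-value dimension into one value per row."""
    return [v for row in rows for v in row[dim]]

def array_to_string(arr, delimiter):
    """Stringify an array so the complete row survives aggregation."""
    return delimiter.join(arr)

def string_to_array(s, delimiter):
    """Recover the array from its stringified form (e.g. in a post-aggregator)."""
    return s.split(delimiter)

rows = [{"tags": ["t1", "t2", "t3"]}, {"tags": ["t3", "t4"]}]

# Native-style aggregation observes 5 individual values, not 2 arrays.
individual_values = unnest(rows, "tags")

# Stringifying first preserves one value per row instead.
stringified = [array_to_string(row["tags"], ",") for row in rows]
roundtrip = [string_to_array(s, ",") for s in stringified]
```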
The following built-in functions are available.
@@ -168,31 +168,30 @@ See javadoc of java.lang.Math for detailed explanation
for each function.
| function | description |
| --- | --- |
-| `array(expr1,expr ...)` | constructs an array from the expression arguments,
using the type of the first argument as the output array type |
-| `array_length(arr)` | returns length of array expression |
-| `array_offset(arr,long)` | returns the array element at the 0 based index
supplied, or null for an out of range index|
-| `array_ordinal(arr,long)` | returns the array element at the 1 based index
supplied, or null for an out of range index |
-| `array_contains(arr,expr)` | returns 1 if the array contains the element
specified by expr, or contains all elements specified by expr if expr is an
array, else 0 |
-| `array_overlap(arr1,arr2)` | returns 1 if arr1 and arr2 have any elements in
common, else 0 |
-| `array_offset_of(arr,expr)` | returns the 0 based index of the first
occurrence of expr in the array, or `null` if no matching elements exist in the
array. |
-| `array_ordinal_of(arr,expr)` | returns the 1 based index of the first
occurrence of expr in the array, or `null` if no matching elements exist in the
array. |
-| `array_prepend(expr,arr)` | adds expr to arr at the beginning, the resulting
array type determined by the type of the array |
-| `array_append(arr1,expr)` | appends expr to arr, the resulting array type
determined by the type of the first array |
-| `array_concat(arr1,arr2)` | concatenates 2 arrays, the resulting array type
determined by the type of the first array |
-| `array_slice(arr,start,end)` | return the subarray of arr from the 0 based
index start(inclusive) to end(exclusive), or `null`, if start is less than 0,
greater than length of arr or less than end|
-| `array_to_string(arr,str)` | joins all elements of arr by the delimiter
specified by str |
-| `string_to_array(str1,str2)` | splits str1 into an array on the delimiter
specified by str2 |
-
+| array(expr1,expr ...) | constructs an array from the expression arguments,
using the type of the first argument as the output array type |
+| array_length(arr) | returns length of array expression |
+| array_offset(arr,long) | returns the array element at the 0 based index
supplied, or null for an out of range index|
+| array_ordinal(arr,long) | returns the array element at the 1 based index
supplied, or null for an out of range index |
+| array_contains(arr,expr) | returns 1 if the array contains the element
specified by expr, or contains all elements specified by expr if expr is an
array, else 0 |
+| array_overlap(arr1,arr2) | returns 1 if arr1 and arr2 have any elements in
common, else 0 |
+| array_offset_of(arr,expr) | returns the 0 based index of the first
occurrence of expr in the array, or `-1` (or `null` if
`druid.generic.useDefaultValueForNull=false`) if no matching elements exist in
the array. |
+| array_ordinal_of(arr,expr) | returns the 1 based index of the first
occurrence of expr in the array, or `-1` (or `null` if
`druid.generic.useDefaultValueForNull=false`) if no matching elements exist in
the array. |
+| array_prepend(expr,arr) | adds expr to arr at the beginning, the resulting
array type determined by the type of the array |
+| array_append(arr1,expr) | appends expr to arr, the resulting array type
determined by the type of the first array |
+| array_concat(arr1,arr2) | concatenates 2 arrays, the resulting array type
determined by the type of the first array |
+| array_slice(arr,start,end) | returns the subarray of arr from the 0 based
index start (inclusive) to end (exclusive), or `null` if start is less than 0,
greater than the length of arr, or less than end |
+| array_to_string(arr,str) | joins all elements of arr by the delimiter
specified by str |
+| string_to_array(str1,str2) | splits str1 into an array on the delimiter
specified by str2 |
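The indexing conventions in the table above can be modeled in Python (a hypothetical stand-in, not the Druid implementation; the `use_default_value_for_null` flag mirrors the `druid.generic.useDefaultValueForNull` setting):

```python
# array_offset is 0-based, array_ordinal is 1-based; both yield null (None)
# for an out-of-range index. array_offset_of returns -1 in default-value mode
# or None in SQL-compatible null-handling mode when no element matches.

def array_offset(arr, index):
    return arr[index] if 0 <= index < len(arr) else None

def array_ordinal(arr, ordinal):
    return arr[ordinal - 1] if 1 <= ordinal <= len(arr) else None

def array_offset_of(arr, expr, use_default_value_for_null=True):
    for i, v in enumerate(arr):
        if v == expr:
            return i
    return -1 if use_default_value_for_null else None
```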
## Apply Functions
| function | description |
| --- | --- |
-| `map(lambda,arr)` | applies a transform specified by a single argument
lambda expression to all elements of arr, returning a new array |
-| `cartesian_map(lambda,arr1,arr2,...)` | applies a transform specified by a
multi argument lambda expression to all elements of the cartesian product of
all input arrays, returning a new array; the number of lambda arguments and
array inputs must be the same |
-| `filter(lambda,arr)` | filters arr by a single argument lambda, returning a
new array with all matching elements, or null if no elements match |
-| `fold(lambda,arr)` | folds a 2 argument lambda across arr. The first
argument of the lambda is the array element and the second the accumulator,
returning a single accumulated value. |
-| `cartesian_fold(lambda,arr1,arr2,...)` | folds a multi argument lambda
across the cartesian product of all input arrays. The first arguments of the
lambda is the array element and the last is the accumulator, returning a single
accumulated value. |
-| `any(lambda,arr)` | returns 1 if any element in the array matches the lambda
expression, else 0 |
-| `all(lambda,arr)` | returns 1 if all elements in the array matches the
lambda expression, else 0 |
+| map(lambda,arr) | applies a transform specified by a single argument lambda
expression to all elements of arr, returning a new array |
+| cartesian_map(lambda,arr1,arr2,...) | applies a transform specified by a
multi argument lambda expression to all elements of the cartesian product of
all input arrays, returning a new array; the number of lambda arguments and
array inputs must be the same |
+| filter(lambda,arr) | filters arr by a single argument lambda, returning a
new array with all matching elements, or null if no elements match |
+| fold(lambda,arr) | folds a 2 argument lambda across arr. The first argument
of the lambda is the array element and the second the accumulator, returning a
single accumulated value. |
+| cartesian_fold(lambda,arr1,arr2,...) | folds a multi argument lambda across
the cartesian product of all input arrays. The first arguments of the lambda are
the array elements and the last is the accumulator, returning a single
accumulated value. |
+| any(lambda,arr) | returns 1 if any element in the array matches the lambda
expression, else 0 |
+| all(lambda,arr) | returns 1 if all elements in the array match the lambda
expression, else 0 |
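A rough Python analogue of the apply functions above (assumptions, not Druid internals: Python lambdas stand in for Druid lambda expressions, and `fold` takes an explicit initial accumulator for illustration):

```python
from itertools import product

def map_fn(lam, arr):
    # apply a single-argument lambda to every element, returning a new array
    return [lam(x) for x in arr]

def cartesian_map(lam, *arrays):
    # apply a multi-argument lambda across the cartesian product of all inputs
    return [lam(*combo) for combo in product(*arrays)]

def filter_fn(lam, arr):
    # keep matching elements; null (None) when nothing matches
    matched = [x for x in arr if lam(x)]
    return matched or None

def fold(lam, arr, acc):
    # lambda receives (element, accumulator), returning one accumulated value
    for x in arr:
        acc = lam(x, acc)
    return acc
```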
diff --git a/docs/content/querying/multi-value-dimensions.md
b/docs/content/querying/multi-value-dimensions.md
index 04c7357..ce29451 100644
--- a/docs/content/querying/multi-value-dimensions.md
+++ b/docs/content/querying/multi-value-dimensions.md
@@ -24,12 +24,15 @@ title: "Multi-value dimensions"
# Multi-value dimensions
-Apache Druid (incubating) supports "multi-value" string dimensions. These are
generated when an input field contains an array of values
-instead of a single value (e.e. JSON arrays, or a TSV field containing one or
more `listDelimiter` characters).
+Apache Druid (incubating) supports "multi-value" string dimensions. These are
generated when an input field contains an
+array of values instead of a single value (e.g. JSON arrays, or a TSV field
containing one or more `listDelimiter`
+characters).
This document describes the behavior of groupBy (topN has similar behavior)
queries on multi-value dimensions when they
are used as a dimension being grouped by. See the section on multi-value
columns in
-[segments](../design/segments.html#multi-value-columns) for internal
representation details.
+[segments](../design/segments.html#multi-value-columns) for internal
representation details. Examples in this document
+are in the form of [native Druid queries](querying.html). Refer to the [Druid
SQL documentation](sql.html) for details
+about using multi-value string dimensions in SQL.
## Querying multi-value dimensions
@@ -109,9 +112,10 @@ This "selector" filter would match row4 of the dataset
above:
### Grouping
topN and groupBy queries can group on multi-value dimensions. When grouping on
a multi-value dimension, _all_ values
-from matching rows will be used to generate one group per value. It's possible
for a query to return more groups than
-there are rows. For example, a topN on the dimension `tags` with filter `"t1"
AND "t3"` would match only row1, and
-generate a result with three groups: `t1`, `t2`, and `t3`. If you only need to
include values that match
+from matching rows will be used to generate one group per value. This can be
thought of as the equivalent of the
+`UNNEST` operator used on an `ARRAY` type that many SQL dialects support. This
means it's possible for a query to return
+more groups than there are rows. For example, a topN on the dimension `tags`
with filter `"t1" AND "t3"` would match
+only row1, and generate a result with three groups: `t1`, `t2`, and `t3`. If
you only need to include values that match
your filter, you can use a [filtered
dimensionSpec](dimensionspecs.html#filtered-dimensionspecs). This can also
improve performance.
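The one-group-per-value behavior can be sketched in Python (an illustrative model only, not a Druid query):

```python
# A topN or groupBy on a multi-value dimension produces one group per value of
# each matching row, so a single matching row can yield several groups -- more
# groups than rows.
from collections import Counter

def group_on_mv_dimension(rows, dim):
    counts = Counter()
    for row in rows:
        for value in row[dim]:  # every value of a matching row forms a group
            counts[value] += 1
    return counts

# row1 from the document's example dataset: tags = [t1, t2, t3]
row1 = {"tags": ["t1", "t2", "t3"]}
groups = group_on_mv_dimension([row1], "tags")  # 1 row -> 3 groups
```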
@@ -280,11 +284,15 @@ returns following result.
]
```
-You might be surprised to see inclusion of "t1", "t2", "t4" and "t5" in the
results. It happens because query filter is applied on the row before
explosion. For multi-value dimensions, selector filter for "t3" would match
row1 and row2, after which exploding is done. For multi-value dimensions, query
filter matches a row if any individual value inside the multiple values matches
the query filter.
+You might be surprised to see the inclusion of "t1", "t2", "t4" and "t5" in the
results. This happens because the query filter is
+applied to the row before explosion. For multi-value dimensions, a selector
filter for "t3" would match row1 and row2,
+after which exploding is done. For multi-value dimensions, a query filter
matches a row if any individual value among
the multiple values matches the query filter.
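A small Python model of this "filter before explosion" order (hypothetical helpers, not Druid code):

```python
# The selector filter matches whole rows (a row matches if ANY of its values
# equals the filter value); only afterwards are values exploded into groups,
# which is why t1, t2, t4, and t5 appear alongside t3 in the results.

def selector_matches(row, dim, value):
    return value in row[dim]

def filtered_groups(rows, dim, filter_value):
    matching = [r for r in rows if selector_matches(r, dim, filter_value)]
    return sorted({v for r in matching for v in r[dim]})

rows = [
    {"tags": ["t1", "t2", "t3"]},  # row1
    {"tags": ["t3", "t4", "t5"]},  # row2
]
result = filtered_groups(rows, "tags", "t3")
```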
### Example: GroupBy query with a selector query filter and additional filter
in "dimensions" attributes
-To solve the problem above and to get only rows for "t3" returned, you would
have to use a "filtered dimension spec" as in the query below.
+To solve the problem above and to get only rows for "t3" returned, you would
have to use a "filtered dimension spec" as
+in the query below.
See section on filtered dimensionSpecs in
[dimensionSpecs](dimensionspecs.html#filtered-dimensionspecs) for details.
@@ -337,4 +345,6 @@ returns the following result.
]
```
-Note that, for groupBy queries, you could get similar result with a [having
spec](having.html) but using a filtered dimensionSpec is much more efficient
because that gets applied at the lowest level in the query processing pipeline.
Having specs are applied at the outermost level of groupBy query processing.
+Note that, for groupBy queries, you could get a similar result with a [having
spec](having.html) but using a filtered
+dimensionSpec is much more efficient because that gets applied at the lowest
level in the query processing pipeline.
+Having specs are applied at the outermost level of groupBy query processing.
diff --git a/docs/content/querying/sql.md b/docs/content/querying/sql.md
index 1f53c70..92b9e41 100644
--- a/docs/content/querying/sql.md
+++ b/docs/content/querying/sql.md
@@ -42,6 +42,61 @@ queries on the query Broker (the first process you query),
which are then passed
queries. Other than the (slight) overhead of translating SQL on the Broker,
there isn't an additional performance
penalty versus native queries.
+## Data types and casts
+
+Druid natively supports five basic column types: "long" (64 bit signed int),
"float" (32 bit float), "double" (64 bit
+float), "string" (UTF-8 encoded strings and string arrays), and "complex"
(catch-all for more exotic data types like
+hyperUnique and approxHistogram columns).
+
+Timestamps (including the `__time` column) are treated by Druid as longs, with
the value being the number of
+milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds.
Therefore, timestamps in Druid do not carry any
+timezone information, but only carry information about the exact moment in
time they represent. See the
+[Time functions](#time-functions) section for more information about timestamp
handling.
+
+Druid generally treats NULLs and empty strings interchangeably, rather than
according to the SQL standard. As such,
+Druid SQL only has partial support for NULLs. For example, the expressions
`col IS NULL` and `col = ''` are equivalent,
+and both will evaluate to true if `col` contains an empty string. Similarly,
the expression `COALESCE(col1, col2)` will
+return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator
counts all rows, the `COUNT(expr)`
+aggregator will count the number of rows where expr is neither null nor the
empty string. String columns in Druid are
+NULLable. Numeric columns are NOT NULL; if you query a numeric column that is
not present in all segments of your Druid
+datasource, then it will be treated as zero for rows from those segments.
+
+For mathematical operations, Druid SQL will use integer math if all operands
involved in an expression are integers.
+Otherwise, Druid will switch to floating point math. You can force this to
happen by casting one of your operands
+to FLOAT. At runtime, Druid may widen 32-bit floats to 64-bit for certain
operators, like SUM aggregators.
+
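The operand-type rule above can be illustrated in Python, with ints and floats standing in for Druid SQL types (a hedged sketch; `druid_div` is a hypothetical helper, not a Druid function):

```python
# All-integer operands use integer math; "casting" one operand to a float
# switches the whole expression to floating point.

def druid_div(a, b):
    if isinstance(a, int) and isinstance(b, int):
        return a // b  # integer math
    return a / b       # floating point math

integer_result = druid_div(7, 2)        # integer division
float_result = druid_div(float(7), 2)   # forced by casting an operand
```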
+Druid [multi-value string dimensions](multi-value-dimensions.html) will appear
in the table schema as `VARCHAR` typed,
+and may be interacted with in expressions as such. Additionally, they can be
treated as `ARRAY`-like via a handful of
+special multi-value operators. Expressions against multi-value string
dimensions will apply the expression to all values
+of the row; however, the caveat is that aggregations on these multi-value
string columns will observe the native Druid
+multi-value aggregation behavior, which is equivalent to the `UNNEST` operator
available in many SQL dialects.
+Refer to the documentation on [multi-value string
dimensions](multi-value-dimensions.html) and
+[Druid expressions documentation](../misc/math-expr.html) for additional
details.
+
+The following table describes how SQL types map onto Druid types during query
runtime. Casts between two SQL types
+that have the same Druid runtime type will have no effect, other than
exceptions noted in the table. Casts between two
+SQL types that have different Druid runtime types will generate a runtime cast
in Druid. If a value cannot be properly
+cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will
substitute a default value. NULL values cast
+to non-nullable types will also be substituted with a default value (for
example, nulls cast to numbers will be
+converted to zeroes).
+
+|SQL type|Druid runtime type|Default value|Notes|
+|--------|------------------|-------------|-----|
+|CHAR|STRING|`''`||
+|VARCHAR|STRING|`''`|Druid STRING columns are reported as VARCHAR|
+|DECIMAL|DOUBLE|`0.0`|DECIMAL uses floating point, not fixed point math|
+|FLOAT|FLOAT|`0.0`|Druid FLOAT columns are reported as FLOAT|
+|REAL|DOUBLE|`0.0`||
+|DOUBLE|DOUBLE|`0.0`|Druid DOUBLE columns are reported as DOUBLE|
+|BOOLEAN|LONG|`false`||
+|TINYINT|LONG|`0`||
+|SMALLINT|LONG|`0`||
+|INTEGER|LONG|`0`||
+|BIGINT|LONG|`0`|Druid LONG columns (except `__time`) are reported as BIGINT|
+|TIMESTAMP|LONG|`0`, meaning 1970-01-01 00:00:00 UTC|Druid's `__time` column
is reported as TIMESTAMP. Casts between string and timestamp types assume
standard SQL formatting, e.g. `2000-01-02 03:04:05`, _not_ ISO8601 formatting.
For handling other formats, use one of the [time functions](#time-functions)|
+|DATE|LONG|`0`, meaning 1970-01-01|Casting TIMESTAMP to DATE rounds down the
timestamp to the nearest day. Casts between string and date types assume
standard SQL formatting, e.g. `2000-01-02`. For handling other formats, use one
of the [time functions](#time-functions)|
+|OTHER|COMPLEX|none|May represent various Druid column types such as
hyperUnique, approxHistogram, etc|
+
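The default-value substitution described above can be sketched in Python (an assumed helper for illustration, not a Druid API; only three of the table's types are modeled):

```python
# A failed cast (e.g. CAST('foo' AS BIGINT)) or a NULL cast to a non-nullable
# type yields the type's default value from the table above.

DEFAULTS = {"BIGINT": 0, "DOUBLE": 0.0, "VARCHAR": ""}

def cast_with_default(value, sql_type):
    if value is None:  # NULL cast to a non-nullable type -> default value
        return DEFAULTS[sql_type]
    try:
        if sql_type == "BIGINT":
            return int(value)
        if sql_type == "DOUBLE":
            return float(value)
        if sql_type == "VARCHAR":
            return str(value)
    except (TypeError, ValueError):
        pass
    return DEFAULTS[sql_type]  # value could not be properly cast
```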
## Query syntax
Each Druid datasource appears as a table in the "druid" schema. This is also
the default schema, so Druid datasources
@@ -269,6 +324,27 @@ simplest way to write literal timestamps in other time
zones is to use TIME_PARS
|`x OR y`|Boolean OR.|
|`NOT x`|Boolean NOT.|
+### Multi-value string functions
+All 'array' references in the multi-value string function documentation can
refer to multi-value string columns or
+`ARRAY` literals.
+
+|Function|Notes|
+|--------|-----|
+| `ARRAY(expr1,expr ...)` | constructs an SQL ARRAY literal from the
expression arguments, using the type of the first argument as the output array
type |
+| `MV_LENGTH(arr)` | returns length of array expression |
+| `MV_OFFSET(arr,long)` | returns the array element at the 0 based index
supplied, or null for an out of range index|
+| `MV_ORDINAL(arr,long)` | returns the array element at the 1 based index
supplied, or null for an out of range index |
+| `MV_CONTAINS(arr,expr)` | returns 1 if the array contains the element
specified by expr, or contains all elements specified by expr if expr is an
array, else 0 |
+| `MV_OVERLAP(arr1,arr2)` | returns 1 if arr1 and arr2 have any elements in
common, else 0 |
+| `MV_OFFSET_OF(arr,expr)` | returns the 0 based index of the first occurrence
of expr in the array, or `-1` (or `null` if
`druid.generic.useDefaultValueForNull=false`) if no matching elements exist in
the array. |
+| `MV_ORDINAL_OF(arr,expr)` | returns the 1 based index of the first
occurrence of expr in the array, or `-1` (or `null` if
`druid.generic.useDefaultValueForNull=false`) if no matching elements exist in
the array. |
+| `MV_PREPEND(expr,arr)` | adds expr to arr at the beginning, the resulting
array type determined by the type of the array |
+| `MV_APPEND(arr1,expr)` | appends expr to arr, the resulting array type
determined by the type of the first array |
+| `MV_CONCAT(arr1,arr2)` | concatenates 2 arrays, the resulting array type
determined by the type of the first array |
+| `MV_SLICE(arr,start,end)` | returns the subarray of arr from the 0 based
index start (inclusive) to end (exclusive), or `null` if start is less than 0,
greater than the length of arr, or less than end |
+| `MV_TO_STRING(arr,str)` | joins all elements of arr by the delimiter
specified by str |
+| `STRING_TO_MV(str1,str2)` | splits str1 into an array on the delimiter
specified by str2 |
+
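The 1/0 return convention of `MV_CONTAINS` and `MV_OVERLAP` can be modeled with Python stand-ins (illustrative only, not the Druid implementations):

```python
# MV_CONTAINS: 1 if arr contains the element, or all elements when expr is an
# array, else 0. MV_OVERLAP: 1 if the two arrays share any element, else 0.

def mv_contains(arr, expr):
    if isinstance(expr, list):
        return 1 if all(e in arr for e in expr) else 0
    return 1 if expr in arr else 0

def mv_overlap(arr1, arr2):
    return 1 if set(arr1) & set(arr2) else 0
```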
### Other functions
|Function|Notes|
@@ -280,6 +356,7 @@ simplest way to write literal timestamps in other time
zones is to use TIME_PARS
|`COALESCE(value1, value2, ...)`|Returns the first value that is neither NULL
nor empty string.|
|`NVL(expr,expr-for-null)`|Returns 'expr-for-null' if 'expr' is null (or empty
string for string type).|
|`BLOOM_FILTER_TEST(<expr>, <serialized-filter>)`|Returns true if the value is
contained in the base64 serialized bloom filter. See [bloom filter
extension](../development/extensions-core/bloom-filter.html) documentation for
additional details.|
+
### Unsupported features
Druid does not support all SQL features, including:
@@ -291,57 +368,10 @@ Druid does not support all SQL features, including:
Additionally, some Druid features are not supported by the SQL language. Some
unsupported Druid features include:
-- [Multi-value dimensions](multi-value-dimensions.html).
-- [DataSketches
aggregators](../development/extensions-core/datasketches-extension.html).
+- [Set operations on DataSketches
aggregators](../development/extensions-core/datasketches-extension.html).
- [Spatial filters](../development/geo.html).
- [Query cancellation](querying.html#query-cancellation).
-## Data types and casts
-
-Druid natively supports five basic column types: "long" (64 bit signed int),
"float" (32 bit float), "double" (64 bit
-float) "string" (UTF-8 encoded strings), and "complex" (catch-all for more
exotic data types like hyperUnique and
-approxHistogram columns).
-
-Timestamps (including the `__time` column) are treated by Druid as longs, with
the value being the number of
-milliseconds since 1970-01-01 00:00:00 UTC, not counting leap seconds.
Therefore, timestamps in Druid do not carry any
-timezone information, but only carry information about the exact moment in
time they represent. See the
-[Time functions](#time-functions) section for more information about timestamp
handling.
-
-Druid generally treats NULLs and empty strings interchangeably, rather than
according to the SQL standard. As such,
-Druid SQL only has partial support for NULLs. For example, the expressions
`col IS NULL` and `col = ''` are equivalent,
-and both will evaluate to true if `col` contains an empty string. Similarly,
the expression `COALESCE(col1, col2)` will
-return `col2` if `col1` is an empty string. While the `COUNT(*)` aggregator
counts all rows, the `COUNT(expr)`
-aggregator will count the number of rows where expr is neither null nor the
empty string. String columns in Druid are
-NULLable. Numeric columns are NOT NULL; if you query a numeric column that is
not present in all segments of your Druid
-datasource, then it will be treated as zero for rows from those segments.
-
-For mathematical operations, Druid SQL will use integer math if all operands
involved in an expression are integers.
-Otherwise, Druid will switch to floating point math. You can force this to
happen by casting one of your operands
-to FLOAT. At runtime, Druid may widen 32-bit floats to 64-bit for certain
operators, like SUM aggregators.
-
-The following table describes how SQL types map onto Druid types during query
runtime. Casts between two SQL types
-that have the same Druid runtime type will have no effect, other than
exceptions noted in the table. Casts between two
-SQL types that have different Druid runtime types will generate a runtime cast
in Druid. If a value cannot be properly
-cast to another value, as in `CAST('foo' AS BIGINT)`, the runtime will
substitute a default value. NULL values cast
-to non-nullable types will also be substitued with a default value (for
example, nulls cast to numbers will be
-converted to zeroes).
-
-|SQL type|Druid runtime type|Default value|Notes|
-|--------|------------------|-------------|-----|
-|CHAR|STRING|`''`||
-|VARCHAR|STRING|`''`|Druid STRING columns are reported as VARCHAR|
-|DECIMAL|DOUBLE|`0.0`|DECIMAL uses floating point, not fixed point math|
-|FLOAT|FLOAT|`0.0`|Druid FLOAT columns are reported as FLOAT|
-|REAL|DOUBLE|`0.0`||
-|DOUBLE|DOUBLE|`0.0`|Druid DOUBLE columns are reported as DOUBLE|
-|BOOLEAN|LONG|`false`||
-|TINYINT|LONG|`0`||
-|SMALLINT|LONG|`0`||
-|INTEGER|LONG|`0`||
-|BIGINT|LONG|`0`|Druid LONG columns (except `__time`) are reported as BIGINT|
-|TIMESTAMP|LONG|`0`, meaning 1970-01-01 00:00:00 UTC|Druid's `__time` column
is reported as TIMESTAMP. Casts between string and timestamp types assume
standard SQL formatting, e.g. `2000-01-02 03:04:05`, _not_ ISO8601 formatting.
For handling other formats, use one of the [time functions](#time-functions)|
-|DATE|LONG|`0`, meaning 1970-01-01|Casting TIMESTAMP to DATE rounds down the
timestamp to the nearest day. Casts between string and date types assume
standard SQL formatting, e.g. `2000-01-02`. For handling other formats, use one
of the [time functions](#time-functions)|
-|OTHER|COMPLEX|none|May represent various Druid column types such as
hyperUnique, approxHistogram, etc|
## Query execution