This is an automated email from the ASF dual-hosted git repository.
cwylie pushed a commit to branch 25.0.0
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/25.0.0 by this push:
new 41257019b5 Update nested columns docs (#13424)
41257019b5 is described below
commit 41257019b599ead100034729126d76e003326778
Author: Jill Osborne <[email protected]>
AuthorDate: Tue Nov 29 21:58:18 2022 +0000
Update nested columns docs (#13424)
* Update nested columns docs
* Update nested-columns.md
---
docs/ingestion/data-formats.md | 4 ++--
docs/ingestion/schema-design.md | 9 ++++-----
docs/querying/nested-columns.md | 20 ++++++++++----------
3 files changed, 16 insertions(+), 17 deletions(-)
diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index eb08df0cf7..557060a5e6 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -606,9 +606,9 @@ For example:
### FlattenSpec
-The `flattenSpec` object bridges the gap between potentially nested input data, such as Avro or ORC, and Druid's flat data model. It is an object within the `inputFormat` object.
+You can use the `flattenSpec` object to flatten nested data, as an alternative to the Druid [nested columns](../querying/nested-columns.md) feature, and for nested input formats unsupported by the feature. It is an object within the `inputFormat` object.
-> If you have nested JSON data, you can ingest and store JSON in an Apache Druid column as a `COMPLEX<json>` data type. See [Nested columns](../querying/nested-columns.md) for more information.
+See [Nested columns](../querying/nested-columns.md) for information on ingesting and storing nested data in an Apache Druid column as a `COMPLEX<json>` data type.
Configure your `flattenSpec` as follows:
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index 10e6ea82cd..f006e792bc 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -116,14 +116,13 @@ naturally emitted. It is also useful if you want to combine timeseries and non-t
Similar to log aggregation systems, Druid offers inverted indexes for fast searching and filtering. Druid's search
capabilities are generally less developed than these systems, and its analytical capabilities are generally more
developed. The main data modeling differences between Druid and these systems are that when ingesting data into Druid,
-you must be more explicit. Druid columns have types specific upfront and Druid does not, at this time, natively support
-nested data.
+you must be more explicit. Druid columns have types specific upfront.
Tips for modeling log data in Druid:
* If you don't know ahead of time what columns you'll want to ingest, use an empty dimensions list to trigger
[automatic detection of dimension columns](#schema-less-dimensions).
-* If you have nested data, flatten it using a [`flattenSpec`](./ingestion-spec.md#flattenspec).
+* If you have nested data, you can ingest it using the [nested columns](../querying/nested-columns.md) feature or flatten it using a [`flattenSpec`](./ingestion-spec.md#flattenspec).
* Consider enabling [rollup](./rollup.md) if you have mainly analytical use cases for your log data. This will
mean you lose the ability to retrieve individual events from Druid, but you potentially gain substantial compression and
query performance boosts.
@@ -198,9 +197,9 @@ like `MILLIS_TO_TIMESTAMP`, `TIME_FLOOR`, and others. If you're using native Dru
### Nested dimensions
-You can ingest and store nested JSON in a Druid column as a `COMPLEX<json>` data type. See [Nested columns](../querying/nested-columns.md) for more information.
+You can ingest and store nested data in a Druid column as a `COMPLEX<json>` data type. See [Nested columns](../querying/nested-columns.md) for more information.
-If you want to ingest nested data in a format other than JSON—for example Avro, ORC, and Parquet—you must use the `flattenSpec` object to flatten it. For example, if you have data of the following form:
+If you want to ingest nested data in a format unsupported by the nested columns feature, you must use the `flattenSpec` object to flatten it. For example, if you have data of the following form:
```json
{ "foo": { "bar": 3 } }
diff --git a/docs/querying/nested-columns.md b/docs/querying/nested-columns.md
index e8dc628c8f..77af91ddff 100644
--- a/docs/querying/nested-columns.md
+++ b/docs/querying/nested-columns.md
@@ -23,17 +23,17 @@ sidebar_label: Nested columns
~ under the License.
-->
-> Nested columns is an experimental feature available starting in Apache Druid 24.0. Like most experimental features, functionality documented on this page is subject to change in future releases. However, the COMPLEX column type includes versioning to provide backward compatible support in future releases. We strongly encourage you to experiment with nested columns in your development environment to evaluate that they meet your use case. If so, you can use them in production scenarios. [...]
-
Apache Druid supports directly storing nested data structures in `COMPLEX<json>` columns. `COMPLEX<json>` columns store a copy of the structured data in JSON format and specialized internal columns and indexes for nested literal values—STRING, LONG, and DOUBLE types. An optimized [virtual column](./virtual-columns.md#nested-field-virtual-column) allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.
Druid [SQL JSON functions](./sql-json-functions.md) allow you to extract, transform, and create `COMPLEX<json>` values in SQL queries, using the specialized virtual columns where appropriate. You can use the [JSON nested columns functions](../misc/math-expr.md#json-functions) in [native queries](./querying.md) using [expression virtual columns](./virtual-columns.md#expression-virtual-column), and in native ingestion with a [`transformSpec`](../ingestion/ingestion-spec.md#transformspec).
You can use the JSON functions in INSERT and REPLACE statements in SQL-based ingestion, or in a `transformSpec` in native ingestion as an alternative to using a [`flattenSpec`](../ingestion/data-formats.md#flattenspec) object to "flatten" nested data for ingestion.
+Druid supports directly ingesting nested data with the following formats: JSON, Parquet, Avro, ORC.
+
## Example nested data
-The examples in this topic use the data in [`nested_example_data.json`](https://static.imply.io/data/nested_example_data.json). The file contains a simple facsimile of an order tracking and shipping table.
+The examples in this topic use the JSON data in [`nested_example_data.json`](https://static.imply.io/data/nested_example_data.json). The file contains a simple facsimile of an order tracking and shipping table.
When pretty-printed, a sample row in `nested_example_data` looks like this:
@@ -63,7 +63,7 @@ When pretty-printed, a sample row in `nested_example_data` looks like this:
## Native batch ingestion
-For native batch ingestion, you can use the [JSON nested columns functions](./sql-json-functions.md) to extract nested data as an alternative to using the [`flattenSpec`](../ingestion/data-formats.md#flattenspec) input format.
+For native batch ingestion, you can use the [SQL JSON functions](./sql-json-functions.md) to extract nested data as an alternative to using the [`flattenSpec`](../ingestion/data-formats.md#flattenspec) input format.
To configure a dimension as a nested data type, specify the `json` type for the dimension in the `dimensions` list in the `dimensionsSpec` property of your ingestion spec.
@@ -124,7 +124,7 @@ For example, the following ingestion spec instructs Druid to ingest `shipTo` and
### Transform data during batch ingestion
-You can use the [JSON nested columns functions](./sql-json-functions.md) to transform JSON data and reference the transformed data in your ingestion spec.
+You can use the [SQL JSON functions](./sql-json-functions.md) to transform nested data and reference the transformed data in your ingestion spec.
To do this, define the output name and expression in the `transforms` list in the `transformSpec` object of your ingestion spec.
@@ -192,7 +192,7 @@ For example, the following ingestion spec extracts `firstName`, `lastName` and `
## SQL-based ingestion
-To ingest nested data using multi-stage query architecture, specify `COMPLEX<json>` as the value for `type` when you define the row signature—`shipTo` and `details` in the following example ingestion spec:
+To ingest nested data using SQL-based ingestion, specify `COMPLEX<json>` as the value for `type` when you define the row signature—`shipTo` and `details` in the following example ingestion spec:

@@ -297,7 +297,7 @@ The [Kafka tutorial](../tutorials/tutorial-kafka.md) guides you through the step
### Transform data during SQL-based ingestion
-You can use the [JSON nested columns functions](./sql-json-functions.md) to transform JSON data in your ingestion query.
+You can use the [SQL JSON functions](./sql-json-functions.md) to transform nested data in your ingestion query.
For example, the following ingestion query is the SQL-based version of the [previous batch example](#transform-data-during-batch-ingestion)—it extracts `firstName`, `lastName`, and `address` from `shipTo` and creates a composite JSON object containing `product`, `details`, and `department`.
@@ -326,7 +326,7 @@ PARTITIONED BY ALL
## Ingest a JSON string as COMPLEX<json\>
-If your source data uses a string representation of your JSON column, you can still ingest the data as `COMPLEX<JSON>` as follows:
+If your source data contains serialized JSON strings, you can ingest the data as `COMPLEX<JSON>` as follows:
- During native batch ingestion, call the `parse_json` function in a `transform` object in the `transformSpec`.
- During SQL-based ingestion, use the PARSE_JSON keyword within your SELECT statement to transform the string values to JSON.
- If you are concerned that your data may not contain valid JSON, you can use `try_parse_json` for native batch or `TRY_PARSE_JSON` for SQL-based ingestion. For cases where the column does not contain valid JSON, Druid inserts a null value.
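To illustrate the native batch bullets in this hunk, here is a minimal Python sketch that assembles the two relevant ingestion-spec fragments: a `dimensionsSpec` entry with type `json`, and a `transformSpec` expression that calls `parse_json` or `try_parse_json`. The helper name `nested_column_spec` and the column `rawPayload` are hypothetical, and the fragment layout follows the nested columns docs as best understood here, so check it against the Druid version you run.

```python
import json


def nested_column_spec(json_columns, string_json_columns=(), lenient=False):
    """Build dimensionsSpec/transformSpec fragments for nested columns.

    json_columns: input fields that already hold nested objects.
    string_json_columns: input fields that hold serialized JSON strings
        (hypothetical names; they need a parse_json transform).
    lenient: use try_parse_json so invalid JSON becomes null instead of failing.
    """
    parse_fn = "try_parse_json" if lenient else "parse_json"
    all_columns = [*json_columns, *string_json_columns]
    # Every nested column is declared with the `json` dimension type.
    dimensions = [{"type": "json", "name": name} for name in all_columns]
    # Only string-encoded columns need an expression transform.
    transforms = [
        {
            "type": "expression",
            "name": name,
            "expression": f'{parse_fn}("{name}")',
        }
        for name in string_json_columns
    ]
    return {
        "dimensionsSpec": {"dimensions": dimensions},
        "transformSpec": {"transforms": transforms},
    }


spec = nested_column_spec(
    ["shipTo", "details"],
    string_json_columns=["rawPayload"],  # hypothetical string-encoded column
    lenient=True,
)
print(json.dumps(spec, indent=2))
```

The printed fragments would be merged into the `dataSchema` of a full ingestion spec; the rest of the spec (input source, timestamp, tuning) is out of scope for this sketch.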
@@ -563,7 +563,7 @@ In addition to `JSON_VALUE`, Druid offers a number of operators that focus on tr
- `PARSE_JSON`
- `TO_JSON_STRING`
-These functions are primarily intended for use with the multi-stage query architecture to transform data during insert operations, but they also work in traditional Druid SQL queries. Because most of these functions output JSON objects, they have the same limitations when used in traditional Druid queries as interacting with the JSON objects directly.
+These functions are primarily intended for use with SQL-based ingestion to transform data during insert operations, but they also work in traditional Druid SQL queries. Because most of these functions output JSON objects, they have the same limitations when used in traditional Druid queries as interacting with the JSON objects directly.
#### Example query: Return results in a JSON object
@@ -663,7 +663,7 @@ Before you start using the nested columns feature, consider the following known
- Directly using `COMPLEX<json>` columns and expressions is not well integrated into the Druid query engine. It can result in errors or undefined behavior when grouping and filtering, and when you use `COMPLEX<json>` objects as inputs to aggregators. As a workaround, consider using `TO_JSON_STRING` to coerce the values to strings before you perform these operations.
- Directly using array-typed outputs from `JSON_KEYS` and `JSON_PATHS` is moderately supported by the Druid query engine. You can group on these outputs, and there are a number of array expressions that can operate on these values, such as `ARRAY_CONCAT_AGG`. However, some operations are not well defined for use outside array-specific functions, such as filtering using `=` or `IS NULL`.
- Input validation for JSON SQL operators is currently incomplete, which sometimes results in undefined behavior or unhelpful error messages.
-- Ingesting JSON columns with a very complex nested structure is potentially an expensive operation and may require you to tune ingestion tasks and/or cluster parameters to account for increased memory usage or overall task run time. When you tune your ingestion configuration, treat each nested literal field inside a JSON object as a flattened top-level Druid column.
+- Ingesting data with a very complex nested structure is potentially an expensive operation and may require you to tune ingestion tasks and/or cluster parameters to account for increased memory usage or overall task run time. When you tune your ingestion configuration, treat each nested literal field inside an object as a flattened top-level Druid column.
## Further reading