This is an automated email from the ASF dual-hosted git repository.
317brian pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new fb83544df87 docs: add caution re: schema auto discovery (#19403)
fb83544df87 is described below
commit fb83544df877746dd336fcb60f6937277750ebb9
Author: 317brian <[email protected]>
AuthorDate: Thu May 14 10:40:20 2026 -0700
docs: add caution re: schema auto discovery (#19403)
Co-authored-by: Jill Osborne <[email protected]>
Co-authored-by: Charles Smith <[email protected]>
---
docs/ingestion/data-formats.md | 20 +++++++++++++++-----
docs/ingestion/ingestion-spec.md | 8 ++++++--
docs/ingestion/schema-design.md | 9 +++++----
3 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index 9d833b4ef5a..6a2aedc3976 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -564,7 +564,7 @@ For example:
### Kafka
The `kafka` input format lets you parse the Kafka metadata fields in addition
to the Kafka payload value contents.
-It should only be used when ingesting from Apache Kafka.
+It should only be used when ingesting from Apache Kafka.
The `kafka` input format wraps around the payload parsing input format and
augments the data it outputs with the Kafka event timestamp, topic name, event
headers, and the key field that itself can be parsed using any available input
format.
@@ -583,6 +583,8 @@ Configure the Kafka `inputFormat` as follows:
| `headerFormat` | Object | Specifies how to parse the Kafka headers. Supports String types. Because Kafka header values are bytes, the parser decodes them as UTF-8 encoded strings. To change this behavior, implement your own parser based on the encoding style. Change the `encoding` type in `KafkaStringHeaderFormat` to match your custom implementation. See [Header format](#header-format) for supported encoding formats.| no ||
| `keyFormat` | [InputFormat](#input-format) | The [input format](#input-format) to parse the Kafka key. It only processes the first entry of the `inputFormat` field. If your key values are simple strings, you can use the `tsv` format to parse them. Note that for `tsv`,`csv`, and `regex` formats, you need to provide a `columns` array to make a valid input format. Only the first one is used, and its name will be ignored in favor of `keyColumnName`. | no ||
| `keyColumnName` | String | The name of the column for the Kafka key.| no |`kafka.key`|
+| `partitionColumnName` | String | The name of the column for the Kafka partition number. | no | `kafka.partition` |
+| `offsetColumnName` | String | The name of the column for the Kafka record offset. Ingesting this column enables filtering by offset in `transformSpec`, which is useful for recovering data from a specific offset range. | no | `kafka.offset` |
#### Header format
@@ -604,6 +606,8 @@ For example, consider the following structure for a Kafka message that represent
- **Kafka timestamp**: `1680795276351`
- **Kafka topic**: `wiki-edits`
+- **Kafka partition**: `0`
+- **Kafka offset**: `12345`
- **Kafka headers**:
- `env=development`
- `zone=z1`
@@ -632,6 +636,8 @@ You would configure it as follows:
"columns": ["x"]
},
"keyColumnName": "kafka.key",
+ "partitionColumnName": "kafka.partition",
+ "offsetColumnName": "kafka.offset"
}
}
```
@@ -649,7 +655,9 @@ You would parse the example message as follows:
"kafka.topic": "wiki-edits",
"kafka.header.env": "development",
"kafka.header.zone": "z1",
- "kafka.key": "wiki-edit"
+ "kafka.key": "wiki-edit",
+ "kafka.partition": 0,
+ "kafka.offset": 12345
}
```
@@ -734,6 +742,8 @@ After Druid ingests the data, you can query the Kafka metadata columns as follow
SELECT
"kafka.header.env",
"kafka.key",
+ "kafka.partition",
+ "kafka.offset",
"kafka.timestamp",
"kafka.topic"
FROM "wikiticker"
@@ -741,9 +751,9 @@ FROM "wikiticker"
This query returns:
-| `kafka.header.env` | `kafka.key` | `kafka.timestamp` | `kafka.topic` |
-|--------------------|-----------|---------------|---------------|
-| `development` | `wiki-edit` | `1680795276351` | `wiki-edits` |
+| `kafka.header.env` | `kafka.key` | `kafka.partition` | `kafka.offset` | `kafka.timestamp` | `kafka.topic` |
+|--------------------|-----------|-------------------|----------------|---------------|---------------|
+| `development` |`wiki-edit`|`0`|`12345`| `1680795276351`| `wiki-edits` |
### Kinesis
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index d1f901cc953..ad0dc31bb8a 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -188,9 +188,13 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is
responsible for
configuring [dimensions](./schema-model.md#dimensions).
-You can either manually specify the dimensions or take advantage of schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type.
+You can either manually specify the dimensions or take advantage of type-aware schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type.
-To use schema auto-discovery, set `useSchemaDiscovery` to `true`.
+:::caution
+When using type-aware schema auto-discovery, Druid discovers the type for all dimensions unless you use the `dimensionExclusions` field to explicitly specify dimensions to ignore. This helps you control storage costs by preventing Druid from unintentionally ingesting dimensions.
+:::
+
+To use type-aware schema auto-discovery, set `useSchemaDiscovery` to `true`.
Alternatively, you can use the string-based schemaless ingestion where any
discovered dimensions are treated as strings. To do so, leave
`useSchemaDiscovery` set to `false` (default). Then, set the dimensions list to
empty or set the `includeAllDimensions` property to `true`.
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index 1bee014dade..89b66a4e4d3 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -249,12 +249,13 @@ Druid can infer the schema for your data in one of two ways:
#### Type-aware schema discovery
-:::info
- Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
-:::
-
You can have Druid infer the schema and types for your data partially or fully
by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or
no dimensions in the dimensions list.
+Before you use type-aware schema discovery, keep the following in mind:
+
+- There may be an impact on downstream BI tools depending on how they handle ARRAY-typed columns.
+- Be aware of all the potential dimensions. Druid discovers all available dimensions unless you specify an exclusion list. Without an exclusion list, you may ingest more columns than you intend. For example, if you use type-aware schema discovery and the Kafka input format, Druid discovers dimensions like the Kafka offset and partition unless you add them to the exclusion list.
+
When performing type-aware schema discovery, Druid can discover all the
columns of your input data (that are not present in
the exclusion list). Druid automatically chooses the most appropriate native
Druid type among `STRING`, `LONG`,
`DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>`
for nested data. For input formats with
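
The exclusion list described in the schema-design.md change can be sketched as a `dimensionsSpec` fragment. This is a minimal illustration and not part of the commit. It assumes the default Kafka column names `kafka.partition` and `kafka.offset` from the data-formats.md table and an empty `dimensions` list, so Druid would discover every other column:

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensions": [],
    "dimensionExclusions": ["kafka.partition", "kafka.offset"]
  }
}
```

If `partitionColumnName` or `offsetColumnName` is set to a custom value in the `kafka` input format, the exclusion list would presumably need those custom names instead.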