This is an automated email from the ASF dual-hosted git repository.

317brian pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git


The following commit(s) were added to refs/heads/master by this push:
     new fb83544df87 docs: add caution re: schema auto discovery (#19403)
fb83544df87 is described below

commit fb83544df877746dd336fcb60f6937277750ebb9
Author: 317brian <[email protected]>
AuthorDate: Thu May 14 10:40:20 2026 -0700

    docs: add caution re: schema auto discovery (#19403)
    
    Co-authored-by: Jill Osborne <[email protected]>
    Co-authored-by: Charles Smith <[email protected]>
---
 docs/ingestion/data-formats.md   | 20 +++++++++++++++-----
 docs/ingestion/ingestion-spec.md |  8 ++++++--
 docs/ingestion/schema-design.md  |  9 +++++----
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/docs/ingestion/data-formats.md b/docs/ingestion/data-formats.md
index 9d833b4ef5a..6a2aedc3976 100644
--- a/docs/ingestion/data-formats.md
+++ b/docs/ingestion/data-formats.md
@@ -564,7 +564,7 @@ For example:
 ### Kafka
 
 The `kafka` input format lets you parse the Kafka metadata fields in addition to the Kafka payload value contents.
-It should only be used when ingesting from Apache Kafka.
+It should only be used when ingesting from Apache Kafka. 
 
 The `kafka` input format wraps around the payload parsing input format and augments the data it outputs with the Kafka event timestamp, topic name, event headers, and the key field that itself can be parsed using any available input format.
 
@@ -583,6 +583,8 @@ Configure the Kafka `inputFormat` as follows:
 | `headerFormat` | Object | Specifies how to parse the Kafka headers. Supports String types. Because Kafka header values are bytes, the parser decodes them as UTF-8 encoded strings. To change this behavior, implement your own parser based on the encoding style. Change the `encoding` type in `KafkaStringHeaderFormat` to match your custom implementation. See [Header format](#header-format) for supported encoding formats.| no ||
 | `keyFormat` | [InputFormat](#input-format) | The [input format](#input-format) to parse the Kafka key. It only processes the first entry of the `inputFormat` field. If your key values are simple strings, you can use the `tsv` format to parse them. Note that for `tsv`,`csv`, and `regex` formats, you need to provide a `columns` array to make a valid input format. Only the first one is used, and its name will be ignored in favor of `keyColumnName`. | no ||
 | `keyColumnName` | String | The name of the column for the Kafka key.| no |`kafka.key`|
+| `partitionColumnName` | String | The name of the column for the Kafka partition number. | no | `kafka.partition` |
+| `offsetColumnName` | String | The name of the column for the Kafka record offset. Ingesting this column enables filtering by offset in `transformSpec`, which is useful for recovering data from a specific offset range. | no | `kafka.offset` |
 
 #### Header format
 
@@ -604,6 +606,8 @@ For example, consider the following structure for a Kafka message that represent
 
 - **Kafka timestamp**: `1680795276351`
 - **Kafka topic**: `wiki-edits`
+- **Kafka partition**: `0`
+- **Kafka offset**: `12345`
 - **Kafka headers**:
   - `env=development`
   - `zone=z1`
@@ -632,6 +636,8 @@ You would configure it as follows:
       "columns": ["x"]
     },
     "keyColumnName": "kafka.key",
+    "partitionColumnName": "kafka.partition",
+    "offsetColumnName": "kafka.offset"
   }
 }
 ```
@@ -649,7 +655,9 @@ You would parse the example message as follows:
   "kafka.topic": "wiki-edits",
   "kafka.header.env": "development",
   "kafka.header.zone": "z1",
-  "kafka.key": "wiki-edit"
+  "kafka.key": "wiki-edit",
+  "kafka.partition": 0,
+  "kafka.offset": 12345
 }
 ```
 
@@ -734,6 +742,8 @@ After Druid ingests the data, you can query the Kafka metadata columns as follow
 SELECT
   "kafka.header.env",
   "kafka.key",
+  "kafka.partition",
+  "kafka.offset",
   "kafka.timestamp",
   "kafka.topic"
 FROM "wikiticker"
@@ -741,9 +751,9 @@ FROM "wikiticker"
 
 This query returns:
 
-| `kafka.header.env` | `kafka.key` | `kafka.timestamp` | `kafka.topic` |
-|--------------------|-----------|---------------|---------------|
-| `development`      | `wiki-edit` | `1680795276351` | `wiki-edits`  |
+| `kafka.header.env` | `kafka.key` | `kafka.partition` | `kafka.offset` | `kafka.timestamp` | `kafka.topic` |
+|--------------------|-----------|-------------------|----------------|---------------|---------------|
+| `development`      |`wiki-edit`|`0`|`12345`| `1680795276351`| `wiki-edits`  |
 
 ### Kinesis
 
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index d1f901cc953..ad0dc31bb8a 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -188,9 +188,13 @@ Treat `__time` as a millisecond timestamp: the number of milliseconds since Jan
 The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is responsible for
 configuring [dimensions](./schema-model.md#dimensions).
 
-You can either manually specify the dimensions or take advantage of schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type. 
+You can either manually specify the dimensions or take advantage of type-aware schema auto-discovery where you allow Druid to infer all or some of the schema for your data. This means that you don't have to explicitly specify your dimensions and their type. 
 
-To use schema auto-discovery, set `useSchemaDiscovery` to `true`. 
+:::caution
+When using type-aware schema auto-discovery, Druid discovers the type for all dimensions unless you use the `dimensionExclusions` field to explicitly specify dimensions to ignore. This helps you control storage costs by preventing Druid from unintentionally ingesting dimensions.
+:::
+
+To use type-aware schema auto-discovery, set `useSchemaDiscovery` to `true`. 
 
 Alternatively, you can use the string-based schemaless ingestion where any discovered dimensions are treated as strings. To do so, leave `useSchemaDiscovery` set to `false` (default). Then, set the dimensions list to empty or set the  `includeAllDimensions` property to `true`.
 
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index 1bee014dade..89b66a4e4d3 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -249,12 +249,13 @@ Druid can infer the schema for your data in one of two ways:
 
 #### Type-aware schema discovery
 
-:::info
- Note that using type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.
-:::
-
 You can have Druid infer the schema and types for your data partially or fully by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or no dimensions in the dimensions list. 
 
+Before you use type-aware schema discovery, keep the following in mind:
+
+- There may be an impact on downstream BI tools depending on how they handle ARRAY-typed columns.
+- Be aware of all the potential dimensions. Druid discovers all available dimensions unless you specify an exclusion list. Without an exclusion list, you may ingest more columns than you intend. For example, if you use type-aware schema discovery and the Kafka input format, Druid discovers dimensions like the Kafka offset and partition unless you add them to the exclusion list.
+
 When performing type-aware schema discovery, Druid can discover all the columns of your input data (that are not present in
 the exclusion list). Druid automatically chooses the most appropriate native Druid type among `STRING`, `LONG`,
 `DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` for nested data. For input formats with
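
Illustration (not part of the commit): the caution added above says that type-aware discovery picks up the Kafka partition and offset columns unless they are excluded. A minimal sketch of a `dimensionsSpec` that excludes them might look like the following. The helper name `build_dimensions_spec` is hypothetical, and the column names assume the `kafka.partition` and `kafka.offset` defaults shown in the data-formats.md hunk. Adjust them if you set `partitionColumnName` or `offsetColumnName`.

```python
import json

# Default Kafka metadata column names, as listed in the data-formats.md hunk.
# Assumption: the defaults were not overridden in the inputFormat.
KAFKA_METADATA_COLUMNS = ["kafka.partition", "kafka.offset"]


def build_dimensions_spec(exclusions):
    """Hypothetical helper: build a dimensionsSpec for type-aware discovery.

    Leaves the dimensions list empty so Druid discovers every column that
    is not named in dimensionExclusions.
    """
    return {
        "useSchemaDiscovery": True,
        "dimensions": [],
        # De-duplicate and sort so the generated spec is stable across runs.
        "dimensionExclusions": sorted(set(exclusions)),
    }


spec = build_dimensions_spec(KAFKA_METADATA_COLUMNS)
print(json.dumps({"dimensionsSpec": spec}, indent=2))
```

The printed object is meant to be placed under `dataSchema` in an ingestion spec. Any other column you do not want stored would go in the same exclusion list.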

