jtuglu1 commented on code in PR #18953: URL: https://github.com/apache/druid/pull/18953#discussion_r2734702058
##########
docs/ingestion/input-sources.md:
##########
@@ -1063,6 +1063,7 @@ The following is a sample spec for a S3 warehouse source:
 |icebergCatalog|The JSON Object used to define the catalog that manages the configured Iceberg table.|yes|
 |warehouseSource|The JSON Object that defines the native input source for reading the data files from the warehouse.|yes|
 |snapshotTime|Timestamp in ISO8601 DateTime format that will be used to fetch the most recent snapshot as of this time.|no|
+|residualFilterMode|Controls how residual filters are handled when filtering on non-partition columns. When an Iceberg filter targets a non-partition column, files may contain rows that don't match the filter (residual rows). Valid values are: `ignore` (default, ingest all rows), `warn` (log a warning but continue), `fail` (fail the ingestion job). Use `fail` to ensure filters only target partition columns.|no|

Review Comment:
   Sure – I think this is already clear in the iceberg.md changes:
   ```
   When an Iceberg filter is applied on a non-partition column, the filtering happens at the file metadata level only (using column statistics). Files that might contain matching rows are returned, but these files may include "residual" rows that don't actually match the filter. These residual rows would be ingested unless filtered by a `transformSpec` filter on the Druid side.

   To control this behavior, you can set the `residualFilterMode` property on the Iceberg input source:
   ```
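
For readers following the thread, here is a rough sketch of how the `residualFilterMode` property described above might appear in an Iceberg input source. Only `residualFilterMode` and its values (`ignore`, `warn`, `fail`) come from the diff in this comment. The surrounding fields follow the Iceberg input source layout as I understand it from the existing Druid docs, and the table name, namespace, warehouse path, and filter column are hypothetical placeholders.

```json
{
  "inputSource": {
    "type": "iceberg",
    "tableName": "example_table",
    "namespace": "example_namespace",
    "icebergCatalog": {
      "type": "local",
      "warehousePath": "/tmp/example_warehouse"
    },
    "icebergFilter": {
      "type": "equals",
      "filterColumn": "country",
      "filterValue": "US"
    },
    "warehouseSource": {
      "type": "local"
    },
    "residualFilterMode": "fail"
  }
}
```

In this sketch, if `country` is not a partition column, `fail` would stop the ingestion job rather than ingest residual rows. With `warn` the job would log a warning and continue, and with `ignore` (the default) all rows from the matched files would be ingested unless a `transformSpec` filter removes them on the Druid side.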
