kfaraz commented on code in PR #13429:
URL: https://github.com/apache/druid/pull/13429#discussion_r1030998295
########## docs/next-release-notes.md:
##########

@@ -0,0 +1,496 @@
+# New features
+
+## Updated Kafka support
+
+Updated the Apache Kafka core dependency to version 3.3.1.
+
+https://github.com/apache/druid/pull/13176
+
+## Query engine
+
+### BIG_SUM SQL function
+
+Added SQL function `BIG_SUM` that uses the [Compressed Big Decimal](https://github.com/apache/druid/pull/10705) Druid extension.
+
+https://github.com/apache/druid/pull/13102
+
+### Added Compressed Big Decimal min and max functions
+
+Added min and max functions for Compressed Big Decimal and exposed these functions via SQL: BIG_MIN and BIG_MAX.
+
+https://github.com/apache/druid/pull/13141
+
+### Metrics used to downsample bucket
+
+Changed how the MSQ task engine determines whether to downsample data, improving accuracy. The task engine now uses the number of bytes instead of the number of keys.
+
+https://github.com/apache/druid/pull/12998
+
+### MSQ heap footprint
+
+When determining partition boundaries, the heap footprint of the sketches that MSQ uses is capped at 10% of available memory or 300 MB, whichever is lower. Previously, the cap was strictly 300 MB.
+
+https://github.com/apache/druid/pull/13274
+
+### MSQ Docker improvement
+
+Enabled the MSQ task query engine for Docker by default.
+
+https://github.com/apache/druid/pull/13069
+
+### Improved MSQ warnings
+
+For disallowed MSQ warnings of certain types, the warning is now surfaced as the error.
+
+https://github.com/apache/druid/pull/13198
+
+### Added support for indexSpec
+
+The MSQ task engine now supports the `indexSpec` context parameter. This context parameter can also be configured through the web console.
+
+https://github.com/apache/druid/pull/13275
+
+### Added task start status to the worker report
+
+Added `pendingTasks` and `runningTasks` fields to the worker report for the MSQ task engine.
+See [Query task status information](#query-task-status-information) for related web console changes.
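As a usage sketch, the new fields could be read from a worker report like the following. The report shape and field nesting here are assumptions for illustration only; consult the MSQ task report API for the exact payload:

```python
# Hypothetical worker report; the real MSQ report schema may nest these fields differently.
report = {
    "workers": {
        "0": {"pendingTasks": 2, "runningTasks": 3},
        "1": {"pendingTasks": 0, "runningTasks": 4},
    }
}

def total_task_counts(report):
    """Sum the pendingTasks and runningTasks counts across all workers."""
    pending = sum(w["pendingTasks"] for w in report["workers"].values())
    running = sum(w["runningTasks"] for w in report["workers"].values())
    return pending, running

print(total_task_counts(report))  # (2, 7)
```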
+
+https://github.com/apache/druid/pull/13263
+
+### Improved handling of secrets
+
+When MSQ submits tasks containing SQL with sensitive keys, the keys can get logged in the log files.
+Druid now masks the sensitive keys in the log files using regular expressions.
+
+https://github.com/apache/druid/pull/13231
+
+### Use worker number to communicate between tasks
+
+Changed the way WorkerClient communicates between worker tasks, to abstract away from the callers the complexity of resolving the `workerNumber` to the `taskId`.
+Once the WorkerClient writes its outputs to durable storage, it adds a `__success` file containing its `taskId` to the `workerNumber` output directory for that stage. This allows you to identify the worker that successfully wrote its outputs to durable storage, and to distinguish those outputs from partial outputs written by orphaned or failed worker tasks.
+
+https://github.com/apache/druid/pull/13062
+
+### Sketch merging mode
+
+When a query requires key statistics to generate partition boundaries, key statistics are gathered by the workers while reading rows from the datasource. You can now configure whether the MSQ task engine does this work in parallel or sequentially. Configure the behavior using the `clusterStatisticsMergeMode` context parameter. For more information, see [Sketch merging mode](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#sketch-merging-mode).
+
+https://github.com/apache/druid/pull/13205
+
+## Querying
+
+### Improvements to querying user experience
+
+This release includes several improvements for querying:
+
+* Exposed HTTP response headers for SQL queries (https://github.com/apache/druid/pull/13052)
+* Added the `shouldFinalize` feature for HLL and quantiles sketches. Druid no longer finalizes aggregators when:
+  - aggregators appear in the outer level of a query
+  - aggregators are used as input to an expression or finalizing-field-access post-aggregator
+
+  To provide backwards compatibility, we added a `sqlFinalizeOuterSketches` query context parameter that restores the old behavior (https://github.com/apache/druid/pull/13247)
+
+### Enabled async reads for JDBC
+
+Prevented JDBC timeouts on long queries by returning empty batches when a batch fetch takes too long. JDBC now uses an async model to run the result fetch concurrently with JDBC requests.
+
+https://github.com/apache/druid/pull/13196
+
+### Enabled composite approach for checking in-filter values set in column dictionary
+
+To accommodate large value sets arising from large in-filters, or from joins pushed down as in-filters, Druid now uses a sorted-merge algorithm for merging the set and dictionary for larger values.
+
+https://github.com/apache/druid/pull/13133
+
+### Added new configuration keys to query context security model
+
+Added the following configuration keys that refine the query context security model controlled by `druid.auth.authorizeQueryContextParams`:
+* `druid.auth.unsecuredContextKeys`: The set of query context keys that do not require a security check.
+* `druid.auth.securedContextKeys`: The set of query context keys that do require a security check.
+
+## Nested columns
+
+### Support for more formats
+
+Druid nested columns and the associated JSON transform functions now support Avro, ORC, and Parquet.
+
+https://github.com/apache/druid/pull/13325
+
+https://github.com/apache/druid/pull/13375
+
+### Refactored a data source before unnest
+
+When data requires "flattening" during processing, the operator now takes in an array and flattens it into N rows (where N is the number of elements in the array), with each row containing one of the values from the array.
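The flattening described above can be sketched in Python. This is a conceptual illustration of the behavior only, not Druid's actual implementation; the function name and row shape are invented for the example:

```python
def unnest(rows, column):
    """Flatten an array-valued column into N rows, one per array element."""
    out = []
    for row in rows:
        for value in row[column]:
            flat = dict(row)      # copy the other columns unchanged
            flat[column] = value  # replace the array with a single element
            out.append(flat)
    return out

rows = [{"id": 1, "tags": ["a", "b", "c"]}]
print(unnest(rows, "tags"))
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 1, 'tags': 'c'}]
```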
+
+https://github.com/apache/druid/pull/13085
+
+## Ingestion
+
+### Improved filtering for cloud objects
+
+You can now stop at arbitrary subfolders using glob syntax in the `ioConfig.inputSource.filter` field for native batch ingestion from cloud storage, such as S3.
+
+https://github.com/apache/druid/pull/13027
+
+### CLUSTERED BY limit
+
+When using the MSQ task engine to ingest data, the CLUSTERED BY clause now accepts at most 1,500 columns.
+
+https://github.com/apache/druid/pull/13352
+
+### Async task client for streaming ingestion
+
+You can now use asynchronous communication with indexing tasks by setting `chatAsync` to true in the `tuningConfig`. Enabling asynchronous communication means that the `chatThreads` property is ignored.
+
+https://github.com/apache/druid/pull/13354
+
+### Improved control for how Druid reads JSON data for streaming ingestion
+
+You can now better control how Druid reads JSON data for streaming ingestion by setting the following fields in the input format specification:
+
+* `assumedNewlineDelimited` to parse lines of JSON independently.
+* `useJsonNodeReader` to retain valid JSON events when a parsing exception occurs while parsing multi-line JSON events.
+
+The web console has been updated to include these options.
+
+https://github.com/apache/druid/pull/13089
+
+### Kafka Consumer improvement
+
+Allowed the Kafka Consumer's custom deserializer to be configured after its instantiation.
+
+https://github.com/apache/druid/pull/13097
+
+### Kafka supervisor logging
+
+Kafka supervisor logs are now less noisy. The supervisors now log events at the DEBUG level instead of INFO.
+
+https://github.com/apache/druid/pull/13392
+
+### Fixed Overlord leader election
+
+Fixed a problem where Overlord leader election failed due to lock reacquisition issues. Druid now fails the tasks that cannot reacquire their locks and clears all locks so that Overlord leader election isn't blocked.
+
+https://github.com/apache/druid/pull/13172
+
+### Support for inline protobuf descriptor
+
+Added a new `inline` type of `protoBytesDecoder` that allows you to pass the contents of a Protobuf descriptor file inline, encoded as a Base64 string.
+
+https://github.com/apache/druid/pull/13192
+
+### Duplicate notices
+
+For streaming ingestion, notices that are identical to one already in the queue are no longer enqueued. This helps reduce the notice queue size.
+
+https://github.com/apache/druid/pull/13334
+
+### When a Kafka stream becomes inactive, prevent the supervisor from creating new indexing tasks
+
+Added an idle feature to `SeekableStreamSupervisor` for inactive streams.
+
+https://github.com/apache/druid/pull/13144
+
+### Sampling from stream input now respects the configured timeout
+
+Fixed a problem where sampling from a stream input, such as Kafka or Kinesis, failed to respect the configured timeout when the stream had no records available. You can now set the maximum amount of time in which the entry iterator will return results.
+
+https://github.com/apache/druid/pull/13296
+
+### Streaming tasks resume on Overlord switch
+
+Fixed a problem where streaming ingestion tasks continued to run until their duration elapsed after the Overlord leader had issued a pause to the tasks. Now, when the Overlord switch occurs right after it has issued a pause to a task, the task remains in a paused state even after the Overlord re-election.
+
+https://github.com/apache/druid/pull/13223
+
+### Fixed Parquet list conversion
+
+Fixed an issue with Parquet list conversion, where lists of complex objects could unexpectedly be wrapped in an extra object, appearing as `[{"element":<actual_list_element>},{"element":<another_one>}...]` instead of the direct list. This changes the behavior of the Parquet reader for lists of structured objects to be consistent with other Parquet logical list conversions. The data is now fetched directly, more closely matching its expected structure.
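Conceptually, the fix unwraps the extra `element` object shown above, as in this Python sketch. This is illustrative only; the unwrapping happens inside Druid's Parquet reader, not in user code:

```python
def unwrap_parquet_list(wrapped):
    """Turn [{'element': x}, {'element': y}, ...] into the direct list [x, y, ...]."""
    return [item["element"] for item in wrapped]

wrapped = [{"element": {"city": "SF"}}, {"element": {"city": "NY"}}]
print(unwrap_parquet_list(wrapped))  # [{'city': 'SF'}, {'city': 'NY'}]
```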
+
+https://github.com/apache/druid/pull/13294
+
+### Introduced a tree type to flattenSpec
+
+Introduced a `tree` type to `flattenSpec`. When only a simple hierarchical lookup is required, the `tree` type allows for faster JSON parsing than the `jq` and `path` parsing types.
+
+https://github.com/apache/druid/pull/12177
+
+## Operations
+
+### Compaction
+
+Compaction behavior has changed to reduce the time and disk space it takes:
+
+- When segments need to be fetched, Druid downloads them one at a time and deletes each one when it is done with it. This still takes time but minimizes the required disk space.
+- Druid doesn't fetch segments on the main compaction task when they aren't needed. If the user provides a full `granularitySpec`, `dimensionsSpec`, and `metricsSpec`, Druid skips fetching segments.
+
+For more information, see the documentation on [Compaction](https://druid.apache.org/docs/latest/data-management/compaction.html) and [Automatic compaction](https://druid.apache.org/docs/latest/data-management/automatic-compaction.html).
+
+https://github.com/apache/druid/pull/13280
+
+### New metric for segments
+
+`segment/handoff/time` captures the total time taken for handoff for a given set of published segments.
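As one possible use of the new metric, emitted events could feed a simple latency check like this sketch. The event dict shape and the threshold are assumptions for illustration; only the metric name `segment/handoff/time` comes from this release:

```python
def slow_handoffs(events, threshold_ms=600_000):
    """Return the metric events whose segment/handoff/time value exceeds the threshold."""
    return [
        e for e in events
        if e.get("metric") == "segment/handoff/time" and e.get("value", 0) > threshold_ms
    ]

events = [
    {"metric": "segment/handoff/time", "value": 720_000},
    {"metric": "segment/handoff/time", "value": 90_000},
]
print(slow_handoffs(events))  # [{'metric': 'segment/handoff/time', 'value': 720000}]
```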
+
+https://github.com/apache/druid/pull/13238
+
+### Idle configs for the Supervisor
+
+You can now configure the following properties:
+
+| Property | Description | Default |
+|---|---|---|
+|`druid.supervisor.idleConfig.enabled`| (Cluster-wide) If `true`, a supervisor can become idle if there is no data on the input stream/topic for some time.|`false`|
+|`druid.supervisor.idleConfig.inactiveAfterMillis`| (Cluster-wide) A supervisor is marked as idle if all existing data has been read from the input topic and no new data has been published for `inactiveAfterMillis` milliseconds.|`600_000`|
+|`inactiveAfterMillis`| (Individual supervisor) A supervisor is marked as idle if all existing data has been read from the input topic and no new data has been published for `inactiveAfterMillis` milliseconds.|`600_000`|
+
+https://github.com/apache/druid/pull/13311
+
+### cachingCost balancer strategy

Review Comment:
   This too could be an item inside `segment loading and balancing improvements`

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org