ektravel commented on code in PR #16412: URL: https://github.com/apache/druid/pull/16412#discussion_r1607176096
##########
docs/release-info/release-notes.md:
##########
@@ -57,50 +57,741 @@ For tips about how to write a good release note, see [Release notes](https://git

This section contains important information about new and existing features.

### Improved native queries

Native queries can now group on nested columns and arrays.
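As an illustration, a native groupBy query can now use an array-typed column directly as a grouping dimension. The following is a minimal sketch rather than an example from the PR; the datasource `exampleData` and the `tags` column (an `ARRAY<STRING>`) are hypothetical:

```json
{
  "queryType": "groupBy",
  "dataSource": "exampleData",
  "intervals": ["2024-01-01/2024-02-01"],
  "granularity": "all",
  "dimensions": [
    { "type": "default", "dimension": "tags", "outputType": "ARRAY<STRING>" }
  ],
  "aggregations": [
    { "type": "count", "name": "rowCount" }
  ]
}
```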
[#16068](https://github.com/apache/druid/pull/16068)

Before realtime segments are pushed to deep storage, they consist of spill files.
Segment metrics such as `query/segment/time` are now reported per spill file for a realtime segment, rather than for the entire segment.
This change eliminates the need to materialize results on the heap, which improves the performance of groupBy queries.

[#15757](https://github.com/apache/druid/pull/15757)

### Concurrent append and replace improvements

Improved concurrent replace to work with supervisors using concurrent locks.

[#15995](https://github.com/apache/druid/pull/15995)

You can now grant locks with different types (EXCLUSIVE, SHARED, APPEND, REPLACE) for the same interval within a task group, ensuring a transition to a newer set of tasks without failure.
Previously, changing lock types in the supervisor could lead to segment allocation errors caused by lock conflicts between the new tasks and older tasks that were still running.

[#16369](https://github.com/apache/druid/pull/16369)

### Improved AND filter performance

Druid query processing now adaptively determines when the children of AND filters should compute indexes and when to simply match rows during the scan, based on the selectivity of the other filters.
Known as filter partitioning, this can result in dramatic performance increases, depending on the order of filters in the query.

For example, take a query like `SELECT SUM(longColumn) FROM druid.table WHERE stringColumn1 = '1000' AND stringColumn2 LIKE '%1%'`. Previously, Druid used indexes when processing filters if they were available.
That's not always ideal; imagine that `stringColumn1 = '1000'` matches 100 rows. With indexes, Druid has to find every value for which `stringColumn2 LIKE '%1%'` is true in order to compute the index for that filter. If `stringColumn2` has more than 100 values, this ends up being more work than simply checking the 100 remaining rows for a match.

With the new logic, Druid checks the selectivity of indexes as it processes each clause of the AND filter.
If it determines that computing the index would take more work than matching the remaining rows, Druid skips computing the index.

The order in which you write filters in the WHERE clause of a query can therefore improve its performance.
More improvements are coming, but you can try out the existing ones by reordering a query.
Put indexes that are less intensive to compute, such as `IS NULL`, `=`, and comparisons (`>`, `>=`, `<`, and `<=`), near the start of AND filters so that Druid processes your queries more efficiently.
Not ordering your filters this way won't degrade performance relative to previous releases, since the fallback behavior is what Druid did before.
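To make that guidance concrete, here is a sketch of the recommended ordering, reusing the columns from the example above plus a hypothetical `longColumn2` comparison:

```sql
-- Filters whose indexes are cheap to compute (equality, comparisons, IS NULL)
-- go first; the expensive LIKE clause goes last.
SELECT SUM(longColumn)
FROM "druid"."table"
WHERE stringColumn1 = '1000'
  AND longColumn2 >= 10
  AND stringColumn2 LIKE '%1%'
```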
[#15838](https://github.com/apache/druid/pull/15838)

### Centralized datasource schema (alpha)

You can now configure Druid to manage datasource schemas centrally on the Coordinator.
Previously, Brokers needed to query data nodes and tasks for segment schemas.
Centralizing datasource schemas can improve startup time for Brokers and the efficiency of your deployment.

If enabled, the following changes occur:

- Realtime segment schema changes are periodically pushed to the Coordinator.
- Tasks publish segment schemas and metadata to the metadata store.
- The Coordinator polls the schema and segment metadata to build datasource schemas.
- Brokers fetch datasource schemas from the Coordinator when possible. Otherwise, the Broker builds the schema itself using the existing mechanism of querying Historical services.

This behavior is currently opt-in. To enable this feature, set the following configs, as shown in the sketch after this list:

- In your common runtime properties, set `druid.centralizedDatasourceSchema.enabled` to true.
- If you are using MiddleManagers, also set `druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled` to true in your MiddleManager runtime properties.

You can return to the previous behavior by setting the configs back to false.
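A minimal sketch of the two settings, assuming the standard Druid configuration file layout (adjust paths to your deployment):

```properties
# common.runtime.properties
druid.centralizedDatasourceSchema.enabled=true

# middleManager/runtime.properties (only needed when using MiddleManagers)
druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled=true
```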
+ + + +[#16234](https://github.com/apache/druid/pull/16234) + #### Other web console improvements -### Ingestion +* Added the fields **Avro bytes decoder** and **Proto bytes decoder** for their input formats [#15950](https://github.com/apache/druid/pull/15950) +* Added support for exporting results for queries that use the MSQ task engine [#15969](https://github.com/apache/druid/pull/15969) +* Fixed an issue with the [Tasks](https://druid.apache.org/docs/latest/operations/web-console#tasks) view returning incorrect values for **Created time** and **Duration** fields after the Overlord restarts [#16228](https://github.com/apache/druid/pull/16228) +* Fixed the Azure icon not rendering in the web console [#16173](https://github.com/apache/druid/pull/16173) +* Fixed the supervisor offset reset dialog in the web console [#16298](https://github.com/apache/druid/pull/16298) +* Improved the user experience when the web console is operating in manual capabilities mode [#16191](https://github.com/apache/druid/pull/16191) +* Improved the web console to detect doubles better [#15998](https://github.com/apache/druid/pull/15998) +* Improved the query timer as follows: + * Timer isn't shown if an error happens + * Timer resets if changing tabs while query is running + * Error state is lost if tab is switched twice + + [#16235](https://github.com/apache/druid/pull/16235) +* The web console now suggests the `azureStorage` input type instead of the deprecated `azure` storage type [#15820](https://github.com/apache/druid/pull/15820) +* The download query detail archive option is now more resilient when the detail archive is incomplete [#16071](https://github.com/apache/druid/pull/16071) +* You can now set `maxCompactionTaskSlots` to zero to stop compaction tasks [#15877](https://github.com/apache/druid/pull/15877) + +### General ingestion + +#### Improved Azure input source + +You can now ingest data from multiple storage accounts using the new `azureStorage` input source schema instead of the now deprecated `azure` input source schema. For example: + +```json +... + "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azureStorage", + "objectGlob": "**.json", + "uris": ["azureStorage://storageAccount/container/prefix1/file.json", "azureStorage://storageAccount/container/prefix2/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` -#### SQL-based ingestion +[#15630](https://github.com/apache/druid/pull/15630) -##### Other SQL-based ingestion improvements +#### Data management API improvements -#### Streaming ingestion +Improved the [Data management API](https://druid.apache.org/docs/latest/api-reference/data-management-api) as follows: -##### Other streaming ingestion improvements +* Fixed a bug in the `markUsed` and `markUnused` APIs where an empty set of segment IDs would be inconsistently treated as null or non-null in different scenarios [#16145](https://github.com/apache/druid/pull/16145) +* Improved the `markUnused` API endpoint to handle an empty list of segment versions [#16198](https://github.com/apache/druid/pull/16198) +* The `segmentIds` filter in the Data management API payload is now parameterized in the database query [#16174](https://github.com/apache/druid/pull/16174) +* You can now mark segments as used or unused within the specified interval using an optional list of versions. +For example: `(interval, [versions])`. 
[#16051](https://github.com/apache/druid/pull/16051)

### RabbitMQ extension

A new RabbitMQ extension is available as a community contribution.
The RabbitMQ extension (`druid-rabbit-indexing-service`) lets you manage the creation and lifetime of RabbitMQ indexing tasks. These indexing tasks read events from [RabbitMQ](https://www.rabbitmq.com) through [super streams](https://www.rabbitmq.com/docs/streams#super-streams).

Because super streams allow exactly-once delivery with full support for partitioning, they are compatible with Druid's modern ingestion algorithm, without the downsides of the prior RabbitMQ firehose.

Note that this extension uses the RabbitMQ streams feature rather than a conventional exchange. Make sure that your messages are in a super stream before consumption. For more information, see the [RabbitMQ documentation](https://www.rabbitmq.com/docs).

[#14137](https://github.com/apache/druid/pull/14137)

## Functional area and related changes

This section contains detailed release notes separated by functional area.

### Web console

#### Improved the Supervisors view

You can now use the **Supervisors** view to dynamically query supervisors and display additional information in newly added columns.

[#16318](https://github.com/apache/druid/pull/16318)

#### Search in tables and columns

You can now use the **Query** view to search in tables and columns.

[#15990](https://github.com/apache/druid/pull/15990)

#### Kafka input format

Improved how the web console determines the input format for a Kafka source.
Instead of defaulting to the Kafka input format for a Kafka source, the web console now only picks the Kafka input format if it detects any of the following in the Kafka sample: a key, headers, or more than one topic.

[#16180](https://github.com/apache/druid/pull/16180)

#### Improved handling of lookups during sampling

Rather than sending a transform expression containing lookups to the sampler, Druid now substitutes the transform expression with a placeholder.
This prevents the expression from blocking the sampling flow.

[#16234](https://github.com/apache/druid/pull/16234)

#### Other web console improvements

* Added the fields **Avro bytes decoder** and **Proto bytes decoder** for their input formats [#15950](https://github.com/apache/druid/pull/15950)
* Added support for exporting results for queries that use the MSQ task engine [#15969](https://github.com/apache/druid/pull/15969)
* Fixed an issue with the [Tasks](https://druid.apache.org/docs/latest/operations/web-console#tasks) view returning incorrect values for the **Created time** and **Duration** fields after the Overlord restarts [#16228](https://github.com/apache/druid/pull/16228)
* Fixed the Azure icon not rendering in the web console [#16173](https://github.com/apache/druid/pull/16173)
* Fixed the supervisor offset reset dialog in the web console [#16298](https://github.com/apache/druid/pull/16298)
* Improved the user experience when the web console is operating in manual capabilities mode [#16191](https://github.com/apache/druid/pull/16191)
* Improved the web console to detect doubles better [#15998](https://github.com/apache/druid/pull/15998)
* Fixed several issues with the query timer: the timer is no longer shown if an error happens, the timer resets when changing tabs while a query is running, and the error state is no longer lost if the tab is switched twice [#16235](https://github.com/apache/druid/pull/16235)
* The web console now suggests the `azureStorage` input type instead of the deprecated `azure` storage type [#15820](https://github.com/apache/druid/pull/15820)
* The download query detail archive option is now more resilient when the detail archive is incomplete [#16071](https://github.com/apache/druid/pull/16071)
* You can now set `maxCompactionTaskSlots` to zero to stop compaction tasks [#15877](https://github.com/apache/druid/pull/15877)

### General ingestion

#### Improved Azure input source

You can now ingest data from multiple storage accounts using the new `azureStorage` input source schema instead of the now-deprecated `azure` input source schema. For example:

```json
...
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": {
      "type": "azureStorage",
      "objectGlob": "**.json",
      "uris": ["azureStorage://storageAccount/container/prefix1/file.json", "azureStorage://storageAccount/container/prefix2/file2.json"]
    },
    "inputFormat": {
      "type": "json"
    },
    ...
  },
...
```

[#15630](https://github.com/apache/druid/pull/15630)

#### Data management API improvements

Improved the [Data management API](https://druid.apache.org/docs/latest/api-reference/data-management-api) as follows:

* Fixed a bug in the `markUsed` and `markUnused` APIs where an empty set of segment IDs was inconsistently treated as null or non-null in different scenarios [#16145](https://github.com/apache/druid/pull/16145)
* Improved the `markUnused` API endpoint to handle an empty list of segment versions [#16198](https://github.com/apache/druid/pull/16198)
* The `segmentIds` filter in the Data management API payload is now parameterized in the database query [#16174](https://github.com/apache/druid/pull/16174)
* You can now mark segments as used or unused within the specified interval using an optional list of versions, for example `(interval, [versions])`. When `versions` is unspecified, all versions of segments in the `interval` are marked as used or unused, preserving the old behavior; see the payload sketch after this list [#16141](https://github.com/apache/druid/pull/16141)
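As an illustration, a request to the documented `POST /druid/coordinator/v1/datasources/{datasourceName}/markUnused` endpoint that targets only specific versions might carry a payload like this minimal sketch. The interval and version string are hypothetical, and the `versions` field is the one added by #16141:

```json
{
  "interval": "2024-01-01/2024-02-01",
  "versions": ["2024-01-05T00:00:00.000Z"]
}
```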
#### Nested columns performance improvement

Nested column serialization now releases nested field compression buffers as soon as the nested field serialization is complete, which requires significantly less direct memory during segment serialization when many nested fields are present.

[#16076](https://github.com/apache/druid/pull/16076)

#### Segment allocation

Druid now associates pending segments with the task groups that created them.

Associating pending segments with task groups facilitates cleanup of unneeded segments as soon as all tasks in the group exit.
Cleaning up pending segments deletes entries immediately after tasks exit and can alleviate load on the metadata store during segment allocation.
In some cases, this can also help with segment allocation failures caused by conflicting pending segments that are no longer needed.

The change ensures that an append action upgrades exactly the segment set that corresponds to the pending segment upgrades made by the concurrent replace action, eliminating any duplication that might otherwise occur in query results.

[#16144](https://github.com/apache/druid/pull/16144)

Review Comment:
   Updated.