cryptoe commented on code in PR #16412: URL: https://github.com/apache/druid/pull/16412#discussion_r1615511423
########## docs/release-info/release-notes.md: ########## @@ -57,50 +57,680 @@ For tips about how to write a good release note, see [Release notes](https://git This section contains important information about new and existing features. +### Improved native queries + +Native queries can now group on supported complex columns and nested arrays. + +[#16068](https://github.com/apache/druid/pull/16068) + +Before realtime segments are pushed to deep storage, they consist of spill files. +Segment metrics such as `query/segment/time` now report per spill file for a realtime segment, rather than for the entire segment. +This change eliminates the need to materialize results on the heap, which improves the performance of groupBy queries. + +[#15757](https://github.com/apache/druid/pull/15757) + +### Concurrent append and replace improvements + +Streaming ingestion supervisors now support concurrent append, that is, streaming tasks can run concurrently with a replace task (compaction or re-indexing) if the replace task also uses concurrent locks. Set the context parameter `useConcurrentLocks` to true to enable concurrent append. + +Once you update the supervisor to have `"useConcurrentLocks": true`, the transition to concurrent append happens seamlessly without causing any ingestion lag or task failures. + +[#16369](https://github.com/apache/druid/pull/16369) + +Druid now performs active cleanup of stale pending segments by tracking the set of tasks using such pending segments. +This allows concurrent append and replace to upgrade only a minimal set of pending segments, which improves performance and eliminates errors. +Additionally, it reduces load on the metadata store. + +[#16144](https://github.com/apache/druid/pull/16144) + +### Improved AND filter performance + +Druid query processing now adaptively determines when the children of an AND filter should compute indexes and when to simply match rows during the scan, based on the selectivity of the other filters. 
+Known as filter partitioning, this can result in dramatic performance increases, depending on the order of the filters in the query. + +For example, take a query like `SELECT SUM(longColumn) FROM druid.table WHERE stringColumn1 = '1000' AND stringColumn2 LIKE '%1%'`. Previously, Druid used indexes when processing filters if they were available. +That's not always ideal; imagine if `stringColumn1 = '1000'` matches only 100 rows. With indexes, Druid has to find every value of `stringColumn2` for which `LIKE '%1%'` is true in order to compute the indexes for that filter. If `stringColumn2` has more than 100 distinct values, this ends up being more work than simply checking for a match in those 100 remaining rows. + +With the new logic, Druid now checks the selectivity of indexes as it processes each clause of the AND filter. +If it determines that computing the index would take more work than matching the remaining rows, Druid skips computing the index. + +The order in which you write filters in the WHERE clause of a query can therefore affect its performance. +More improvements are coming, but you can try out the existing improvements by reordering the filters in a query. +Put filters whose indexes are less intensive to compute, such as `IS NULL`, `=`, and comparisons (`>`, `>=`, `<`, and `<=`), near the start of AND filters so that Druid processes your queries more efficiently. +Not ordering your filters in this way won't degrade performance relative to previous releases, since the fallback behavior is what Druid did previously. + +[#15838](https://github.com/apache/druid/pull/15838) + +### Centralized datasource schema (alpha) + +You can now configure Druid to manage datasource schemas centrally on the Coordinator. +Previously, Brokers needed to query data nodes and tasks for segment schemas. +Centralizing datasource schemas can improve startup time for Brokers and the efficiency of your deployment. + +To enable this feature, set the following configs: + +- In your common runtime properties, set `druid.centralizedDatasourceSchema.enabled` to true. 
+- If you are using MiddleManagers, you also need to set `druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled` to true in your MiddleManager runtime properties. + +[#15817](https://github.com/apache/druid/pull/15817) + +### MSQ support for window functions + +You can now run window functions in the MSQ task engine by setting the context flag `enableWindowing` to `true`. + +In the native engine, you must use a GROUP BY clause to enable window functions. This requirement is removed in the MSQ task engine. + +[#15470](https://github.com/apache/druid/pull/15470) +[#16229](https://github.com/apache/druid/pull/16229) + +### MSQ support for Google Cloud Storage + +You can now export MSQ results to a Google Cloud Storage (GCS) path by passing the function `google()` as an argument to the `EXTERN` function. + +[#16051](https://github.com/apache/druid/pull/16051) + +### RabbitMQ extension + +A new RabbitMQ extension is available as a community contribution. +The RabbitMQ extension (`druid-rabbit-indexing-service`) lets you manage the creation and lifetime of RabbitMQ indexing tasks. These indexing tasks read events from [RabbitMQ](https://www.rabbitmq.com) through [super streams](https://www.rabbitmq.com/docs/streams#super-streams). + +Because super streams allow exactly-once delivery with full support for partitioning, they are compatible with Druid's modern ingestion algorithm, without the downsides of the prior RabbitMQ firehose. + +Note that this extension uses the RabbitMQ streams feature and not a conventional exchange. You need to make sure that your messages are in a super stream before consumption. For more information, see the [RabbitMQ documentation](https://www.rabbitmq.com/docs). + +[#14137](https://github.com/apache/druid/pull/14137) + ## Functional area and related changes This section contains detailed release notes separated by areas. 
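To make the GCS export feature described above more concrete, here is a hedged sketch of an export query. The bucket, prefix, table, and column names are hypothetical, and the `bucket`/`prefix` parameter names are assumed to mirror the existing S3 export syntax; check the Druid SQL export documentation for your version before relying on them:

```sql
-- Hypothetical sketch: export query results to a GCS path via EXTERN(google(...)).
-- 'my-export-bucket' and 'druid/exports' are placeholders.
INSERT INTO
  EXTERN(google(bucket => 'my-export-bucket', prefix => 'druid/exports'))
AS CSV
SELECT __time, channel, page
FROM wikipedia
```

As with S3 exports, the query runs on the MSQ task engine and writes one or more result files (plus, per the manifest change below in these notes, a manifest listing them) under the given prefix.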
### Web console +#### Improved the Supervisors view + +You can now use the **Supervisors** view to dynamically query supervisors and display additional information in newly added columns. + + + +[#16318](https://github.com/apache/druid/pull/16318) + +#### Search in tables and columns + +You can now use the **Query** view to search in tables and columns. + + + +[#15990](https://github.com/apache/druid/pull/15990) + +#### Kafka input format + +Improved how the web console determines the input format for a Kafka source. +Instead of defaulting to the Kafka input format for a Kafka source, the web console now only picks the Kafka input format if it detects any of the following in the Kafka sample: a key, headers, or more than one topic. + +[#16180](https://github.com/apache/druid/pull/16180) + +#### Improved handling of lookups during sampling + +Rather than sending a transform expression containing lookups to the sampler, Druid now substitutes the transform expression with a placeholder. +This prevents lookup loading from blocking the sampling flow. 
+ + + +[#16234](https://github.com/apache/druid/pull/16234) + #### Other web console improvements -### Ingestion +* Added the fields **Avro bytes decoder** and **Proto bytes decoder** for the corresponding input formats [#15950](https://github.com/apache/druid/pull/15950) +* Fixed an issue with the [Tasks](https://druid.apache.org/docs/latest/operations/web-console#tasks) view returning incorrect values for the **Created time** and **Duration** fields after the Overlord restarts [#16228](https://github.com/apache/druid/pull/16228) +* Fixed the Azure icon not rendering in the web console [#16173](https://github.com/apache/druid/pull/16173) +* Fixed the supervisor offset reset dialog in the web console [#16298](https://github.com/apache/druid/pull/16298) +* Improved the user experience when the web console is operating in manual capabilities mode [#16191](https://github.com/apache/druid/pull/16191) +* Improved the query timer as follows: + * The timer is no longer shown if an error occurs + * The timer now resets when you change tabs while a query is running + * Fixed an issue where the error state was lost if you switched tabs twice + + [#16235](https://github.com/apache/druid/pull/16235) +* The web console now suggests the `azureStorage` input type instead of the deprecated `azure` storage type [#15820](https://github.com/apache/druid/pull/15820) +* The download query detail archive option is now more resilient when the detail archive is incomplete [#16071](https://github.com/apache/druid/pull/16071) +* You can now set `maxCompactionTaskSlots` to zero to stop compaction tasks [#15877](https://github.com/apache/druid/pull/15877) + +### General ingestion + +#### Improved Azure input source + +You can now ingest data from multiple storage accounts using the new `azureStorage` input source schema instead of the now-deprecated `azure` input source schema. For example: + +```json +... 
+ "ioConfig": { + "type": "index_parallel", + "inputSource": { + "type": "azureStorage", + "objectGlob": "**.json", + "uris": ["azureStorage://storageAccount/container/prefix1/file.json", "azureStorage://storageAccount/container/prefix2/file2.json"] + }, + "inputFormat": { + "type": "json" + }, + ... + }, +... +``` + +[#15630](https://github.com/apache/druid/pull/15630) + +#### Data management API improvements + +Improved the [Data management API](https://druid.apache.org/docs/latest/api-reference/data-management-api) as follows: -#### SQL-based ingestion +* Fixed a bug in the `markUsed` and `markUnused` APIs where an empty set of segment IDs would be inconsistently treated as null or non-null in different scenarios [#16145](https://github.com/apache/druid/pull/16145) +* Improved the `markUnused` API endpoint to handle an empty list of segment versions [#16198](https://github.com/apache/druid/pull/16198) +* The `segmentIds` filter in the Data management API payload is now parameterized in the database query [#16174](https://github.com/apache/druid/pull/16174) +* You can now mark segments as used or unused within the specified interval using an optional list of versions. +For example: `(interval, [versions])`. When `versions` is unspecified, all versions of segments in the `interval` are marked as used or unused, preserving the old behavior [#16141](https://github.com/apache/druid/pull/16141) -##### Other SQL-based ingestion improvements +#### Nested columns performance improvement -#### Streaming ingestion +Nested column serialization now releases nested field compression buffers as soon as the nested field serialization is complete, which requires significantly less direct memory during segment serialization when many nested fields are present. -##### Other streaming ingestion improvements +[#16076](https://github.com/apache/druid/pull/16076) + +#### Improved task context reporting + +Added a new field `taskContext` in the task reports of non-MSQ tasks. 
The change is backward compatible. The payload of this field contains the entire context used by the task during its runtime. + +Added a new experimental interface `TaskContextEnricher` to enrich the context with use-case-specific logic. + +[#16041](https://github.com/apache/druid/pull/16041) + +#### Other ingestion improvements + +* Added indexer-level task metrics to provide more visibility into task distribution [#15991](https://github.com/apache/druid/pull/15991) +* Added more logging detail for the S3 `RetryableS3OutputStream`—this can help to determine whether to adjust the chunk size [#16117](https://github.com/apache/druid/pull/16117) +* Added an error code to the failure type `InternalServerError` [#16186](https://github.com/apache/druid/pull/16186) +* Added a new index on the pending segments table for the datasource and `task_allocator_id` columns [#16355](https://github.com/apache/druid/pull/16355) +* Fixed a bug in the `MarkOvershadowedSegmentsAsUnused` Coordinator duty so that it also considers segments that are overshadowed by a segment that requires zero replicas [#16181](https://github.com/apache/druid/pull/16181) +* Fixed a bug where `numSegmentsKilled` was reported incorrectly [#16103](https://github.com/apache/druid/pull/16103) +* Fixed a bug where completion task reports were not generated for `index_parallel` tasks [#16042](https://github.com/apache/druid/pull/16042) +* Fixed an issue where concurrent replace skipped intervals locked by append locks during compaction [#16316](https://github.com/apache/druid/pull/16316) +* Improved error messages when a supervisor's checkpoint state is invalid [#16208](https://github.com/apache/druid/pull/16208) +* Improved serialization of `TaskReportMap` [#16217](https://github.com/apache/druid/pull/16217) +* Improved compaction segment read and published fields to include sequential compaction tasks [#16171](https://github.com/apache/druid/pull/16171) +* Improved the kill task so that it now accepts an optional list of unused segment versions to 
delete [#15994](https://github.com/apache/druid/pull/15994) +* Improved logging when ingestion tasks try to get lookups from the Coordinator at startup [#16287](https://github.com/apache/druid/pull/16287) +* Improved ingestion performance by parsing an input stream directly instead of converting it to a string and parsing the string as JSON [#15693](https://github.com/apache/druid/pull/15693) +* Improved the creation of the input row filter predicate in various batch tasks [#16196](https://github.com/apache/druid/pull/16196) +* Improved how Druid fetches tasks from the Overlord so that credentials are redacted [#16182](https://github.com/apache/druid/pull/16182) +* Optimized `isOvershadowed` for the case where there is a unique minor version for an interval [#15952](https://github.com/apache/druid/pull/15952) +* Removed the `EntryExistsException` thrown when trying to insert a duplicate task in the metadata store—Druid now throws a `DruidException` with the error code `entryAlreadyExists` [#14448](https://github.com/apache/druid/pull/14448) +* The task status output for a failed task now includes the exception message [#16286](https://github.com/apache/druid/pull/16286) + +### SQL-based ingestion + +#### Sorting on complex columns + +The MSQ task engine now supports sorting and grouping on complex columns. +This change also allows the MSQ task engine to roll up on JSON columns. + +[#16322](https://github.com/apache/druid/pull/16322) + +#### Manifest files for MSQ task engine exports + +Export queries that use the MSQ task engine now also create a manifest file at the destination, which lists the files created by the query. + +During a rolling update, older versions of workers don't return a list of exported files, and older Controllers don't create a manifest file. +Therefore, export queries run during this time might have incomplete manifests. 
+ +[#15953](https://github.com/apache/druid/pull/15953) + +#### `SortMerge` join support + +Druid now supports `SortMerge` join for `IS NOT DISTINCT FROM` operations. + +[#16003](https://github.com/apache/druid/pull/16003) + +#### State of compaction context parameter + +Added a new context parameter `storeCompactionState`. +When set to `true`, Druid records the state of compaction for each segment in the `lastCompactionState` segment field. + +[#15965](https://github.com/apache/druid/pull/15965) + +#### Selective loading of lookups + +Druid now supports selective loading of lookups in the task layer. Review Comment: ```suggestion We have built the foundation of selective lookup loading. As part of enabling this, `KillUnusedSegmentsTask` does not load lookups. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
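As a hedged sketch of how the `storeCompactionState` context parameter described in the notes above might be passed, the following shows an MSQ SQL task payload (as submitted to the `/druid/v2/sql/task` endpoint). The datasource and query are hypothetical placeholders, not from the release notes:

```json
{
  "query": "REPLACE INTO wikipedia OVERWRITE ALL SELECT * FROM TABLE(EXTERN(...)) PARTITIONED BY DAY",
  "context": {
    "storeCompactionState": true
  }
}
```

With this context set, the segments produced by the REPLACE query would carry their compaction state in the `lastCompactionState` field, so subsequent auto-compaction can recognize them as already compacted.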
