kfaraz commented on code in PR #17092: URL: https://github.com/apache/druid/pull/17092#discussion_r1820115613
########## docs/release-info/release-notes.md: ########## @@ -57,46 +57,588 @@ For tips about how to write a good release note, see [Release notes](https://git This section contains important information about new and existing features. -## Functional area and related changes +### Compaction features + +Druid now supports the following features: + +- Compaction scheduler with greater flexibility and control over when and what to compact. +- MSQ task engine-based auto-compaction for more performant compaction jobs. + +For more information, see [Compaction supervisors](#compaction-supervisors-experimental). + +[#16291](https://github.com/apache/druid/pull/16291) [#16768](https://github.com/apache/druid/pull/16768) + +Additionally, compaction tasks that take advantage of concurrent append and replace are now generally available, as part of concurrent append and replace becoming GA. + +### Window functions are GA + +[Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) are now generally available in Druid's native engine and in the MSQ task engine. + +- You no longer need to use the query context `enableWindowing` to use window functions. [#17087](https://github.com/apache/druid/pull/17087) + +### Concurrent append and replace GA + +Concurrent append and replace is now GA. The feature safely replaces the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this feature is appending new data (such as with streaming ingestion) to an interval while compaction of that interval is already in progress. + +### Delta Lake improvements + +The community extension for Delta Lake has been improved to support [complex types](#delta-lake-complex-types) and [snapshot versions](#delta-lake-snapshot-versions). + +### Iceberg improvements + +The community extension for Iceberg has been improved. 
For more information, see [Iceberg improvements](#iceberg-improvements). + +### Projections (experimental) + +Druid 31.0.0 includes experimental support for a new feature called projections. Projections are grouped pre-aggregates of a segment that are automatically used at query time to optimize execution for any query that 'fits' the shape of the projection, reducing both computation and I/O cost by cutting the number of rows that need to be processed. Projections are contained within the segments of a datasource and do increase segment size, but they can share data, such as the value dictionaries of dictionary-encoded columns, with the columns of the base segment. + +Projections currently only support JSON-based ingestion, but they can be used by queries that use the MSQ task engine or the new Dart engine. Future development will allow projections to be created as part of SQL-based ingestion. + +We plan to continue improving this feature in the coming releases, but we're excited to release it now so that users can begin experimenting, since projections can dramatically improve query performance. + +For more information, see [Projections](#projections). + +### Low latency high complexity queries using Dart (experimental) + +Distributed Asynchronous Runtime Topology (Dart) is designed to support high-complexity queries, such as large joins, high-cardinality GROUP BY, subqueries, and common table expressions, which are commonly found in ad-hoc data warehouse workloads. Instead of using data warehouse engines like Spark or Presto to execute high-complexity queries, you can use Dart, alleviating the need for additional infrastructure. + +For more information, see [Dart](#dart). + +[#17140](https://github.com/apache/druid/pull/17140) + +### Storage improvements + +Druid 31.0.0 includes several improvements to how Druid stores data, including compressed columns and flexible segment sorting. 
For more information, see [Storage improvements](#storage-improvements-1). + +### Upgrade-related changes + +See the [Upgrade notes](#upgrade-notes) for more information about the following upgrade-related changes: +- [Array ingest mode now defaults to array](#array-ingest-mode-now-defaults-to-array) +- [Disabled ZK-based segment loading](#zk-based-segment-loading) +- [Removed task action audit logging](#removed-task-action-audit-logging) +- [Removed Firehose and FirehoseFactory](#removed-firehose-and-firehosefactory) +- [Removed the scan query legacy mode](#removed-the-scan-query-legacy-mode) + +### Deprecations + +#### Java 8 support + +Java 8 support is now deprecated and will be removed in 32.0.0. + +#### Other deprecations +- The deprecated API `/lockedIntervals` is now removed [#16799](https://github.com/apache/druid/pull/16799) +- The [cluster-level compaction API](#api-for-cluster-level-compaction-configuration) deprecates the task slots compaction API [#16803](https://github.com/apache/druid/pull/16803) +- The `arrayIngestMode` context parameter is deprecated and will be removed. For more information, see [Array ingest mode now defaults to array](#array-ingest-mode-now-defaults-to-array). + +## Functional areas and related changes This section contains detailed release notes separated by areas. ### Web console +#### Improvements to the stages display + +A number of improvements have been made to the query stages visualization. + +These changes include: +- Added a graph visualization to illustrate the flow of query stages [#17135](https://github.com/apache/druid/pull/17135) +- Added a column for CPU counters in the query stages detail view when they are present. Also added tooltips to expose potentially hidden data like CPU time [#17132](https://github.com/apache/druid/pull/17132) + +#### Dart +Added the ability to detect the presence of the Dart engine, run Dart queries from the console, and see currently running Dart queries. 
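The console submits Dart queries to the same endpoint described in the Querying section below (`/druid/v2/sql/dart`), which accepts the same JSON body as `/druid/v2/sql`. A minimal sketch of building such a request; the host, port, datasource, and `resultFormat` value are assumptions for illustration, not part of these notes:

```python
import json

# Hypothetical Router address; Dart must be enabled on the cluster via
# druid.msq.dart.enabled=true for this endpoint to exist.
DART_ENDPOINT = "http://localhost:8888/druid/v2/sql/dart"

# Same payload shape as Druid's standard SQL API, per the release notes.
payload = {
    "query": "SELECT channel, COUNT(*) AS cnt FROM wikipedia GROUP BY channel",
    "resultFormat": "objectLines",  # assumed; same formats as /druid/v2/sql
}

body = json.dumps(payload)
# To actually run it against a live cluster:
# import urllib.request
# req = urllib.request.Request(
#     DART_ENDPOINT, body.encode(), {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Because the payload is identical to the standard SQL API's, existing client code can switch engines by changing only the endpoint path.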
+ +[#17147](https://github.com/apache/druid/pull/17147) + +#### Copy query results as SQL +You can now copy the results of a query as a Druid SQL statement. + +For example, copying a set of query results produces a statement like the following: +```sql +SELECT +  CAST("c1" AS VARCHAR) AS "channel", +  CAST("c2" AS VARCHAR) AS "cityName", +  DECODE_BASE64_COMPLEX('thetaSketch', "c3") AS "user_theta" +FROM ( +  VALUES +  ('ca', NULL, 'AQMDAAA6zJOcUskA1pEMGA=='), +  ('de', NULL, 'AQMDAAA6zJP43ITYexvoEw=='), +  ('de', NULL, 'AQMDAAA6zJNtue8WOvrJdA=='), +  ('en', NULL, 'AQMDAAA6zJMruSUUqmzufg=='), +  ('en', NULL, 'AQMDAAA6zJM6dC5sW2sTEg=='), +  ('en', NULL, 'AQMDAAA6zJM6dC5sW2sTEg=='), +  ('en', NULL, 'AQMDAAA6zJPqjEoIBIGtDw=='), +  ('en', 'Caulfield', 'AQMDAAA6zJOOtGipKE6KIA=='), +  ('fa', NULL, 'AQMDAAA6zJM8nxZkoGPlLw=='), +  ('vi', NULL, 'AQMDAAA6zJMk4ZadSFqHJw==') +) AS "t" ("c1", "c2", "c3") +``` +[#16458](https://github.com/apache/druid/pull/16458) + +#### Explore view improvements +You can now configure the Explore view on top of a source query instead of only existing tables. +You can also point and click to edit the source query, store measures in the source query, +and return to the state of your view using stateful URLs. 
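The copy-as-SQL output shown earlier carries complex-typed values (here, theta sketches) as base64 literals that `DECODE_BASE64_COMPLEX` reconstitutes at query time. A minimal sketch, using two literals from the example statement, that only checks the encoding is well-formed base64 (the sketch binary format itself is opaque here):

```python
import base64

# Two of the literal values from the generated statement above.
encoded = [
    "AQMDAAA6zJOcUskA1pEMGA==",
    "AQMDAAA6zJP43ITYexvoEw==",
]

# Each literal decodes to a small binary blob; in Druid,
# DECODE_BASE64_COMPLEX('thetaSketch', ...) interprets these bytes
# as a serialized thetaSketch object.
decoded = [base64.b64decode(s, validate=True) for s in encoded]
sizes = [len(b) for b in decoded]
print(sizes)  # each 24-character, 2-padding literal decodes to 16 bytes
```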
+[#17180](https://github.com/apache/druid/pull/17180) + +Other changes to the Explore view include the following: +- Added the ability to define expressions and measures on the fly +- Added the ability to hide all null columns in the record table +- Added the ability to expand a nested column into its constituent paths +[#17213](https://github.com/apache/druid/pull/17213) [#17225](https://github.com/apache/druid/pull/17225) [#17234](https://github.com/apache/druid/pull/17234) [#17180](https://github.com/apache/druid/pull/17180) + +#### Support Delta Lake ingestion in the data loaders +- Added the Delta tile to the data loader for SQL-based batch and classic batch ingestion methods +[#17160](https://github.com/apache/druid/pull/17160) [#17023](https://github.com/apache/druid/pull/17023) + +#### Support Kinesis input format +The web console now supports the Kinesis input format. +[#16850](https://github.com/apache/druid/pull/16850) + #### Other web console improvements +- You can now search for datasources in the segment timeline visualization; previously you had to find them manually [#16371](https://github.com/apache/druid/pull/16371) +- JSON values in query results now display both raw and formatted JSON, making the data easier to read and troubleshoot [#16632](https://github.com/apache/druid/pull/16632) +- Added the ability to submit a suspended supervisor from the data loader [#16696](https://github.com/apache/druid/pull/16696) +- Added column mapping information to the explain plan [#16598](https://github.com/apache/druid/pull/16598) +- Updated the web console to use array mode by default for schema discovery [#17133](https://github.com/apache/druid/pull/17133) +- Fixed an NPE due to null values in numeric columns [#16760](https://github.com/apache/druid/pull/16760) ### Ingestion +#### Optimized the loading of broadcast data sources + +Previously, all services and tasks downloaded all broadcast data sources. 
+To save task storage space and reduce task startup time, Druid now prevents kill tasks and MSQ controller tasks from downloading unneeded broadcast data sources. All other tasks still load all broadcast data sources. + +The `CLIPeon` command line option `--loadBroadcastSegments` is deprecated in favor of `--loadBroadcastDatasourceMode`. + +[#17027](https://github.com/apache/druid/pull/17027) + +#### General ingestion improvements + +- The default value for `druid.indexer.tasklock.batchAllocationWaitTime` is now 0 [#16578](https://github.com/apache/druid/pull/16578) +- Hadoop-based ingestion now works on Kubernetes deployments [#16726](https://github.com/apache/druid/pull/16726) +- Hadoop-based ingestion now has a Boolean `useMaxMemoryEstimates` config parameter, which controls how memory footprint gets estimated. The default is false, so that the behavior matches native JSON-based batch ingestion [#16280](https://github.com/apache/druid/pull/16280) +- Added `druid-parquet-extensions` to all example quickstart configurations [#16664](https://github.com/apache/druid/pull/16664) +- Added support for ingesting CSV format data into Kafka records when Kafka ingestion is enabled with `ioConfig.type = kafka` [#16630](https://github.com/apache/druid/pull/16630) +- Added logging for sketches on workers [#16697](https://github.com/apache/druid/pull/16697) +- Removed the obsolete `index_realtime` and `index_realtime_appenderator` tasks; you can no longer use these tasks to ingest data [#16602](https://github.com/apache/druid/pull/16602) +- Renamed `TaskStorageQueryAdapter` to `TaskQueryTool` and removed the `isAudited` method [#16750](https://github.com/apache/druid/pull/16750) +- Improved Overlord performance by reducing redundant calls in SQL statements [#16839](https://github.com/apache/druid/pull/16839) +- Improved `CustomExceptionMapper` so that it returns a correct failure message [#17016](https://github.com/apache/druid/pull/17016) +- Improved time filtering
in subqueries and non-table data sources [#17173](https://github.com/apache/druid/pull/17173) +- Improved `WindowOperatorQueryFrameProcessor` to avoid unnecessary re-runs [#17211](https://github.com/apache/druid/pull/17211) +- Improved memory management by dividing the amount of `partitionStatsMemory` by two to account for two simultaneous statistics collectors [#17216](https://github.com/apache/druid/pull/17216) +- Fixed an NPE in `CompactSegments` [#16713](https://github.com/apache/druid/pull/16713) +- Fixed the Parquet reader to ensure that Druid reads the required columns for a filter from the Parquet data files [#16874](https://github.com/apache/druid/pull/16874) +- Fixed a distinct sketches issue where Druid called `retainedKeys.firstKey()` twice when adding another sketch [#17184](https://github.com/apache/druid/pull/17184) +- Fixed a `WindowOperatorQueryFrameProcessor` issue where larger queries could reach the frame writer's capacity, preventing it from outputting all of the result rows [#17209](https://github.com/apache/druid/pull/17209) +- Fixed native ingestion task failures during rolling upgrades from a version before Druid 30 [#17219](https://github.com/apache/druid/pull/17219) + #### SQL-based ingestion -##### Other SQL-based ingestion improvements +##### Optimized S3 storage writing for MSQ durable storage + +For queries that use the MSQ task engine and write their output to S3 as durable storage, uploading chunks of data is now faster. + +[#16481](https://github.com/apache/druid/pull/16481) + +##### Improved lookup performance + +Improved lookup performance for queries that use the MSQ task engine by only loading required lookups. This applies to both ingestion and querying. 
+ +[#16358](https://github.com/apache/druid/pull/16358) + +#### Other SQL-based ingestion improvements + +- Added the ability to use `useConcurrentLocks` in the task context to determine task lock type [#17193](https://github.com/apache/druid/pull/17193) +- Improved error handling when retrieving Avro schemas from the registry [#16684](https://github.com/apache/druid/pull/16684) +- Fixed issues related to partitioning boundaries in the MSQ task engine's window functions [#16729](https://github.com/apache/druid/pull/16729) +- Fixed a boost column issue causing quantile sketches to incorrectly estimate the number of output partitions to create [#17141](https://github.com/apache/druid/pull/17141) +- Fixed an issue with the `ScanQueryFrameProcessor` cursor build not adjusting intervals [#17168](https://github.com/apache/druid/pull/17168) +- Improved worker cancellation for the MSQ task engine to prevent race conditions [#17046](https://github.com/apache/druid/pull/17046) +- Improved memory management to better support multi-threaded workers [#17057](https://github.com/apache/druid/pull/17057) +- Added a new format for serializing sketches between the MSQ controller and workers to reduce memory usage [#16269](https://github.com/apache/druid/pull/16269) +- Fixed handling of null bytes that led to a runtime exception for "Invalid value start byte" [#17232](https://github.com/apache/druid/pull/17232) +- Updated logic to fix incorrect query results for comparisons involving arrays [#16780](https://github.com/apache/druid/pull/16780) +- You can now pass a custom `DimensionSchema` map to an MSQ query destination of type `DataSourceMSQDestination` instead of using the default values 
[#16864](https://github.com/apache/druid/pull/16864) +- Fixed the calculation of suggested memory in `WorkerMemoryParameters` to account for `maxConcurrentStages`, which improves the accuracy of error messages [#17108](https://github.com/apache/druid/pull/17108) +- Optimized the row-based frame writer to reduce failures when writing larger single rows to frames [#17094](https://github.com/apache/druid/pull/17094) + +### Streaming ingestion + +#### New Kinesis input format -#### Streaming ingestion +Added a Kinesis input format and reader for timestamp and payload parsing. +The reader relies on a `ByteEntity` type of `KinesisRecordEntity`, which includes the underlying Kinesis record. -##### Other streaming ingestion improvements +[#16813](https://github.com/apache/druid/pull/16813) + +#### Streaming ingestion improvements + +- Added a check for handing off upgraded real-time segments. This prevents data from being temporarily unavailable for queries during segment handoff [#16162](https://github.com/apache/druid/pull/16162) +- Improved the user experience for autoscaling Kinesis. Whether autoscaling is based on the max lag per shard or the total lag across shards is now controlled by the `lagAggregate` config, which defaults to sum [#16334](https://github.com/apache/druid/pull/16334) +- Improved the Supervisor so that it doesn't change to a running state from idle if the Overlord restarts [#16844](https://github.com/apache/druid/pull/16844) ### Querying +#### Dart + +Dart is a query engine with multi-threaded workers that conduct in-memory shuffles to provide tunable concurrency and infrastructure costs for low-latency, high-complexity queries. + +To try out Dart, set `druid.msq.dart.enabled` to `true` in your common runtime properties. Then, you can select Dart as a query engine in the web console or through the API `/druid/v2/sql/dart`, which accepts the same JSON payload as `/druid/v2/sql`. Dart is fully compatible with current Druid query shapes and Druid's storage format. 
That means you can try Dart with your existing queries and datasources. + +#### Projections + +As an experimental feature, projections are not well documented yet, but they can be defined for streaming ingestion and 'classic' batch ingestion as part of the `dataSchema`. For an example ingestion spec, see [the pull request description for #17214](https://github.com/apache/druid/pull/17214). + +The `groupingColumns` field in the spec defines the order in which data is sorted in the projection. Instead of explicitly defining granularity as you do for the base table, you define it with a virtual column. During ingestion, the processing logic finds the ‘finest’ granularity virtual column that is a `timestamp_floor` expression and uses it as the `__time` column for the projection. Projections do not need to have a time column defined; in that case, they can still match queries that are not grouping on time. + +New query context flags have been added to aid experimentation with projections: + +- `useProjection` accepts a specific projection name and instructs the query engine that it must use that projection, and will fail the query if the projection does not match the query +- `forceProjections` accepts true or false and instructs the query engine that it must use a projection, and will fail the query if it cannot find a matching projection +- `noProjections` accpets true or false and instructs the query engines to not use any projections Review Comment: ```suggestion - `noProjections` accepts true or false and instructs the query engines to not use any projections ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
