kfaraz commented on code in PR #17092: URL: https://github.com/apache/druid/pull/17092#discussion_r1820115613
########## docs/release-info/release-notes.md: ########## @@ -57,46 +57,588 @@ For tips about how to write a good release note, see [Release notes](https://git This section contains important information about new and existing features. -## Functional area and related changes +### Compaction features + +Druid now supports the following features: + +- Compaction scheduler with greater flexibility and control over when and what to compact. +- MSQ task engine-based auto-compaction for more performant compaction jobs. + +For more information, see [Compaction supervisors](#compaction-supervisors-experimental). + +[#16291](https://github.com/apache/druid/pull/16291) [#16768](https://github.com/apache/druid/pull/16768) + +Additionally, compaction tasks that take advantage of concurrent append and replace are now generally available, as part of concurrent append and replace becoming GA. + +### Window functions are GA + +[Window functions](https://druid.apache.org/docs/latest/querying/sql-window-functions) are now generally available in Druid's native engine and in the MSQ task engine. + +- You no longer need to use the query context `enableWindowing` to use window functions. [#17087](https://github.com/apache/druid/pull/17087) + +### Concurrent append and replace GA + +Concurrent append and replace is now GA. The feature safely replaces the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this feature is appending new data (such as with streaming ingestion) to an interval while compaction of that interval is already in progress. + +### Delta Lake improvements + +The community extension for Delta Lake has been improved to support [complex types](#delta-lake-complex-types) and [snapshot versions](#delta-lake-snapshot-versions). + +### Iceberg improvements + +The community extension for Iceberg has been improved. 
For more information, see [Iceberg improvements](#iceberg-improvements). + +### Projections (experimental) + +Druid 31.0.0 includes experimental support for a new feature called projections. Projections are grouped pre-aggregates of a segment that are automatically used at query time to optimize execution for any query that 'fits' the shape of the projection, reducing both computation and I/O cost by cutting the number of rows that need to be processed. Projections are contained within the segments of a datasource and do increase segment size, but they can share data, such as the value dictionaries of dictionary-encoded columns, with the columns of the base segment. + +Projections currently only support JSON-based ingestion, but they can be used by queries that use the MSQ task engine or the new Dart engine. Future development will allow projections to be created as part of SQL-based ingestion. + +We plan to continue improving this feature in the coming releases, but we're excited to release it now so that users can begin experimenting, since projections can dramatically improve query performance. + +For more information, see [Projections](#projections). + +### Low latency high complexity queries using Dart (experimental) + +Distributed Asynchronous Runtime Topology (Dart) is designed to support high-complexity queries, such as large joins, high-cardinality GROUP BY, subqueries, and common table expressions, which are commonly found in ad-hoc data warehouse workloads. Instead of using data warehouse engines like Spark or Presto to execute high-complexity queries, you can use Dart, alleviating the need for additional infrastructure. + +For more information, see [Dart](#dart). + +[#17140](https://github.com/apache/druid/pull/17140) + +### Storage improvements + +Druid 31.0.0 includes several improvements to how Druid stores data, including compressed columns and flexible segment sorting. 
For more information, see [Storage improvements](#storage-improvements-1). + +### Upgrade-related changes + +See the [Upgrade notes](#upgrade-notes) for more information about the following upgrade-related changes: +- [Array ingest mode now defaults to array](#array-ingest-mode-now-defaults-to-array) +- [Disabled ZK-based segment loading](#zk-based-segment-loading) +- [Removed task action audit logging](#removed-task-action-audit-logging) +- [Removed Firehose and FirehoseFactory](#removed-firehose-and-firehosefactory) +- [Removed the scan query legacy mode](#removed-the-scan-query-legacy-mode) + +### Deprecations + +#### Java 8 support + +Java 8 support is now deprecated and will be removed in 32.0.0. + +#### Other deprecations +- The deprecated API `/lockedIntervals` is now removed [#16799](https://github.com/apache/druid/pull/16799) +- The [cluster-level compaction API](#api-for-cluster-level-compaction-configuration) deprecates the task slots compaction API [#16803](https://github.com/apache/druid/pull/16803) +- The `arrayIngestMode` context parameter is deprecated and will be removed. For more information, see [Array ingest mode now defaults to array](#array-ingest-mode-now-defaults-to-array). + +## Functional areas and related changes This section contains detailed release notes separated by areas. ### Web console +#### Improvements to the stages display + +A number of improvements have been made to the query stages visualization. + +These changes include: +- Added a graph visualization to illustrate the flow of query stages [#17135](https://github.com/apache/druid/pull/17135) +- Added a column for CPU counters in the query stages detail view when they are present. Also added tooltips to expose potentially hidden data like CPU time [#17132](https://github.com/apache/druid/pull/17132) + +#### Dart +Added the ability to detect the presence of the Dart engine, run Dart queries from the console, and see currently running Dart queries. 
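The console submits Dart queries to the same endpoint described in the Querying section below (`/druid/v2/sql/dart`), which accepts the same JSON body as `/druid/v2/sql`. A minimal sketch of building such a request; the host, port, datasource, and `resultFormat` value are assumptions for illustration, not part of these notes:

```python
import json

# Hypothetical Router address; Dart must be enabled on the cluster via
# druid.msq.dart.enabled=true for this endpoint to exist.
DART_ENDPOINT = "http://localhost:8888/druid/v2/sql/dart"

# Same payload shape as Druid's standard SQL API, per the release notes.
payload = {
    "query": "SELECT channel, COUNT(*) AS cnt FROM wikipedia GROUP BY channel",
    "resultFormat": "objectLines",  # assumed; same formats as /druid/v2/sql
}

body = json.dumps(payload)
# To actually run it against a live cluster:
# import urllib.request
# req = urllib.request.Request(
#     DART_ENDPOINT, body.encode(), {"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```

Because the payload is identical to the standard SQL API's, existing client code can switch engines by changing only the endpoint path.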
+ +[#17147](https://github.com/apache/druid/pull/17147) + +#### Copy query results as SQL +You can now copy the results of a query as a Druid SQL statement. + +For example, copying a set of query results produces a statement like the following: +```sql +SELECT +  CAST("c1" AS VARCHAR) AS "channel", +  CAST("c2" AS VARCHAR) AS "cityName", +  DECODE_BASE64_COMPLEX('thetaSketch', "c3") AS "user_theta" +FROM ( +  VALUES +  ('ca', NULL, 'AQMDAAA6zJOcUskA1pEMGA=='), +  ('de', NULL, 'AQMDAAA6zJP43ITYexvoEw=='), +  ('de', NULL, 'AQMDAAA6zJNtue8WOvrJdA=='), +  ('en', NULL, 'AQMDAAA6zJMruSUUqmzufg=='), +  ('en', NULL, 'AQMDAAA6zJM6dC5sW2sTEg=='), +  ('en', NULL, 'AQMDAAA6zJM6dC5sW2sTEg=='), +  ('en', NULL, 'AQMDAAA6zJPqjEoIBIGtDw=='), +  ('en', 'Caulfield', 'AQMDAAA6zJOOtGipKE6KIA=='), +  ('fa', NULL, 'AQMDAAA6zJM8nxZkoGPlLw=='), +  ('vi', NULL, 'AQMDAAA6zJMk4ZadSFqHJw==') +) AS "t" ("c1", "c2", "c3") +``` +[#16458](https://github.com/apache/druid/pull/16458) + +#### Explore view improvements +You can now configure the Explore view on top of a source query instead of only existing tables. +You can also point and click to edit the source query, store measures in the source query, +and return to the state of your view using stateful URLs. 
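The copy-as-SQL output shown earlier carries complex-typed values (here, theta sketches) as base64 literals that `DECODE_BASE64_COMPLEX` reconstitutes at query time. A minimal sketch, using two literals from the example statement, that only checks the encoding is well-formed base64 (the sketch binary format itself is opaque here):

```python
import base64

# Two of the literal values from the generated statement above.
encoded = [
    "AQMDAAA6zJOcUskA1pEMGA==",
    "AQMDAAA6zJP43ITYexvoEw==",
]

# Each literal decodes to a small binary blob; in Druid,
# DECODE_BASE64_COMPLEX('thetaSketch', ...) interprets these bytes
# as a serialized thetaSketch object.
decoded = [base64.b64decode(s, validate=True) for s in encoded]
sizes = [len(b) for b in decoded]
print(sizes)  # each 24-character, 2-padding literal decodes to 16 bytes
```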
+[#17180](https://github.com/apache/druid/pull/17180) + +Other changes to the Explore view include the following: +- Added the ability to define expressions and measures on the fly +- Added the ability to hide all null columns in the record table +- Added the ability to expand a nested column into its constituent paths +[#17213](https://github.com/apache/druid/pull/17213) [#17225](https://github.com/apache/druid/pull/17225) [#17234](https://github.com/apache/druid/pull/17234) [#17180](https://github.com/apache/druid/pull/17180) + +#### Support Delta Lake ingestion in the data loaders +- Added the Delta tile to the data loader for SQL-based batch and classic batch ingestion methods +[#17160](https://github.com/apache/druid/pull/17160) [#17023](https://github.com/apache/druid/pull/17023) + +#### Support Kinesis input format +The web console now supports the Kinesis input format. +[#16850](https://github.com/apache/druid/pull/16850) + #### Other web console improvements +- You can now search for datasources in the segment timeline visualization; previously you had to find them manually [#16371](https://github.com/apache/druid/pull/16371) +- JSON values in query results now display both raw and formatted JSON, making the data easier to read and troubleshoot [#16632](https://github.com/apache/druid/pull/16632) +- Added the ability to submit a suspended supervisor from the data loader [#16696](https://github.com/apache/druid/pull/16696) +- Added column mapping information to the explain plan [#16598](https://github.com/apache/druid/pull/16598) +- Updated the web console to use array mode by default for schema discovery [#17133](https://github.com/apache/druid/pull/17133) +- Fixed an NPE due to null values in numeric columns [#16760](https://github.com/apache/druid/pull/16760) ### Ingestion +#### Optimized the loading of broadcast data sources + +Previously, all services and tasks downloaded all broadcast data sources. 
+To save task storage space and reduce task startup time, Druid now prevents kill tasks and MSQ controller tasks from downloading unneeded broadcast data sources. All other tasks still load all broadcast data sources. + +The `CLIPeon` command line option `--loadBroadcastSegments` is deprecated in favor of `--loadBroadcastDatasourceMode`. + +[#17027](https://github.com/apache/druid/pull/17027) + +#### General ingestion improvements + +- The default value for `druid.indexer.tasklock.batchAllocationWaitTime` is now 0 [#16578](https://github.com/apache/druid/pull/16578) +- Hadoop-based ingestion now works on Kubernetes deployments [#16726](https://github.com/apache/druid/pull/16726) +- Hadoop-based ingestion now has a Boolean `useMaxMemoryEstimates` config parameter, which controls how memory footprint gets estimated. The default is false, so that the behavior matches native JSON-based batch ingestion [#16280](https://github.com/apache/druid/pull/16280) +- Added `druid-parquet-extensions` to all example quickstart configurations [#16664](https://github.com/apache/druid/pull/16664) +- Added support for ingesting CSV format data into Kafka records when Kafka ingestion is enabled with `ioConfig.type = kafka` [#16630](https://github.com/apache/druid/pull/16630) +- Added logging for sketches on workers [#16697](https://github.com/apache/druid/pull/16697) +- Removed the obsolete `index_realtime` and `index_realtime_appenderator` tasks; you can no longer use these tasks to ingest data [#16602](https://github.com/apache/druid/pull/16602) +- Renamed `TaskStorageQueryAdapter` to `TaskQueryTool` and removed the `isAudited` method [#16750](https://github.com/apache/druid/pull/16750) +- Improved Overlord performance by reducing redundant calls in SQL statements [#16839](https://github.com/apache/druid/pull/16839) +- Improved `CustomExceptionMapper` so that it returns a correct failure message [#17016](https://github.com/apache/druid/pull/17016) +- Improved time filtering
in subqueries and non-table data sources [#17173](https://github.com/apache/druid/pull/17173) +- Improved `WindowOperatorQueryFrameProcessor` to avoid unnecessary re-runs [#17211](https://github.com/apache/druid/pull/17211) +- Improved memory management by dividing the amount of `partitionStatsMemory` by two to account for two simultaneous statistics collectors [#17216](https://github.com/apache/druid/pull/17216) +- Fixed an NPE in `CompactSegments` [#16713](https://github.com/apache/druid/pull/16713) +- Fixed the Parquet reader to ensure that Druid reads the required columns for a filter from the Parquet data files [#16874](https://github.com/apache/druid/pull/16874) +- Fixed a distinct sketches issue where Druid called `retainedKeys.firstKey()` twice when adding another sketch [#17184](https://github.com/apache/druid/pull/17184) +- Fixed a `WindowOperatorQueryFrameProcessor` issue where larger queries could reach the frame writer's capacity, preventing it from outputting all of the result rows [#17209](https://github.com/apache/druid/pull/17209) +- Fixed native ingestion task failures during rolling upgrades from a version before Druid 30 [#17219](https://github.com/apache/druid/pull/17219) + #### SQL-based ingestion -##### Other SQL-based ingestion improvements +##### Optimized S3 storage writing for MSQ durable storage + +For queries that use the MSQ task engine and write their output to S3 as durable storage, uploading chunks of data is now faster. + +[#16481](https://github.com/apache/druid/pull/16481) + +##### Improved lookup performance + +Improved lookup performance for queries that use the MSQ task engine by only loading required lookups. This applies to both ingestion and querying. 
+ +[#16358](https://github.com/apache/druid/pull/16358) + +#### Other SQL-based ingestion improvements + +- Added the ability to use `useConcurrentLocks` in the task context to determine task lock type [#17193](https://github.com/apache/druid/pull/17193) +- Improved error handling when retrieving Avro schemas from the registry [#16684](https://github.com/apache/druid/pull/16684) +- Fixed issues related to partitioning boundaries in the MSQ task engine's window functions [#16729](https://github.com/apache/druid/pull/16729) +- Fixed a boost column issue causing quantile sketches to incorrectly estimate the number of output partitions to create [#17141](https://github.com/apache/druid/pull/17141) +- Fixed an issue with the `ScanQueryFrameProcessor` cursor build not adjusting intervals [#17168](https://github.com/apache/druid/pull/17168) +- Improved worker cancellation for the MSQ task engine to prevent race conditions [#17046](https://github.com/apache/druid/pull/17046) +- Improved memory management to better support multi-threaded workers [#17057](https://github.com/apache/druid/pull/17057) +- Added a new format for serializing sketches between the MSQ controller and workers to reduce memory usage [#16269](https://github.com/apache/druid/pull/16269) +- Fixed handling of null bytes that led to a runtime exception for "Invalid value start byte" [#17232](https://github.com/apache/druid/pull/17232) +- Updated logic to fix incorrect query results for comparisons involving arrays [#16780](https://github.com/apache/druid/pull/16780) +- You can now pass a custom `DimensionSchema` map to an MSQ query destination of type `DataSourceMSQDestination` instead of using the default values 
[#16864](https://github.com/apache/druid/pull/16864) +- Fixed the calculation of suggested memory in `WorkerMemoryParameters` to account for `maxConcurrentStages`, which improves the accuracy of error messages [#17108](https://github.com/apache/druid/pull/17108) +- Optimized the row-based frame writer to reduce failures when writing larger single rows to frames [#17094](https://github.com/apache/druid/pull/17094) + +### Streaming ingestion + +#### New Kinesis input format -#### Streaming ingestion +Added a Kinesis input format and reader for timestamp and payload parsing. +The reader relies on a `ByteEntity` type of `KinesisRecordEntity`, which includes the underlying Kinesis record. -##### Other streaming ingestion improvements +[#16813](https://github.com/apache/druid/pull/16813) + +#### Streaming ingestion improvements + +- Added a check for handing off upgraded real-time segments. This prevents data from being temporarily unavailable for queries during segment handoff [#16162](https://github.com/apache/druid/pull/16162) +- Improved the user experience for autoscaling Kinesis. Whether autoscaling is based on the max lag per shard or the total lag across shards is now controlled by the `lagAggregate` config, which defaults to sum [#16334](https://github.com/apache/druid/pull/16334) +- Improved the Supervisor so that it doesn't change to a running state from idle if the Overlord restarts [#16844](https://github.com/apache/druid/pull/16844) ### Querying +#### Dart + +Dart is a query engine with multi-threaded workers that conduct in-memory shuffles to provide tunable concurrency and infrastructure costs for low-latency, high-complexity queries. + +To try out Dart, set `druid.msq.dart.enabled` to `true` in your common runtime properties. Then, you can select Dart as a query engine in the web console or through the API `/druid/v2/sql/dart`, which accepts the same JSON payload as `/druid/v2/sql`. Dart is fully compatible with current Druid query shapes and Druid's storage format. 
That means you can try Dart with your existing queries and datasources. + +#### Projections + +As an experimental feature, projections are not well documented yet, but they can be defined for streaming ingestion and 'classic' batch ingestion as part of the `dataSchema`. For an example ingestion spec, see [the pull request description for #17214](https://github.com/apache/druid/pull/17214). + +The `groupingColumns` field in the spec defines the order in which data is sorted in the projection. Instead of explicitly defining granularity as you do for the base table, you define it with a virtual column. During ingestion, the processing logic finds the ‘finest’ granularity virtual column that is a `timestamp_floor` expression and uses it as the `__time` column for the projection. Projections do not need to have a time column defined; in that case, they can still match queries that are not grouping on time. + +New query context flags have been added to aid experimentation with projections: + +- `useProjection` accepts a specific projection name and instructs the query engine that it must use that projection, and will fail the query if the projection does not match the query +- `forceProjections` accepts true or false and instructs the query engine that it must use a projection, and will fail the query if it cannot find a matching projection +- `noProjections` accpets true or false and instructs the query engines to not use any projections Review Comment: ```suggestion - `noProjections` accepts true or false and instructs the query engines to not use any projections ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
