jon-wei opened a new issue #10462: URL: https://github.com/apache/druid/issues/10462
Apache Druid 0.20.0 contains around 140 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the [complete list of changes](https://github.com/apache/druid/compare/0.19.0...0.20.0) and [everything tagged to the milestone](https://github.com/apache/druid/milestone/40) for further details. # <a name="20-new-features" href="#20-new-features">#</a> New Features ## <a name="20-hash-segment-pruning" href="#20-hash-segment-pruning">#</a> Query segment pruning with hash partitioning Druid now supports query-time segment pruning (excluding certain segments as read candidates for a query) for hash partitioned segments. This optimization applies when all of the `partitionDimensions` specified in the hash partition spec during ingestion time are present in the filter set of a query, and the filters in the query filter on discrete values of the `partitionDimensions` (e.g., selector filters). Segment pruning with hash partitioning is not supported with non-discrete filters such as bound filters. For existing users with existing segments, you will need to reingest those segments to take advantage of this new feature, as the segment pruning requires a `partitionFunction` to be stored together with the segments, which does not exist in segments created by older versions of Druid. It is not necessary to specify the `partitionFunction` explicitly, as the default is the same partition function that was used in prior versions of Druid. Note that segments created with a default `partitionDimensions` value (partition by all dimensions + the time column) cannot be pruned in this manner, the segments need to be created with an explicit `partitionDimensions`. https://github.com/apache/druid/pull/9810 https://github.com/apache/druid/pull/10288 ## <a name="20-cluster-wide-default-query-context" href="#20-cluster-wide-default-query-context">#</a> Cluster-wide default query context settings It is now possible to set cluster-wide default query context properties by adding a configuration of the form `druid.query.override.default.context.*`, with `*` replaced by the property name. https://github.com/apache/druid/pull/10208 ## <a name="20-improved-retention-rules-ui" href="#20-improved-retention-rules-ui">#</a> Improved retention rules UI The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible. https://github.com/apache/druid/pull/10226 ## <a name="20-groupby-offset" href="#20-groupby-offset">#</a> `offset` parameter for GroupBy and Scan queries It is now possible set an `offset` parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results. Please see https://druid.apache.org/docs/latest/querying/limitspec.html and https://druid.apache.org/docs/latest/querying/scan-query.html for details. https://github.com/apache/druid/pull/10235 https://github.com/apache/druid/pull/10233 ## <a name="20-sql-offset" href="#20-sql-offset">#</a> `OFFSET` clause for SQL queries Druid SQL queries now support an `OFFSET` clause. Please see https://druid.apache.org/docs/latest/querying/sql.html#offset for details. https://github.com/apache/druid/pull/10279 ## <a name="20-sql-contains" href="#20-sql-contains">#</a> Substring search operators Druid has added new substring search operators in its expression language and for SQL queries. Please see documentation for `CONTAINS_STRING` and `ICONTAINS_STRING` string functions for Druid SQL (https://druid.apache.org/docs/latest/querying/sql.html#string-functions) and documentation for `contains_string` and `icontains_string` for the Druid expression language (https://druid.apache.org/docs/latest/misc/math-expr.html#string-functions). https://github.com/apache/druid/pull/10350 ## <a name="20-sql-union-all" href="#20-sql-union-all">#</a> UNION ALL operator for SQL queries Druid SQL queries now support the `UNION ALL` operator, which fuses the results of multiple queries together. Please see https://druid.apache.org/docs/latest/querying/sql.html#union-all for details on what query shapes are supported by this operator. https://github.com/apache/druid/pull/10324 ## <a name="20-vectorized-min-max" href="#20-vectorized-min-max">#</a> Vectorization support for long, double, float min & max aggregators Vectorization support has been added for several aggregation types: numeric min/max aggregators, variance aggregators, ANY aggregators, and aggregators from the `druid-histogram` extension. https://github.com/apache/druid/pull/10260 - numeric min/max https://github.com/apache/druid/pull/10304 - histogram https://github.com/apache/druid/pull/10338 - ANY https://github.com/apache/druid/pull/10390 - variance ## <a name="20-vectorized-virtual-columns" href="#20-vectorized-virtual-columns">#</a> Vectorization support for expression virtual columns Expression virtual columns now have vectorization support (depending on the expressions being used), which an results in a 3-5x performance improvement in some cases. Please see https://druid.apache.org/docs/latest/misc/math-expr.html#vectorization-support for details on the specific expressions that support vectorization, and https://druid.apache.org/docs/latest/querying/query-context.html#vectorization-parameters for more information on query context parameters that control vectorization. https://github.com/apache/druid/pull/10388 https://github.com/apache/druid/pull/10401 https://github.com/apache/druid/pull/10432 ## <a name="20-split-hint-max-files" href="#20-split-hint-max-files">#</a> Subtask file count limits for parallel batch ingestion The size-based `splitHintSpec` now supports a new `maxNumFiles` parameter, which limits how many files can be assigned to individual subtasks in parallel batch ingestion. The segment-based `splitHintSpec` used for reingesting data from existing Druid segments also has a new `maxNumSegments` parameter which functions similarly. Please see https://druid.apache.org/docs/latest/ingestion/native-batch.html#split-hint-spec for more details. https://github.com/apache/druid/pull/10243 ## <a name="20-redis-extension" href="#20-redis-extension">#</a> Redis cache extension enhancements The Redis cache extension now supports Redis Cluster, selecting which database is used, connecting to password-protected servers, and period-style configurations for the `expiration` and `timeout` properties. https://github.com/apache/druid/pull/10240 ## <a name="20-auto-compaction-partition" href="#20-auto-compaction-partition">#</a> Support for all partitioning schemes for auto-compaction A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new `partitionsSpec` property in the compaction `tuningConfig` for more details: https://druid.apache.org/docs/latest/configuration/index.html#compaction-tuningconfig https://druid.apache.org/docs/latest/configuration/index.html#compaction-tuningconfig https://github.com/apache/druid/pull/10307 ## <a name="20-combining-input-source" href="#20-combining-input-source">#</a> Combining InputSource A new combining InputSource has been added, allowing the user to combine multiple input sources during ingestion. Please see https://druid.apache.org/docs/latest/ingestion/native-batch.html#combining-input-source for more details. https://github.com/apache/druid/pull/10387 ## <a name="20-autocompaction-status-api" href="#20-autocompaction-status-api">#</a> Auto-compaction status API A new coordinator API which shows the status of auto-compaction for a datasource has been added. The new API shows whether auto-compaction is enabled for a datasource, and a summary of how far compaction has progressed. The web console has also been updated to show this information: https://user-images.githubusercontent.com/177816/94326243-9d07e780-ff57-11ea-9f80-256fa08580f0.png TBD: pending docs for this feature, will link when available https://github.com/apache/druid/pull/10371 https://github.com/apache/druid/pull/10438 ## <a name="20-auto-num-shards" href="#20-auto-num-shards">#</a> Automatically determine numShards for parallel ingestion hash partitioning When hash partitioning is used in parallel batch ingestion, it is no longer necessary to specify `numShards` in the partition spec. Druid can now automatically determine a number of shards by scanning the data in a new ingestion phase that determines the cardinalities of the partitioning key. https://github.com/apache/druid/pull/10419 ## <a name="20-task-slot-metrics" href="#20-task-slot-metrics">#</a> Task slot usage metrics New task slot usage metrics have been added. Please see the entries for the `taskSlot` metrics at https://druid.apache.org/docs/latest/operations/metrics.html#indexing-service for more details. https://github.com/apache/druid/pull/10379 ## <a name="20-disable-server-version" href="#20-disable-server-version">#</a> Disable sending server version in response headers It is now possible to disable sending of server version information in Druid's response headers. This is controlled by a new property `druid.server.http.sendServerVersion`, which defaults to `true`. https://github.com/apache/druid/pull/9832 # <a name="20-bugs" href="#20-bugs">#</a> Bug fixes ## <a name="20-auto-num-shards" href="#20-auto-num-shards">#</a> Fix query correctness issue when historical has no segment timeline Druid 0.20.0 fixes a query correctness issue when a broker issues a query expecting a historical to have certain segments for a datasource, but the historical when queried does not actually have any segments for that datasource (e.g., they were all unloaded before the historical processed the query). Prior to 0.20.0, the query would return successfully but without the results from the segments that were missing in the manner described previously. In 0.20.0, queries will now fail in such situations. https://github.com/apache/druid/pull/10199 ## <a name="20-result-caching" href="#20-result-caching">#</a> Fix issue preventing result-level cache from being populated Druid 0.20.0 fixes an issue introduced in 0.19.0 (https://github.com/apache/druid/issues/10337) which can prevent query caches from being populated when result-level caching is enabled. https://github.com/apache/druid/pull/10341 ## <a name="20-variance-comparator" href="#20-variance-comparator">#</a> Fix for variance aggregator ordering The variance aggregator previously used an incorrect comparator that compared using an aggregator's internal `count` variable instead of the variance. https://github.com/apache/druid/pull/10340 ## <a name="20-limitspec-cache" href="#20-limitspec-cache">#</a> Fix incorrect caching for groupBy queries with limit specs Druid 0.20.0 fixes an issues with groupBy queries and caching, where the limitSpec of the query was not considered in the cache key, leading to potentially incorrect results if queries that are identical except for the limitSpec are issued. https://github.com/apache/druid/pull/10093 # <a name="20-upgrading-from-previous" href="#20-upgrading-from-previous">#</a> Upgrading to Druid 0.20.0 Please be aware of the following considerations when upgrading from 0.19.0 to 0.20.0. If you're updating from an earlier version than 0.19.0, please see the release notes of the relevant intermediate versions. ## <a name="20-default-max-size" href="#20-default-max-size">#</a> Default `maxSize` `druid.server.maxSize` will now default to the sum of `maxSize` values defined within the `druid.segmentCache.locations`. The user can still provide a custom value for `druid.server.maxSize` which will take precedence over the default value. https://github.com/apache/druid/pull/10255 ## <a name="20-id-name-change" href="#20-id-name-change">#</a> Compaction and kill task ID changes Compaction and kill tasks issued by the coordinator will now have their task IDs prefixed by `coordinator-issued`, while user-issued kill tasks will be prefixed by `api-issued`. https://github.com/apache/druid/pull/10278 ## <a name="20-new-size-limit-split" href="#20-new-size-limit-split">#</a> New size limits for parallel ingestion split hint specs The size-based and segment-based `splitHintSpec` for parallel batch ingestion now apply a default file/segment limit of 1000 per subtask, controlled by the `maxNumFiles` and `maxNumSegments` respectively. https://github.com/apache/druid/pull/10243 ## <a name="20-new-agg-methods" href="#20-new-agg-methods">#</a> New `PostAggregator` and `AggregatorFactory` methods Users who have developed an extension with custom `PostAggregator` or `AggregatorFactory` implementions will need to update their extensions, as these two interfaces have new methods defined in 0.20.0. `PostAggregator` now has a new method: ``` ValueType getType(); ``` To support type information on `PostAggregator`, `AggregatorFactory` also has 2 new methods: ``` public abstract ValueType getType(); public abstract ValueType getFinalizedType(); ``` ## <a name="20-new-expr-methods" href="#20-new-expr-methods">#</a> New `Expr`-related methods Users who have developed an extension with custom `Expr` implementions will need to update their extensions, as `Expr` and related interfaces hae changed in 0.20.0. Please see the PR below for details: https://github.com/apache/druid/pull/10401 Please see https://github.com/apache/druid/pull/9638 for more details on the interface changes. ## <a name="20-sequence-time" href="#20-sequence-time">#</a> More accurate `query/cpu/time` metric In 0.20.0, the accuracy of the `query/cpu/time` metric has been improved. Previously, it did not account for certain portions of work during query processing, described in more detail in the following PR: https://github.com/apache/druid/pull/10377 ## <a name="20-audit-log-cols" href="#20-audit-log-cols">#</a> New audit log service metric columns If you are using audit logging, please be aware that new columns have been added to the audit log service metric (`comment`, `remote_address`, and `created_date`). An optional `payload` column has also been added, which can be enabled by setting `druid.audit.manager.includePayloadAsDimensionInMetric` to `true`. https://github.com/apache/druid/pull/10373 ## <a name="20-request-log-sql-context" href="#20-request-log-sql-context">#</a> `sqlQueryContext` in request logs If you are using query request logging, the request log events will now include the `sqlQueryContext` for SQL queries. https://github.com/apache/druid/pull/10368 ## <a name="20-last-compaction-state" href="#20-last-compaction-state">#</a> Additional per-segment state in metadata store Hash-partitioned segments created by Druid 0.20.0 will now have additional `partitionFunction` data in the metadata store. Additionally, compaction tasks will now store additional per-segment information in the metadata store, used for tracking compaction history. https://github.com/apache/druid/pull/10288 https://github.com/apache/druid/pull/10413 # <a name="20-credits" href="#20-credits">#</a> Credits Thanks to everyone who contributed to this release! @a2l007 @abhishekagarwal87 @abhishekrb19 @ArvinZheng @belugabehr @capistrant @ccaominh @clintropolis @code-crusher @dylwylie @fermelone @FrankChen021 @gianm @himanshug @jihoonson @jon-wei @joykent99 @kroeders @lightghli @mans2singh @maytasm @medb @mghosh4 @nishantmonu51 @pan3793 @richardstartin @sthetland @suneet-s @tarunparackal @tdt17 @tourvi @vogievetsky @wjhypo @xiangqiao123 @xvrl --- TBD: are there any breaking changes from https://github.com/apache/druid/pull/10203 https://github.com/apache/druid/pull/9810 https://github.com/apache/druid/pull/10307 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
