clintropolis opened a new issue #10139:
URL: https://github.com/apache/druid/issues/10139


   Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance 
enhancements, documentation improvements, and additional test coverage 
improvements from 47 contributors. Refer to the [complete list of 
changes](https://github.com/apache/druid/compare/0.18.1...0.19.0) and 
[everything tagged to the 
milestone](https://github.com/apache/druid/milestone/38) for further details.
   
   # <a name="19-new-features" href="#19-new-features">#</a> New Features
   
   ## <a name="19-vectorize-default" href="#19-vectorize-default">#</a> GroupBy 
and Timeseries vectorized query engines enabled by default
   Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but with this change any query that is eligible for vectorization will now use the vectorized engine by default. If you encounter any problems, this feature may still be disabled by setting `druid.query.vectorize` to `false`.
   
   https://github.com/apache/druid/pull/10065
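   
   For example, vectorization can be controlled per query with the `vectorize` query context parameter, which accepts `"false"`, `"true"`, and `"force"`. The datasource, interval, and aggregator below are illustrative:
   
   ```json
   {
     "queryType": "timeseries",
     "dataSource": "wikipedia",
     "intervals": ["2020-06-01/2020-06-02"],
     "granularity": "hour",
     "aggregations": [{ "type": "count", "name": "rows" }],
     "context": { "vectorize": "false" }
   }
   ```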
   
   
   ## <a name="19-avro" href="#19-avro">#</a> Druid native batch support for 
Apache Avro Object Container Files
   
   New in Druid 0.19.0, native batch indexing now supports [Apache Avro Object 
Container 
Format](https://avro.apache.org/docs/current/spec.html#Object+Container+Files) 
encoded files, allowing batch ingestion of Avro data without needing an 
external Hadoop cluster. Check out [the 
docs](https://druid.apache.org/docs/0.19.0/ingestion/data-formats.html#avro-ocf) for more details.
   
   https://github.com/apache/druid/pull/9671
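   
   As a rough sketch, the new `avro_ocf` input format plugs into a native batch `ioConfig` along these lines (the input source location is illustrative; see the linked docs for the full schema):
   
   ```json
   "ioConfig": {
     "type": "index_parallel",
     "inputSource": { "type": "local", "baseDir": "/data/avro", "filter": "*.avro" },
     "inputFormat": { "type": "avro_ocf" }
   }
   ```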
   
   
   ##  <a name="19-sql-input-source" href="#19-sql-input-source">#</a> Updated 
Druid native batch support for SQL databases
   A new 'SqlInputSource' has been added in Druid 0.19.0 to work with the native batch ingestion specifications first introduced in Druid 0.17, deprecating the [SqlFirehose](https://druid.apache.org/docs/0.19.0/ingestion/native-batch.html#sqlfirehose). Like the 'SqlFirehose', it currently supports MySQL and PostgreSQL, using the drivers from those extensions. This is a relatively low-level ingestion task, and the operator must take care to ensure that the correct data is ingested: either by crafting queries so that no duplicate data is ingested when appending, or by ensuring that the entire set of data to be replaced is queried when overwriting.
   
   https://github.com/apache/druid/pull/9449
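   
   A minimal sketch of the new input source, with an illustrative connection and query (consult the docs linked above for the full schema):
   
   ```json
   "inputSource": {
     "type": "sql",
     "database": {
       "type": "mysql",
       "connectorConfig": {
         "connectURI": "jdbc:mysql://some-host:3306/druid",
         "user": "admin",
         "password": "secret"
       }
     },
     "sqls": ["SELECT timestamp, page, added FROM wikipedia WHERE timestamp >= '2020-06-01 00:00:00'"]
   }
   ```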
   
   ## <a name="19-ranger" href="#19-ranger">#</a> Apache Ranger based 
authorization 
   A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by [Apache Ranger](https://ranger.apache.org/). Please see [the extension documentation](https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and [Authentication and Authorization](https://druid.apache.org/docs/0.19.0/design/auth.html) for more information on the basic facilities this extension provides.
   
   https://github.com/apache/druid/pull/9579
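   
   At a minimum, enabling the authorizer looks something like the following runtime properties (the extension must also be loaded, and Ranger client configuration files placed on the classpath; see the extension documentation for the complete setup):
   
   ```
   druid.auth.authorizers=["ranger"]
   druid.auth.authorizer.ranger.type=ranger
   ```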
   
   ## <a name="19-aliyun-oss" href="#19-aliyun-oss">#</a> Alibaba Object 
Storage Service support
   A new 'contrib' extension has been added for [Alibaba Cloud Object Storage Service (OSS)](https://www.alibabacloud.com/product/oss) to provide both deep storage and usage as a batch ingestion input source. Since this is a 'contrib' extension, it will not be packaged by default in the binary distribution; please see [community extensions](https://druid.apache.org/docs/0.19.0/development/extensions.html#community-extensions) for more details on how to use it in your cluster.
   
   https://github.com/apache/druid/pull/9898
   
   ## <a name="19-gce-autoscale" href="#19-gce-autoscale">#</a> Overlord 
autoscaling for Google Compute Engine
   Another new 'contrib' extension in 0.19.0 supports Overlord autoscaling on Google Compute Engine. Unlike the Amazon Web Services Overlord autoscaling, which provisions and terminates instances directly, the GCE autoscaler uses [Managed Instance Groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups) to more closely align with how operators are likely to provision their clusters. Like other 'contrib' extensions, it will not be packaged by default in the binary distribution; please see [community extensions](https://druid.apache.org/docs/0.19.0/development/extensions.html#community-extensions) for more details on how to use it in your cluster.
   
   https://github.com/apache/druid/pull/8987
   
   ## <a name="19-regexp-like" href="#19-regexp-like">#</a> REGEXP_LIKE
   A new `REGEXP_LIKE` function has been added to Druid SQL and native expressions. It behaves similarly to `LIKE`, but uses a regular expression for the pattern.
   
   https://github.com/apache/druid/pull/9893
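   
   For example, in Druid SQL (the datasource and column here are illustrative):
   
   ```sql
   SELECT COUNT(*)
   FROM wikipedia
   WHERE REGEXP_LIKE("page", '^Apache.*')
   ```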
   
   ## <a name="19-web-lookup" href="#19-web-lookup">#</a> Web console lookup 
management improvements
   todo
   
   ## <a name="19-datasource-loadstatus" href="#19-datasource-loadstatus">#</a> 
New Coordinator per datasource 'loadstatus' API 
   A new Coordinator API makes it easier to determine if the latest published segments are available for querying. It is similar to the existing Coordinator 'loadstatus' API, but is datasource-specific, may specify an interval, and can optionally refresh the metadata store snapshot to get the most up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially when forcing a metadata refresh, as it can be a 'heavy' call on large clusters.
   
   https://github.com/apache/druid/pull/9965
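   
   As a sketch of usage, assuming a Coordinator on its default port and an illustrative datasource and URL-encoded interval:
   
   ```
   curl "http://coordinator:8081/druid/coordinator/v1/datasources/wikipedia/loadstatus?forceMetadataRefresh=false&interval=2020-06-01%2F2020-06-02"
   ```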
   
   
   
   ## <a name="19-batch-append" href="#19-batch-append">#</a> Native batch 
append support for range and hash partitioning
   Part bug fix, part new feature, Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with the 'hash' or 'range' partitioning algorithms. Note that the appended segments currently only support 'dynamic' partitioning, and that these appended segments will not be recognized by Druid after a downgrade to an older version. In order to roll back to a previous version, these appended segments should be compacted with the rest of the time chunk so that the time chunk has a homogeneous partitioning scheme.
   
   https://github.com/apache/druid/pull/10033
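   
   If such a compaction is needed before a downgrade, it can be issued with a compaction task; a minimal sketch for an illustrative datasource and time chunk:
   
   ```json
   {
     "type": "compact",
     "dataSource": "wikipedia",
     "interval": "2020-06-01/2020-06-02"
   }
   ```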
   
   
   
   # <a name="19-bugs" href="#19-bugs">#</a> Bug fixes
   Druid 0.19.0 contains 65 bug fixes, you can see [the complete list 
here](https://github.com/apache/druid/pulls?q=is%3Apr+milestone%3A0.19.0+is%3Aclosed+label%3ABug).
   
   ## <a name="19-partition-non-atomic" href="#19-partition-non-atomic">#</a> 
Fix for batch ingested 'dynamic' partitioned segments not becoming queryable 
atomically
   Druid 0.19.0 fixes an important query correctness issue where 'dynamic' partitioned segments produced by a batch ingestion task did not track the overall number of partitions. As a result, when these segments came online they did not do so as a complete set, but rather as individual segments, meaning there could be periods during version swaps where results were queried from mixed sets of segment versions within a time chunk.
    
   https://github.com/apache/druid/pull/10025
   
   ## <a name="19-partition-empty-buckets" 
href="#19-partition-empty-buckets">#</a> Fix to allow 'hash' and 'range' 
partitioned segments with empty buckets to now be queryable
   Prior to 0.19.0, Druid had a bug when using 'hash' or 'range' partitioning where, if data was skewed such that any of the partition buckets were empty after ingestion, the partitions would never be recognized as 'complete' and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments unqueryable.
   
   https://github.com/apache/druid/pull/10012
   
   
   ## <a name="19-bad-balancer" href="#19-bad-balancer">#</a> Incorrect 
balancer behavior
   A bug in Druid versions prior to 0.19.0 allowed for incorrect Coordinator operation when `druid.server.maxSize` was not set. On historicals missing this value, the bug would still allow segments to load, but effectively balance them randomly within the cluster, regardless of which balancer strategy was actually configured. This bug has been fixed, but as a result `druid.server.maxSize` must now be set to the sum of the segment cache location sizes on historicals, or they will not load segments.
   
   # <a name="19-upgrading-from-previous" 
href="#19-upgrading-from-previous">#</a> Upgrading to Druid 0.19.0
   Please be aware of the following issues when upgrading from 0.18.1 to 
0.19.0. If you're updating from an earlier version than 0.18.1, please see the 
release notes of the relevant intermediate versions.
   
   ## <a name="19-server-size" href="#19-server-size">#</a> 
'druid.server.maxSize' must now be set for Historical servers
   [A Coordinator bug fix](#19-bad-balancer) now, as a side effect, requires `druid.server.maxSize` to be set for segments to be loaded. While this value should already have been set correctly in previous versions, please be sure it is configured correctly before upgrading your clusters, or else segments will not be loaded.
   
   https://github.com/apache/druid/pull/10070
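   
   For example, a Historical with two segment cache locations might set the following (paths and sizes illustrative), with `druid.server.maxSize` equal to the sum of the location sizes:
   
   ```
   druid.segmentCache.locations=[{"path":"/mnt/disk1/segment-cache","maxSize":150000000000},{"path":"/mnt/disk2/segment-cache","maxSize":150000000000}]
   druid.server.maxSize=300000000000
   ```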
   
   ## <a name="19-sys-segments-payload" href="#19-sys-segments-payload">#</a> 
System tables 'sys.segments' column 'payload' has been removed and replaced 
with 'dimensions', 'metrics', and 'shardSpec'
   The removal of the 'payload' column from the `sys.segments` table should make queries on this table much more efficient. The most useful fields from the payload, the list of 'dimensions', 'metrics', and the 'shardSpec', have been split out into their own columns and so are still available.
   
   https://github.com/apache/druid/pull/9883
   
   
   ## <a name="19-segment-load-threads" href="#19-segment-load-threads">#</a> 
Changed default number of segment loading threads
   The default value of the `druid.segmentCache.numLoadingThreads` configuration has been changed from 'number of cores' to 'number of cores divided by 6'. This should make historicals better behaved out of the box when loading a large number of segments, limiting the impact on query performance.
   
   https://github.com/apache/druid/pull/9856
   
   
   ## <a name="19-broadcast-colocated" href="#19-broadcast-colocated">#</a> Broadcast load rules no longer have 'colocated datasources'
   Druid 0.19.0 includes a number of preliminary changes to facilitate more efficient join queries, based on the idea of using broadcast load rules to propagate smaller datasources throughout the cluster so that join operations can be pushed down to individual segment processing. While this is not yet a finished feature, as part of these changes 'broadcast' load rules no longer have the concept of 'colocated datasources', which attempted to broadcast segments only to servers that had segments of the configured datasource. This did not work well in practice, as it was non-atomic, meaning that the broadcast segments would lag behind loads and drops of the colocated datasource, so we decided to remove it.
   
   https://github.com/apache/druid/pull/9971
   
   
   ## <a name="19-broadcast-broker-load" href="#19-broadcast-broker-load">#</a> 
Brokers and realtime tasks may now be configured to load segments from 
'broadcast' datasources
   As another effect of the aforementioned preliminary work to introduce efficient 'broadcast joins', Brokers and realtime indexing tasks will now load segments loaded by 'broadcast' rules if a segment cache is configured. Since the feature is not complete, there is little reason to do this in 0.19.0, and it will not happen unless explicitly configured.
   
   https://github.com/apache/druid/pull/9971
   
   
   ## <a name="19-lpad-rpad" href="#19-lpad-rpad">#</a> lpad and rpad function 
behavior change
   The `lpad` and `rpad` functions have gone through a slight behavior change in Druid's default non-SQL-compatible mode, in order to make them behave consistently with PostgreSQL. With the new behavior, if the pad expression is an empty string, the result will be the (possibly trimmed) original characters, rather than the empty string being treated as a null and coercing the result to null.
   
   https://github.com/apache/druid/pull/10006
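   
   As an illustration of the new behavior, based on the description above:
   
   ```sql
   SELECT
     LPAD('druid', 8, 'x'),  -- 'xxxdruid'
     LPAD('druid', 8, ''),   -- 'druid' (previously null)
     LPAD('druid', 3, '')    -- 'dru'   (previously null)
   ```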
   
   
   
   
   
   # <a name="19-known-issues" href="#19-known-issues">#</a> Known Issues
   For a full list of open issues, please see 
https://github.com/apache/druid/labels/Bug.
   
   
   # <a name="19-credits" href="#19-credits">#</a> Credits
   
   Thanks to everyone who contributed to this release!
   
   @2bethere
   @a-chumagin
   @a2l007
   @abhishekrb19
   @agricenko
   @alex-plekhanov
   @AlexanderSaydakov
   @awelsh93
   @bolkedebruin
   @calvinhkf
   @capistrant
   @ccaominh
   @chenyuzhi459
   @clintropolis
   @danc
   @dylwylie
   @FrankChen021
   @frnidito
   @gianm
   @harshpreet93
   @jihoonson
   @jon-wei
   @josephglanville
   @kamaci
   @kanibs
   @leerho
   @liujianhuanzz
   @maytasm
   @mcbrewster
   @mghosh4
   @morrifeldman
   @pjain1
   @samarthjain
   @stefanbirkner
   @sthetland
   @suneet-s
   @surekhasaharan
   @tarpdalton
   @viongpanzi
   @vogievetsky
   @willsalz
   @wjhypo
   @xhl0726
   @xiangqiao123
   @xvrl
   @yuanlihan
   @zachjsh
   

