Hi Druid Devs,

I wanted to bring the community's attention to a few PRs that are awaiting review and that I believe are worthwhile features and fixes to have in OSS.

The first PR adds a new round-robin strategy for loading segments: https://github.com/apache/druid/pull/9603/

This adds a new strategy that the Druid coordinator can use when deciding which segment to load next. The current (and only) strategy is to prefer loading newer segments first. For data ingested through a streaming indexing service, this makes sense: loading the newest segments onto the historicals expedites segment handoff and takes pressure off the middle manager nodes. It also makes sense for batch ingestion, since chances are users want to be able to query the newest data first. However, this approach causes pain in certain cases. For example, if two datasources are being ingested and one has newer data than the other, the segments of the second datasource may not get loaded for a long time. To make things "fair", the strategy added in the PR instead picks segments by cycling through the datasources in a round-robin fashion, while still making sure that within each datasource the newer segments are loaded first. We have been running our clusters with this strategy for a while now, and it has served our large (order of a few TBs) ingestion use cases quite well. (A toy sketch of the selection idea is in the P.S. below.)

The second PR is for handling unknown complex types: https://github.com/apache/druid/pull/9422

Recently, while upgrading our cluster, we ran into an issue where Druid SQL broke because an incompatible change was made in an aggregator extension. While we obviously shouldn't be making incompatible changes, it doesn't hurt to guard against them (especially for folks building in-house Druid extensions), and in particular to keep them from taking down major functionality like Druid SQL.

The third PR I actually raised today, but it is worth bringing to the community's attention as I believe it addresses a long-standing issue: https://github.com/apache/druid/pull/9877

Internally, and I would be surprised if it isn't common out there, we have lots of Hive Parquet tables whose timestamp column is of type int, storing the time in the format yyyyMMdd. To ingest such a column as the Druid timestamp, one would expect that specifying a date-time format like "yyyyMMdd" would suffice. Unfortunately, when the timestamp parser in Druid sees that the column is numeric, it ignores the format and instead interprets the value as a timestamp in milliseconds. So 20200521 in yyyyMMdd format ends up being interpreted as 20200521 milliseconds since the epoch, which corresponds to the incorrect datetime "Thu Jan 01 1970 05:36:40". (A snippet illustrating this is also in the P.S. below.)

Thanks,
Samarth
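
P.S. For anyone curious, here is a toy Java sketch of the round-robin selection idea from the first PR. It only illustrates the fairness behavior, not the actual code in #9603; the method name and the assumption that each datasource's segments arrive pre-sorted newest first are mine.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Map;
    import java.util.Queue;

    public class RoundRobinSketch
    {
      // Given one queue per datasource, each ordered newest segment first,
      // emit segments by cycling through the datasources one at a time.
      // A datasource leaves the rotation once its queue is drained.
      static <T> List<T> roundRobinOrder(Map<String, Queue<T>> newestFirstPerDatasource)
      {
        List<T> loadOrder = new ArrayList<>();
        Deque<Queue<T>> cycle = new ArrayDeque<>(newestFirstPerDatasource.values());
        while (!cycle.isEmpty()) {
          Queue<T> segments = cycle.pollFirst();
          T next = segments.poll();   // newest remaining segment of this datasource
          if (next != null) {
            loadOrder.add(next);
          }
          if (!segments.isEmpty()) {
            cycle.addLast(segments);  // keep this datasource in the rotation
          }
        }
        return loadOrder;
      }
    }

With two datasources where one has strictly newer segments than the other, this interleaves them (A1, B1, A2, B2, ...) instead of draining the newer datasource completely before the other one gets anything loaded.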
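
And a quick Joda-Time illustration of the timestamp issue from the third PR. Druid uses Joda-Time internally, but this snippet only demonstrates the arithmetic; it is not the parser code:

    import org.joda.time.DateTime;
    import org.joda.time.DateTimeZone;
    import org.joda.time.format.DateTimeFormat;

    public class TimestampExample
    {
      public static void main(String[] args)
      {
        // What the parser currently does for a numeric column:
        // treat the value as epoch milliseconds.
        System.out.println(new DateTime(20200521L, DateTimeZone.UTC));
        // => 1970-01-01T05:36:40.521Z

        // What the user asked for by specifying the format "yyyyMMdd":
        System.out.println(
            DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC().parseDateTime("20200521")
        );
        // => 2020-05-21T00:00:00.000Z
      }
    }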
Add new round robin strategy for loading segments: https://github.com/apache/druid/pull/9603/ <https://github.com/apache/druid/pull/9603/files> This PR adds a new strategy that Druid coordinator can use when determining what segment to load next. The current and the only strategy is to prefer loading the newer segments first. For data being ingested using a streaming indexing service, it makes sense to prefer loading the newer segments on the historicals as it alleviates pressure off the middle manager nodes by expediting the segment handoff process. In case of batch ingestion also, it makes sense to prefer loading newer segments first since chances are users want to be able to query newer data first. However, there are certain cases where such an approach causes pain. For example - if two different datasources are ingested with one having newer data compared to the other one, it is possible that the segments of the second datasource one may not get loaded for a long time. To make things "fair" the approach added in the PR instead picks segments by selecting datasources in a round robin fashion. For each datasource though, the strategy does make sure that the newer segments are loaded first. We have been running clusters with this strategy in our clusters for a while now and it has helped our large (order of a few TBs) ingest use cases quite well. The second PR is for handling unknown complex types: https://github.com/apache/druid/pull/9422 Recently, while upgrading our cluster, we ran into an issue where the Druid SQL functionality broke because an incompatible change was made in an aggregator extension. While we obviously shouldn't be making any incompatible changes, it doesn't hurt to guard against it (especially for folks building in-house Druid extensions) and especially preventing it from a major functionality like Druid SQL in this case. The third PR I actually raised today. But would be good to bring to community's attention as I believe it addresses a long standing issue. https://github.com/apache/druid/pull/9877 Internally, and I would be surprised if it isn't common out there, we have lots of hive parquet tables that have the timestamp column of type int storing the time in the format yyyyMMdd. To ingest such a column as Druid timestamp, one would expect that specifying a date time format like "yyyyMMdd" would suffice. Unfortunately, the timestamp parser in Druid ignores the format when it sees that column is numeric and instead interprets it as timestamp in millis. So 20200521 in yyyyMMdd format ends up being interpreted as 20200521 milliseconds which corresponds to the incorrect datetime value of "Thu Jan 01 1970 05:36:40". Thanks, Samarth