Hi Druid Devs,

I wanted to bring the community's attention to a few PRs that are awaiting
review and that I believe are worthwhile features and fixes to have in OSS.

Add new round robin strategy for loading segments:
https://github.com/apache/druid/pull/9603/

This PR adds a new strategy the Druid coordinator can use when deciding
which segment to load next. The current, and only, strategy is to prefer
loading newer segments first. For data ingested via a streaming indexing
service, loading newer segments onto the historicals first makes sense:
it takes pressure off the middle manager nodes by expediting the segment
handoff process. For batch ingestion it also makes sense, since chances
are users want to be able to query the newest data first. However, there
are certain cases where this approach causes pain. For example, if two
datasources are being ingested and one has newer data than the other,
the segments of the older datasource may not get loaded for a long time.
To make things "fair", the strategy added in this PR instead picks
segments by cycling through datasources in a round robin fashion; within
each datasource, it still makes sure the newer segments are loaded
first. We have been running this strategy in our clusters for a while
now, and it has served our large (order of a few TBs) ingest use cases
quite well.
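
To illustrate the idea, here is a minimal, self-contained sketch (a
hypothetical class, not the PR's actual code): cycle through datasources
round robin, while each datasource's own queue stays ordered
newest-first.

  import java.util.ArrayDeque;
  import java.util.Comparator;
  import java.util.Deque;
  import java.util.Map;
  import java.util.PriorityQueue;
  import java.util.TreeMap;

  public class RoundRobinSegmentPicker
  {
    // Per-datasource queues of segment interval starts, newest first.
    private final Map<String, PriorityQueue<Long>> queues = new TreeMap<>();
    // Rotation order of datasources that still have segments queued.
    private final Deque<String> rotation = new ArrayDeque<>();

    public void add(String dataSource, long intervalStartMillis)
    {
      queues.computeIfAbsent(dataSource, ds -> {
        rotation.addLast(ds);
        return new PriorityQueue<>(Comparator.reverseOrder());
      }).add(intervalStartMillis);
    }

    // Returns "dataSource/intervalStart" for the next segment, or null.
    public String next()
    {
      while (!rotation.isEmpty()) {
        String ds = rotation.pollFirst();
        PriorityQueue<Long> q = queues.get(ds);
        if (q == null || q.isEmpty()) {
          queues.remove(ds);   // drained; drop it from the rotation
          continue;
        }
        long start = q.poll();
        rotation.addLast(ds);  // to the back, so other datasources get a turn
        return ds + "/" + start;
      }
      return null;
    }

    public static void main(String[] args)
    {
      RoundRobinSegmentPicker picker = new RoundRobinSegmentPicker();
      picker.add("newer_ds", 900L);
      picker.add("newer_ds", 800L);
      picker.add("older_ds", 200L);
      picker.add("older_ds", 100L);
      // Alternates datasources; within each, newest segments come first:
      // newer_ds/900, older_ds/200, newer_ds/800, older_ds/100
      for (String s; (s = picker.next()) != null; ) {
        System.out.println(s);
      }
    }
  }

The point of the design is that no datasource can starve another: each
gets one pick per round, and the existing newest-first preference is
preserved within a datasource.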

The second PR is for handling unknown complex types:
https://github.com/apache/druid/pull/9422

Recently, while upgrading our cluster, we ran into an issue where Druid
SQL broke because an incompatible change had been made in an aggregator
extension. While we obviously shouldn't be making incompatible changes
in the first place, it doesn't hurt to guard against them (especially
for folks building in-house Druid extensions), and in particular to keep
them from breaking major functionality like Druid SQL.
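
The defensive idea, as a rough sketch (hypothetical names, not Druid's
actual classes): when a segment advertises a complex type that isn't
registered on this node, degrade to an opaque type instead of throwing
and taking down SQL schema resolution for the whole datasource.

  import java.util.HashMap;
  import java.util.Map;

  public class ComplexTypeResolver
  {
    private static final Map<String, String> REGISTERED = new HashMap<>();
    static {
      REGISTERED.put("hyperUnique", "COMPLEX<hyperUnique>");
      REGISTERED.put("thetaSketch", "COMPLEX<thetaSketch>");
    }

    public static String sqlTypeFor(String complexTypeName)
    {
      String sqlType = REGISTERED.get(complexTypeName);
      if (sqlType == null) {
        // Unknown type, e.g. from a newer or in-house extension not
        // loaded here: fall back gracefully rather than failing the
        // whole schema.
        return "COMPLEX<unknown>";
      }
      return sqlType;
    }

    public static void main(String[] args)
    {
      System.out.println(sqlTypeFor("thetaSketch"));     // COMPLEX<thetaSketch>
      System.out.println(sqlTypeFor("myCustomAggType")); // COMPLEX<unknown>
    }
  }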

The third PR I raised just today, but it is worth bringing to the
community's attention as I believe it addresses a long-standing issue:
https://github.com/apache/druid/pull/9877
Internally (and I would be surprised if this isn't common elsewhere), we
have lots of Hive Parquet tables whose timestamp column is of type int,
storing the time in the format yyyyMMdd. To ingest such a column as the
Druid timestamp, one would expect that specifying a date time format
like "yyyyMMdd" would suffice. Unfortunately, the timestamp parser in
Druid ignores the format when it sees that the column is numeric, and
instead interprets the value as a timestamp in millis. So 20200521 in
yyyyMMdd format ends up being interpreted as 20200521 milliseconds,
which corresponds to the incorrect datetime of "Thu Jan 01 1970
05:36:40".
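
The symptom is easy to reproduce with plain Joda-Time (which Druid uses
internally; this is just a demo, not Druid's parser code). A numeric
20200521 treated as epoch millis lands in 1970, while honoring the
declared "yyyyMMdd" format yields the intended date:

  import org.joda.time.DateTime;
  import org.joda.time.DateTimeZone;
  import org.joda.time.format.DateTimeFormat;
  import org.joda.time.format.DateTimeFormatter;

  public class TimestampFormatDemo
  {
    public static void main(String[] args)
    {
      long numeric = 20200521L;

      // What happens today when the format is ignored for numeric input:
      System.out.println(new DateTime(numeric, DateTimeZone.UTC));
      // 1970-01-01T05:36:40.521Z

      // What the PR aims for: apply the declared format to the value:
      DateTimeFormatter fmt =
          DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();
      System.out.println(fmt.parseDateTime(Long.toString(numeric)));
      // 2020-05-21T00:00:00.000Z
    }
  }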

Thanks,
Samarth
