[GitHub] [druid] sascha-coenen opened a new issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

GitBox Thu, 06 Feb 2020 05:13:21 -0800

sascha-coenen opened a new issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321

### Affected Version
0.16.0

### Description

On a Druid v0.16.0 cluster with SQL-compatible null handling enabled, the initial performance we measured was normal. After a while, however, topN query performance degraded drastically.

After much testing, we found that a freshly started Druid cluster is consistently fast UNTIL a groupBy v2 query is executed for the first time.
After that, topN query performance degrades by 70% or more.
The degradation is specific to topN queries, and it also seems to affect only heavy topN queries (8 aggregations, several sequential passes).

We looked at every operational metric we have but could not find a root cause for the degradation.
The degradation does not fade over time, and a forced full garbage collection does not recover any performance.
Furthermore, executing a single groupBy v2 query, any groupBy v2 query at all, seems to be enough to trigger the degradation.

We have a performance test suite and a metrics dashboard.
The before/after screenshots from the performance test suite below show the degradation in topN queries after the first groupBy v2 query is executed.

The dashboard also shows a different test we ran to illustrate the degradation: we first sent a sequential stream of topN queries to a freshly started Druid cluster for a long time. Then we issued a single groupBy v2 query while the stream of topN queries continued. Performance clearly degrades immediately after the groupBy v2 query and stays at a constant level both before and after it.
The dashboard shows the segment scan time metrics to illustrate that the degradation happens on the historicals, in the form of increased segment scan times.

In an attempt to home in on the root cause, we ran further tests with individual Druid subsystems disabled:
* disabling metric emission
* disabling log emission
* disabling all caches
However, in all these cases the performance degradation remained.

Because we send the same query many times, we can also rule out effects caused by disk access: the segments needed to serve the query would already be paged into memory.

Then we turned off SQL-compatible null handling, and the performance issue was gone.
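For reference, SQL-compatible null handling in Druid is controlled by the `druid.generic.useDefaultValueForNull` runtime property, typically set in `common.runtime.properties`. The fragment below is a minimal sketch of the two configurations compared above, assuming the setting was toggled there; it is not a copy of our actual cluster configuration.

```properties
# SQL-compatible null handling enabled: the setting under which the topN slowdown appeared
druid.generic.useDefaultValueForNull=false

# Default-value (legacy) null handling: the setting under which the slowdown disappeared
# druid.generic.useDefaultValueForNull=true
```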

We have not yet tested whether the issue persists in Druid 0.17.0, because upgrading will take us a while, but I wanted to report the issue as early as possible. We have no idea what could be causing it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

