sascha-coenen opened a new issue #9321: Performance degradation in topN queries 
when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321
 
 
   ### Affected Version
   0.16.0
   
   ### Description
   
   Given a Druid v0.16.0 cluster configured with SQL-compatible null handling
   enabled, the initial performance we measured was unremarkable, but after a
   while there would be a drastic performance degradation for topN queries.
   
   After much testing we found that a freshly started Druid cluster would be
   consistently fast UNTIL a groupbyV2 query is executed for the first time.
   After that, the performance of topN queries would degrade by 70% or more.
   This degradation is specific to topN queries and also seems to apply
   only to heavy topN queries (8 aggregations, several sequential passes).
   
   We looked at every operational metric we have but could not find a root
   cause for the degradation.
   The degradation does not fade out with time. A forced full garbage
   collection does not recover any performance either.
   Furthermore, executing a single groupbyV2 query, any groupbyV2 query at all,
   seems to be enough to trigger the degradation.
   
   We have a performance test suite and a metrics dashboard.
   In the before/after view of the screenshots from the perf test suite below,
   you can see the degradation in topN queries after the execution of the first
   groupbyV2 query.
   
   Furthermore, the dashboard shows a different test we performed to illustrate
   the performance degradation: we sent a sequential stream of topN queries to
   a freshly started Druid cluster for a long time. Then we issued a single
   groupbyV2 query while the stream of topN queries continued. One can clearly
   see that performance degrades immediately at that point and that it is
   constant both before and after it.
   The dashboard shows the segment-scan-time metrics to illustrate that the
   degradation happens on the historicals, in the form of slower segment scans.
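
   For anyone who wants to try reproducing this, the test described above could
   be sketched roughly as follows. This is not our actual test suite. The
   broker URL, the datasource, and the column names (`dim_a`, `m0`..`m7`) are
   placeholders, and the aggregator types are an assumption; only the overall
   shape (topN stream, one groupBy with the v2 strategy, topN stream again) is
   taken from the description.

```python
import json
import time
import urllib.request

# Assumption: a broker reachable on the default port of a local setup.
BROKER_URL = "http://localhost:8082/druid/v2/"


def topn_query(datasource, interval):
    """Build a native topN query with 8 aggregations (hypothetical columns)."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimension": "dim_a",
        "metric": "sum_m0",
        "threshold": 100,
        "aggregations": [
            {"type": "longSum", "name": f"sum_m{i}", "fieldName": f"m{i}"}
            for i in range(8)
        ],
    }


def groupby_v2_query(datasource, interval):
    """Build a minimal native groupBy query pinned to the v2 strategy."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimensions": ["dim_a"],
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"groupByStrategy": "v2"},
    }


def post_query(query, url=BROKER_URL):
    """POST a native query as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def timed(send, query):
    """Return the wall-clock seconds that send(query) takes."""
    start = time.perf_counter()
    send(query)
    return time.perf_counter() - start


def run_experiment(send, datasource, interval, runs_before=50, runs_after=50):
    """Time topN queries before and after one groupBy v2 query."""
    topn = topn_query(datasource, interval)
    before = [timed(send, topn) for _ in range(runs_before)]
    send(groupby_v2_query(datasource, interval))  # the suspected trigger
    after = [timed(send, topn) for _ in range(runs_after)]
    return before, after
```

   Calling `run_experiment(post_query, "<datasource>", "<interval>")` against a
   freshly restarted cluster and comparing the two latency lists should show
   whether the step change appears. The `send` parameter is injectable so the
   sequencing can be checked without a cluster.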
   
   In an attempt to home in on root causes, we ran further tests with
   subsystems of Druid disabled:
   * disabling metric emission
   * disabling log emission
   * disabling all caches
   However, in all these cases the performance degradation remained.
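
   As a side note on the cache item: caches can also be bypassed per query
   through the query context, which may be easier for a quick check than
   changing the cluster configuration. The helper below is only a sketch of
   that alternative and not necessarily how the caches were disabled in the
   tests above; the helper name is made up, the four context keys are the ones
   Druid documents.

```python
import copy

# Query-context flags documented by Druid for bypassing its caches.
NO_CACHE_CONTEXT = {
    "useCache": False,
    "populateCache": False,
    "useResultLevelCache": False,
    "populateResultLevelCache": False,
}


def without_caches(query):
    """Return a copy of a native query dict with all cache flags turned off."""
    uncached = copy.deepcopy(query)  # leave the caller's query untouched
    uncached.setdefault("context", {}).update(NO_CACHE_CONTEXT)
    return uncached
```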
   
   As we keep sending the same query many times, we can also rule out effects 
caused by disk access because the segments needed for serving the query would 
be paged into memory.
   
   Then we turned off the SQL-compatible null handling and the performance 
issue was gone.
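
   For reference, the setting in question is the runtime property below. The
   file name is an assumption about a typical setup. The property applies
   process-wide, so it has to be set consistently on all services and requires
   a restart to take effect.

```properties
# common.runtime.properties (assumed location)
# false = SQL-compatible null handling (the mode in which we see the issue)
# true  = legacy default-value behavior (the default in 0.16.0)
druid.generic.useDefaultValueForNull=true
```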
   
   <img width="970" alt="perf-suite-before-after" src="https://user-images.githubusercontent.com/1635350/73939413-af121880-48e9-11ea-8223-5789d7582112.png">
   
   <img width="1020" alt="performance-issue-topn-groupby-sqlcompatiblenullhandling" src="https://user-images.githubusercontent.com/1635350/73939765-70c92900-48ea-11ea-963b-748438c1374a.png">
   
   We haven't yet tested whether the issue still occurs with Druid 0.17.0
   because it will take us a while to upgrade, but I wanted to report the
   issue as early as possible. We have no idea what could be causing this.
