sascha-coenen opened a new issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321

### Affected Version

0.16.0

### Description

On a Druid v0.16.0 cluster configured with SQL-compatible null handling enabled, the initial performance we measured was unremarkable, but after a while the performance of topN queries would degrade drastically.

After much testing we found that a freshly started Druid cluster is consistently fast UNTIL a groupbyV2 query gets executed for the first time. After that, the performance of topN queries degrades by 70% or more. This degradation is specific to topN queries and also seems to apply only to heavy topN queries (8 aggregations, several sequential passes).

We looked at every operational metric we have but could not find a root cause for the degradation. The degradation does not fade out with time, and a forced full garbage collection does not recover any performance. Furthermore, executing a single groupbyV2 query, any groupbyV2 query, seems to be enough to trigger the degradation.

We have a performance test suite and a metrics dashboard. In the screenshots from the perf test suite below, the before/after view shows the degradation in topN queries after the execution of the first groupbyV2 query. The dashboard shows a different test we performed to illustrate the degradation: we sent a sequential stream of topN queries to a freshly started Druid cluster for a long time, then issued a single groupbyV2 query while the stream of topN queries continued. One can clearly see that performance degrades immediately at that point and is constant both before and after it. The dashboard shows the segment-scan-time metrics to illustrate that the degradation happens on the historicals.

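
The dashboard test described above can be approximated with a small script. This is only a minimal sketch, not the test suite used for the screenshots: the broker URL, datasource, interval, and column names are hypothetical placeholders. It also assumes that SQL-compatible null handling was enabled through the `druid.generic.useDefaultValueForNull=false` runtime property. Unlike the original test, where the topN stream kept running while the groupbyV2 query was issued, this sketch runs everything sequentially.

```python
import json
import time
import urllib.request
from statistics import median

# Hypothetical placeholders: adjust to your own cluster and schema.
BROKER_URL = "http://localhost:8082/druid/v2"
DATASOURCE = "example_datasource"
INTERVAL = "2020-01-01/2020-01-02"


def build_topn_query(num_aggs=8):
    """Native topN query with several aggregations (column names are made up)."""
    aggs = [
        {"type": "longSum", "name": f"sum_{i}", "fieldName": f"metric_{i}"}
        for i in range(num_aggs)
    ]
    return {
        "queryType": "topN",
        "dataSource": DATASOURCE,
        "intervals": [INTERVAL],
        "granularity": "all",
        "dimension": "example_dim",
        "metric": "sum_0",
        "threshold": 100,
        "aggregations": aggs,
    }


def build_groupby_query():
    """Any groupBy query using the v2 strategy; this one just counts rows."""
    return {
        "queryType": "groupBy",
        "dataSource": DATASOURCE,
        "intervals": [INTERVAL],
        "granularity": "all",
        "dimensions": ["example_dim"],
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"groupByStrategy": "v2"},
    }


def post_query(query, url=BROKER_URL):
    """POST a native query as JSON and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def timed(send, query):
    """Return the wall-clock seconds taken by one call to send(query)."""
    start = time.perf_counter()
    send(query)
    return time.perf_counter() - start


def run_experiment(send=post_query, runs=50):
    """Time topN queries before and after a single groupBy query."""
    topn = build_topn_query()
    before = [timed(send, topn) for _ in range(runs)]
    send(build_groupby_query())  # the single triggering query
    after = [timed(send, topn) for _ in range(runs)]
    return median(before), median(after)
```

Calling `run_experiment()` against a freshly started cluster returns the median topN latency before and after the groupBy query, so the two numbers can be compared directly. The `send` parameter is injectable so the flow can be exercised without a running cluster.
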
In an attempt to home in on root causes, we ran further tests with subsystems of Druid disabled:

* disabling metric emission
* disabling log emission
* disabling all caches

However, in all these cases the performance degradation remained. As we keep sending the same query many times, we can also rule out effects caused by disk access, because the segments needed for serving the query would already be paged into memory.

Then we turned off SQL-compatible null handling and the performance issue was gone.

<img width="970" alt="perf-suite-before-after" src="https://user-images.githubusercontent.com/1635350/73939413-af121880-48e9-11ea-8223-5789d7582112.png">

<img width="1020" alt="performance-issue-topn-groupby-sqlcompatiblenullhandling" src="https://user-images.githubusercontent.com/1635350/73939765-70c92900-48ea-11ea-963b-748438c1374a.png">

We haven't yet tested whether the issue remains with Druid 0.17.0 because it will take us a while to upgrade, but I wanted to report the issue as early as possible. We have no idea what could be causing this.
