sascha-coenen opened a new issue #9414: SQL: severe performance degradation if 
multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414
 
 
   ### Affected Version
   0.16.0 and 0.17.0
   
   ### Description
   We often use the combination of NULLIF() and LOOKUP() functions in 
combination. This works fine for normal dimension columns, but when using a 
multi-valued dimension, then the following combination of NULLIF and LOOKUP 
leads to extremely high segment scan times of 30 seconds or even up to several 
minutes:
   
   ```
   SELECT
      NULLIF(LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid'), 'fallback') AS 
adFormat,
      SUM("count") AS cnt
   FROM "supply-activities" 
   WHERE  (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp 
'2020-01-20 01:00:00')
   GROUP BY 1
   ORDER BY 2
   LIMIT 10
   ```
   
   In contrast, the following query which is merely dropping the surrounding 
NULLIF() function is executing at normal speed:
   
   ```
   SELECT
      LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid') AS adFormat,
      SUM("count") AS cnt
   FROM "supply-activities" 
   WHERE  (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp 
'2020-01-20 01:00:00')
   GROUP BY 1
   ORDER BY 2
   ```
   
   Using NULLIF in combination with LOOKUP on normal dinensions does NOT lead 
to a performance issue.
   Prior to Druid 0.16.0 we did not notice any performance issue but cannot 
guarentee that it was introduced with 0.16 because we might have simply 
overlooked it in prior releases.
   With 0.16.0 and also 0.17.0 the performance issue exists. We tested with SQL 
compatible null handling turned on and off. Performance is bad in both cases. 
As mentioned, the segment scan times are off the charts.
   
   --
   
   I could not demonstrate the issue using the wikipedia dataset because it 
doesn't seem to contain any multi-valued dimensions. Perhaps it would be good 
to have a normative dataset which showcases all field types such that it can be 
used for problem reports that can be reproduced?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to