[GitHub] [superset] dvchristianbors opened a new issue, #19567: Significant Increase in Querying Time Due to Sqlparse

GitBox Wed, 06 Apr 2022 08:00:27 -0700


dvchristianbors opened a new issue, #19567:
URL: https://github.com/apache/superset/issues/19567


   When increasing the number of keys in `IN` clauses, the runtime of the query 
is significantly increased. Queries with large numbers of keys 
run for a very long time (several seconds for 200 keys), even though a 
direct query on the underlying database terminates within milliseconds.
   
   
   See discussion in Slack: 
https://apache-superset.slack.com/archives/C014LS99C1K/p1633448327074000
   
   #### How to reproduce the bug
   Precondition: Load an example dataset
   
   1. Go to 'Data' and open "Explore" _your example dataset_ (e.g., birth_names 
from the example data)
   2. Click on 'Query Mode' > 'Raw Records' and add at least one column
   3. In Filters, click the _Plus_-button to add a new Filter - Custom SQL
   4. Enter a short `IN` query with several keys, e.g. `name IN ("Liam", 
"James", "Noah", "Wyatt", "Gabriel", "Lucas", "Ethan", "Alexander", "Joseph", 
"Benjamin")`
   5. Enter a long `IN` query with hundreds of keys, e.g., `name IN 
("Liam","James", "Noah", "Wyatt", "Gabriel", "Lucas", "Ethan", "Alexander", 
"Joseph", "Benjamin", "William", "Logan", "Mason", "Jack", "John", "Asher", 
"Elijah", "Daniel", "Henry", "Jacob", "Jaxon", "Michael", "Oliver", "Hunter", 
"David", "Levi", "Matthew", "Landon", "Aiden", "Isaac", "Jackson", "Caleb", 
"Ryan", "Elias", "Connor", "Evan", "Joshua", "Samuel", "Christian", "Jayden", 
"Jeremiah", "Cooper", "Eli", "Robert", "Ryder", "Christopher", "Colton", 
"Josiah", "Andrew", "Austin", "Carson", "Jaxson", "Jonathan", "Luke", 
"Malachi", "Nathan", "Owen", "Blake", "Lincoln", "Ezra", "Gavin", "Thomas", 
"Dylan", "Grayson", "Kai", "Ryker", "Zachary", "Anthony", "Isaiah", "Jase", 
"Jason", "Micah", "Sebastian", "Silas", "Titus", "Bentley", "Brody", "Cameron", 
"Carter", "Chase", "Gideon", "Jace", "Sawyer", "Tristan", "Tyler", "Weston", 
"Adam", "Charles", "Everett", "Wesley", "Xander", "Brandon", "Brayden", 
"Nathaniel", "Theodore", "Xavier", "Ashton", "Avery", "Dominic", "Easton", 
"Finn", "George", 
"Hudson", "Ian", "Jasper", "Kayden", "Marshall", "Max", "Maxwell", "Miles", 
"Orion", "Richard", "Timothy", "Abel", "Drake", "Garrett", "Jameson", "Jayce", 
"Joel", "Kenneth", "Maximus", "Nicholas", "Parker", "Travis", "Cody", "Dean", 
"Declan", "Elliot", "Ezekiel", "Karter", "Nolan", "Patrick", "Riley", "Seth", 
"Solomon", "Steven", "Victor", "Waylon", "Aaron", "August", "Bradley", 
"Braxton", "Bryce", "Calvin", "Camden", "Cayden", "Charlie", "Cole", "Damian", 
"Dawson", "Eric", "Greyson", "Jake", "Jeffrey", "Jesse", "Jonah", "Julian", 
"Kaiden", "Killian", "Kingston", "Maddox", "Matthias", "Maverick", "Odin", 
"Paul", "Peter", "Roman", "Trevor", "Zane", "Alex", "Archer", "Caden", 
"Collin", "Colt", "Edward", "Gage", "Gunner", "Harrison", "Ivan", "Jax", "Leo", 
"Lukas", "Marcus", "Paxton", "Soren", "Sullivan", "Tanner", "Trenton", "Troy", 
"Tucker", "Vincent", "Walter", "Warren", "Adrian", "Augustus", "Axel", 
"Beckett", "Cade", "Clayton", "Dante")` (see a list of baby names 
[here](https://www.kaggle.com/datasets/kaggle/us-baby-names?resource=download&select=StateNames.csv))
 
   6. Compare the computation times of the two queries. Running the same `IN` 
clause directly against the database returns the data almost instantly.
   
   ### Expected results
   
   Computation time is roughly similar, within a few milliseconds.
   
   ### Actual results
   
   In my local setup, the difference is:
   - Short `IN` query: 0.34 sec
   - Long `IN` query: 1.62 sec
   - Even longer `IN` query: 6.07 sec
   
   As more keys are added to the `IN` clause, the runtime grows quadratically.
   
   #### Screenshots
   Short `IN` query (20 keys):
   
![image](https://user-images.githubusercontent.com/84898946/161999780-ab917129-0ba8-45c4-a9b7-1a1cb220c71c.png)
   
   
   Long `IN` query (200 keys):
   
![image](https://user-images.githubusercontent.com/84898946/161999503-e7ae4494-7eb2-440d-af28-5159cf9a26c9.png)
   
   Even longer `IN` query (500 keys):
   
![image](https://user-images.githubusercontent.com/84898946/162000326-56f6dbb3-cc9c-4b17-bfaa-57e102e8a3fe.png)
   
   ### Environment
   
   - browser type and version: Tested on Chrome: Version 100.0.4896.60 
(Official Build) (64-bit)
   - superset version: master branch commit 
03d3eaacafc6ebdad7fdbcef6efa4df553468ba1
   - python version: 3.8.10
   - node.js version: v12.22.9
   - any feature flags active: None
   
   ### Checklist
   
   Make sure to follow these steps before submitting your issue - thank you!
   
   ### Additional context
   The quadratic runtime is caused by the `sqlparse.parse` and 
`sqlparse.format` functions, which are called in numerous places 
(`/models/core.py`, `db_engine_specs/base.py`, `connectors/sqla/models.py`, and 
`common/query_object.py`).
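
   To check the parsing cost in isolation from Superset and the database, a 
minimal timing sketch could look like the following. It assumes `sqlparse` is 
installed; `build_in_clause`, `time_call`, and `benchmark` are hypothetical 
helper names made up for this illustration, and the keys are synthetic rather 
than taken from the example dataset. No timings are claimed here, since they 
depend on the machine and the `sqlparse` version.

```python
import time


def build_in_clause(column, keys):
    """Build a filter string such as: name IN ('a', 'b')."""
    # Escape single quotes by doubling them, as in standard SQL string literals.
    quoted = ", ".join("'" + key.replace("'", "''") + "'" for key in keys)
    return f"{column} IN ({quoted})"


def time_call(func, *args, **kwargs):
    """Return the wall-clock seconds taken by a single call to func."""
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


def benchmark(sizes=(20, 200, 500)):
    """Time sqlparse.parse and sqlparse.format on IN clauses of growing size."""
    # Third-party dependency; imported here so the helpers above work without it.
    import sqlparse

    results = []
    for size in sizes:
        keys = [f"name_{i}" for i in range(size)]  # synthetic keys (assumption)
        clause = build_in_clause("name", keys)
        parse_seconds = time_call(sqlparse.parse, clause)
        format_seconds = time_call(sqlparse.format, clause, reindent=True)
        results.append((size, parse_seconds, format_seconds))
    return results


if __name__ == "__main__":
    for size, parse_s, format_s in benchmark():
        print(f"{size:>4} keys  parse={parse_s:.3f}s  format={format_s:.3f}s")
```

   If the parse and format times grow much faster than the key count, that 
points at the parser rather than the database as the bottleneck.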
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

