Hi, all!
I just wanted to give a heads up that SIP-117
(github.com/apache/superset/issues/26786), "Improve SQL parsing", has
been fully implemented. The entire codebase now uses a single parser
library (`sqlglot`) through two new classes: `SQLScript` and
`SQLStatement` (a script is a sequence of statements).
With this change, the SQL parsing in Superset is now dialect-dependent.
Of the 60 engines we support, 41 have dedicated dialects. Adding new
dialects is relatively easy, and during the work for SIP-117 I created a
Druid dialect (contributed upstream to `sqlglot`) and two dialects for
Firebolt (maintained in the Superset repo). Better yet, all SQL parsing
functionality is now contained in these two classes, with 100% test
coverage. If we ever need to change the parser in the future, we only
have to modify these classes and run the test suite to make sure
everything still works as expected.
The work for SIP-117 took almost 6 months and 18 PRs, and added
approximately 600 lines of code and 800 lines of tests. While it's easy
to forget that Superset even does SQL parsing, it's a critical part of
our codebase. For example, parsing SQL is needed in order to set (or
update) limits in queries, preventing too much data from being loaded
into the UI. And while this might seem simple, keep in mind that
different databases use different syntaxes for limiting rows:
SELECT * FROM t LIMIT 10
SELECT TOP 10 * FROM t
SELECT * FROM t FETCH FIRST 10 ROWS ONLY
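
To make the difference concrete, here is a toy sketch (not Superset's actual implementation) of why a limit has to be rendered per dialect instead of being appended as text. `SimpleSelect`, `render_with_limit`, and the three style names are hypothetical, invented only for this illustration; a real parser such as `sqlglot` works on a full syntax tree and not on a two-field record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSelect:
    """Tiny stand-in for a parsed query of the form SELECT <columns> FROM <table>."""

    columns: str
    table: str


def render_with_limit(query: SimpleSelect, limit: int, style: str) -> str:
    """Render the query with a row limit using one of three hypothetical styles."""
    if style == "limit":
        # Limit goes at the end of the statement.
        return f"SELECT {query.columns} FROM {query.table} LIMIT {limit}"
    if style == "top":
        # Limit goes between SELECT and the column list.
        return f"SELECT TOP {limit} {query.columns} FROM {query.table}"
    if style == "fetch":
        # Limit is a trailing FETCH FIRST clause.
        return (
            f"SELECT {query.columns} FROM {query.table} "
            f"FETCH FIRST {limit} ROWS ONLY"
        )
    raise ValueError(f"unknown limit style: {style}")


query = SimpleSelect(columns="*", table="t")
for style in ("limit", "top", "fetch"):
    print(render_with_limit(query, 10, style))
```

The point of the sketch is that the limit is a property of the parsed query, and only the final rendering step knows where the database expects it.
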
More importantly, SQL parsing is critical for security. It's used to
identify which tables are being accessed when a query runs, so that
Superset can enforce data access roles (DAR). It's used to detect
malicious use of functions that can expose data, as well as the
malicious use of subqueries in ad-hoc expressions. And it's used to
modify arbitrarily complex queries in place, injecting row-level
security (RLS) filters.
I'd like to thank all the contributors who helped with this SIP,
especially Vitor Ávila, Elizabeth Thompson, Antonio Rivero, and Max
Beauchemin.
--Beto