goldjee opened a new issue, #38009:
URL: https://github.com/apache/superset/issues/38009

   ### Bug description
   
   Superset silently corrupts PostgreSQL Unicode escape string literals 
(`U&'...'`), breaking charts with "Data error". sqlglot misparses U& as 
identifier `U` + bitwise AND `&`, adding spaces during SQL regeneration. 
Multiple code paths route SQL through `Dialect.generate()`. This affects 
expression sanitization, virtual dataset RLS, SQL execution, and dialect 
transpilation.
   
   ## Reproduction steps
   
   1. Connect a PostgreSQL database
   2. Create a virtual dataset with SQL containing a Unicode escape string, 
e.g.:
   ```sql
   SELECT (U&'\FE01' || 'Test literal') AS label
   ```
   3. Create a chart using this dataset
   4. Observe the chart fails with: `Error: column "u" does not exist`
   
   ## Expected behavior
   
   The Unicode escape literal `U&'\FE01'` should be preserved as-is and sent to 
PostgreSQL without modification.
   
   ## Actual behavior
   
   The U&'\FE01' expression is corrupted to U & '\FE01' (spaces added around 
&), which PostgreSQL interprets as: column u (bitwise AND) string literal 
'\FE01'.
   
   ### Screenshots/recordings
   
   _No response_
   
   ### Superset version
   
   master / latest-dev
   
   ### Python version
   
   3.11
   
   ### Node version
   
   18 or greater
   
   ### Browser
   
   Chrome
   
   ### Additional context
   
   Sqlglot does not have tokenizer/parser support for PostgreSQL's U&'...' 
syntax. When Superset passes SQL through `Dialect.generate()`, it reconstructs 
the SQL with the U and & treated as separate tokens.
   
   This can be confirmed directly:
   
   ```python
   import sqlglot
   from sqlglot.dialects.dialect import Dialect
   
   sql = r"SELECT (U&'\FE01' || 'test')"
   dialect = Dialect.get_or_raise('postgres')
   
   for stmt in sqlglot.parse(sql, dialect='postgres'):
       result = dialect.generate(stmt, copy=True, pretty=True)
       print(result)
   # Output: SELECT (U & '\FE01' || 'test')  -- CORRUPTED
   ```
   
   ## Possible fixes
   
   1. Workaround on SQL level: replace `U&'\FE01' || '...'` with `E'\uFE01' || 
'...'` or `chr(65025) || '...'`.
   2. Upstream fix in sqlglot: Add `U&'...'` support to sqlglot's PostgreSQL 
tokenizer/parser (the proper long-term fix).
   3. Avoid unnecessary reformatting: Where `Dialect.generate()` is called only 
for validation (not transformation), return the original SQL instead of the 
regenerated version. Only use the regenerated SQL when actual transformations 
(RLS injection, transpilation) were applied.
   4. Make known query limitations opaque by adding context help to SQL query 
input fields.
   
   ### Checklist
   
   - [x] I have searched Superset docs and Slack and didn't find a solution to 
my problem.
   - [x] I have searched the GitHub issue tracker and didn't find a similar bug 
report.
   - [x] I have checked Superset's logs for errors and if I found a relevant 
Python stacktrace, I included it here as text in the "additional context" 
section.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to