goldjee opened a new issue, #38009:
URL: https://github.com/apache/superset/issues/38009
### Bug description
Superset silently corrupts PostgreSQL Unicode escape string literals
(`U&'...'`), breaking charts with "Data error". sqlglot misparses U& as
identifier `U` + bitwise AND `&`, adding spaces during SQL regeneration.
Multiple code paths route SQL through `Dialect.generate()`. This affects
expression sanitization, virtual dataset RLS, SQL execution, and dialect
transpilation.
## Reproduction steps
1. Connect a PostgreSQL database
2. Create a virtual dataset with SQL containing a Unicode escape string,
e.g.:
```sql
SELECT (U&'\FE01' || 'Test literal') AS label
```
3. Create a chart using this dataset
4. Observe the chart fails with: `Error: column "u" does not exist`
## Expected behavior
The Unicode escape literal `U&'\FE01'` should be preserved as-is and sent to
PostgreSQL without modification.
## Actual behavior
The U&'\FE01' expression is corrupted to U & '\FE01' (spaces added around
&), which PostgreSQL interprets as: column u (bitwise AND) string literal
'\FE01'.
### Screenshots/recordings
_No response_
### Superset version
master / latest-dev
### Python version
3.11
### Node version
18 or greater
### Browser
Chrome
### Additional context
Sqlglot does not have tokenizer/parser support for PostgreSQL's U&'...'
syntax. When Superset passes SQL through `Dialect.generate()`, it reconstructs
the SQL with the U and & treated as separate tokens.
This can be confirmed directly:
```python
import sqlglot
from sqlglot.dialects.dialect import Dialect
sql = r"SELECT (U&'\FE01' || 'test')"
dialect = Dialect.get_or_raise('postgres')
for stmt in sqlglot.parse(sql, dialect='postgres'):
result = dialect.generate(stmt, copy=True, pretty=True)
print(result)
# Output: SELECT (U & '\FE01' || 'test') -- CORRUPTED
```
## Possible fixes
1. Workaround on SQL level: replace `U&'\FE01' || '...'` with `E'\uFE01' ||
'...'` or `chr(65025) || '...'`.
2. Upstream fix in sqlglot: Add `U&'...'` support to sqlglot's PostgreSQL
tokenizer/parser (the proper long-term fix).
3. Avoid unnecessary reformatting: Where `Dialect.generate()` is called only
for validation (not transformation), return the original SQL instead of the
regenerated version. Only use the regenerated SQL when actual transformations
(RLS injection, transpilation) were applied.
4. Make known query limitations opaque by adding context help to SQL query
input fields.
### Checklist
- [x] I have searched Superset docs and Slack and didn't find a solution to
my problem.
- [x] I have searched the GitHub issue tracker and didn't find a similar bug
report.
- [x] I have checked Superset's logs for errors and if I found a relevant
Python stacktrace, I included it here as text in the "additional context"
section.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]