blarghmatey opened a new issue, #40583:
URL: https://github.com/apache/superset/issues/40583
## Description
Chart CSV exports fail with `__STREAM_ERROR__:Export failed` for any chart
whose `row_limit` meets or exceeds `CSV_STREAMING_ROW_THRESHOLD` (default:
100,000), when the underlying database is Trino (and likely others).
## Environment
- Superset version: **6.1.0**
- Database backend: **Trino**
- `impersonate_user`: disabled on the database connection
- `GLOBAL_ASYNC_QUERIES`: disabled
- Chart type: **table** viz, `query_mode: raw`, `server_pagination: false`
- Chart `row_limit`: **100,000** (equal to default
`CSV_STREAMING_ROW_THRESHOLD`)
## Steps to Reproduce
1. Create or open a table chart with `row_limit` ≥
`CSV_STREAMING_ROW_THRESHOLD` (default 100,000) backed by a Trino database
2. Click **Download → .csv** in the Explore or Dashboard view
3. Observe the download fails; the file contains `__STREAM_ERROR__:Export
failed. Please try again in some time.`
4. The Superset pod logs show the caught exception (logged at ERROR level by
the generator)
## Root Cause Analysis
### Trigger: `_should_use_streaming` threshold check
In `superset/charts/data/api.py`, `_should_use_streaming()` gates the new
`StreamingCSVExportCommand` path:
```python
threshold = app.config.get("CSV_STREAMING_ROW_THRESHOLD", 100000)
# Falls back to row_limit when actual rowcount is unavailable
# (non-paginated table charts don't populate queries[1])
actual_row_count = int(row_limit) if row_limit else 0
return actual_row_count is not None and actual_row_count >= threshold
```
For a non-paginated table chart (`server_pagination: false`), the rowcount
is not in the second query slot, so the code falls back to `row_limit`. A chart
with `row_limit == 100000` therefore **always** triggers the streaming path.
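The boundary behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the Superset implementation: `should_use_streaming` and `DEFAULT_THRESHOLD` are hypothetical names, and the function models only the `row_limit` fallback shown in the excerpt.

```python
from typing import Optional

# Assumption: mirrors the documented default of CSV_STREAMING_ROW_THRESHOLD.
DEFAULT_THRESHOLD = 100_000


def should_use_streaming(
    row_limit: Optional[int], threshold: int = DEFAULT_THRESHOLD
) -> bool:
    """Hypothetical stand-in for the fallback branch of the threshold check."""
    # With no actual rowcount available, the configured row_limit is used.
    actual_row_count = int(row_limit) if row_limit else 0
    # Inclusive comparison: a row_limit equal to the threshold streams.
    return actual_row_count >= threshold


at_boundary = should_use_streaming(100_000)
just_below = should_use_streaming(99_999)
no_limit = should_use_streaming(None)
```

Under these assumptions, a chart whose `row_limit` equals the threshold takes the streaming path regardless of how many rows the query actually returns.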
### Bug: generator pushes a new app context over the active request context
`BaseStreamingCSVExportCommand.run()` returns a callable that executes
inside a **new** app context when iterated by Flask during response streaming:
```python
def csv_generator() -> Generator[str, None, None]:
    with self._current_app.app_context():  # <-- pushes NEW app context
        with preserve_g_context(captured_g):
            try:
                yield from self._execute_query_and_stream(sql, database, limit)
            except Exception as e:
                logger.error("Error in streaming CSV generator: %s", e)
                yield "__STREAM_ERROR__:Export failed. Please try again in some time.\n"
```
`preserve_g_context` restores `g.__dict__` from the original request, but
the new app context creates a fresh Flask-SQLAlchemy scoped session scope.
Inside `_execute_query_and_stream`:
```python
with db.session() as session:
    merged_database = session.merge(database)  # database from original session
    with merged_database.get_sqla_engine() as engine:
        with engine.connect() as connection:
            result_proxy = connection.execution_options(
                stream_results=True
            ).execute(text(sql))
```
Three things combine to produce the exception: (1) a fresh scoped session in
the nested app context merges an object that was loaded in the original request
session; (2) `stream_results=True` is passed to the Trino SQLAlchemy driver,
which may not support server-side cursors in the expected way; and (3) any
`DB_CONNECTION_MUTATOR` or `get_sqla_engine()` behaviour may rely on
request-scoped state. The exception is caught by the generator's broad
`except Exception` and returned to the client as `__STREAM_ERROR__`.
**The exception is logged at ERROR level** by the generator, but the error
message that reaches the user provides no debugging information.
## Additional Observations
- The streaming path **re-runs the Trino query from scratch** even though
`ChartDataCommand.run()` already fetched the full result set moments earlier.
For charts at the threshold boundary (e.g., `row_limit == 100000`), data is
fetched twice.
- The non-streaming in-memory `CsvResponse` path works correctly for the
same chart, user, and dataset.
- `can_export_streaming_csv` appears in the `ab_permission` table of the
Superset metadata DB but **does not exist anywhere in the 6.1.0 Python
codebase** and is never checked by any endpoint. This phantom permission causes
confusion when diagnosing access errors.
## Workaround
Set `CSV_STREAMING_ROW_THRESHOLD` above the maximum practical `row_limit` in
`superset_config.py`:
```python
# Workaround: disable streaming path until validated against all DB drivers
CSV_STREAMING_ROW_THRESHOLD = 1_000_000
```
## Suggested Fixes
1. **Remove the nested app context push.** If the generator is kept inside
the active request context during response streaming (Flask pops that context
before a streamed body is iterated unless the generator is wrapped with
`stream_with_context`), a second app context is unnecessary and causes scoped
session conflicts. Use the existing request-scoped `g` and session directly.
2. **Avoid re-fetching data.** `_should_use_streaming` receives the full
`result` dict already loaded by `ChartDataCommand.run()`. The streaming path
could stream from that in-memory result rather than re-issuing a new database
query.
3. **Surface the actual exception** in the error marker or at minimum
include a request/trace ID, rather than a generic message with no debugging
context.
4. **Fix the threshold comparison**: use `>` instead of `>=` so that a chart
with `row_limit` exactly equal to `CSV_STREAMING_ROW_THRESHOLD` uses the proven
in-memory path. Or document that the threshold must be set *above* the maximum
expected `row_limit`.
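As an illustration of suggested fix 3, the sketch below wraps a row generator so that failures are logged with a full traceback and the client-facing marker carries a correlation ID. It is a minimal sketch using only the standard library: `stream_with_trace_id`, `ERROR_PREFIX`, and the `trace_id` format are hypothetical and not part of Superset.

```python
import logging
import uuid
from typing import Callable, Iterator

logger = logging.getLogger(__name__)

# Assumption: same marker prefix as the one observed in the failed download.
ERROR_PREFIX = "__STREAM_ERROR__:"


def stream_with_trace_id(make_rows: Callable[[], Iterator[str]]) -> Iterator[str]:
    """Yield rows from make_rows(); on failure, emit a marker with a trace ID."""
    trace_id = uuid.uuid4().hex
    try:
        yield from make_rows()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception(
            "Error in streaming CSV generator (trace_id=%s)", trace_id
        )
        # Expose only the correlation ID, not the exception text itself.
        yield (
            f"{ERROR_PREFIX}Export failed (trace_id={trace_id}). "
            "Please try again in some time.\n"
        )


def failing_rows() -> Iterator[str]:
    """Hypothetical row source that fails after the header row."""
    yield "a,b\n"
    raise RuntimeError("driver failure")


chunks = list(stream_with_trace_id(failing_rows))
```

Keeping the exception text out of the marker avoids leaking driver or connection details to the browser, while the trace ID lets an operator find the matching log entry.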