blarghmatey opened a new issue, #40583:
URL: https://github.com/apache/superset/issues/40583

   ## Description
   
   Chart CSV exports fail with `__STREAM_ERROR__:Export failed` for any chart 
whose `row_limit` meets or exceeds `CSV_STREAMING_ROW_THRESHOLD` (default: 
100,000), when the underlying database is Trino (and likely others).
   
   ## Environment
   
   - Superset version: **6.1.0**
   - Database backend: **Trino**
   - `impersonate_user`: disabled on the database connection
   - `GLOBAL_ASYNC_QUERIES`: disabled
   - Chart type: **table** viz, `query_mode: raw`, `server_pagination: false`
   - Chart `row_limit`: **100,000** (equal to default 
`CSV_STREAMING_ROW_THRESHOLD`)
   
   ## Steps to Reproduce
   
   1. Create or open a table chart with `row_limit` ≥ 
`CSV_STREAMING_ROW_THRESHOLD` (default 100,000) backed by a Trino database
   2. Click **Download → .csv** in the Explore or Dashboard view
   3. Observe the download fails; the file contains `__STREAM_ERROR__:Export 
failed. Please try again in some time.`
   4. The Superset pod logs show the caught exception (logged at ERROR level by 
the generator)
   
   ## Root Cause Analysis
   
   ### Trigger: `_should_use_streaming` threshold check
   
   In `superset/charts/data/api.py`, `_should_use_streaming()` gates the new 
`StreamingCSVExportCommand` path:
   
   ```python
   threshold = app.config.get("CSV_STREAMING_ROW_THRESHOLD", 100000)
   # Falls back to row_limit when actual rowcount is unavailable
   # (non-paginated table charts don't populate queries[1])
   actual_row_count = int(row_limit) if row_limit else 0
   return actual_row_count is not None and actual_row_count >= threshold
   ```
   
   For a non-paginated table chart (`server_pagination: false`), the rowcount 
is not in the second query slot, so the code falls back to `row_limit`. A chart 
with `row_limit == 100000` therefore **always** triggers the streaming path.
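
The boundary behaviour can be reproduced in isolation. The following is a minimal sketch, not the Superset implementation: `should_use_streaming`, its parameters, and the `strict` flag are hypothetical stand-ins that mirror only the fallback and comparison quoted above.

```python
from typing import Optional

DEFAULT_THRESHOLD = 100_000  # default CSV_STREAMING_ROW_THRESHOLD per this report


def should_use_streaming(
    row_limit: Optional[int],
    actual_rowcount: Optional[int] = None,
    threshold: int = DEFAULT_THRESHOLD,
    strict: bool = False,
) -> bool:
    """Hypothetical stand-in for the threshold decision described above."""
    # Prefer the real row count; fall back to row_limit when it is unavailable.
    if actual_rowcount is not None:
        count = actual_rowcount
    else:
        count = int(row_limit) if row_limit else 0
    # strict=False mirrors the `>=` check; strict=True models a `>` check.
    return count > threshold if strict else count >= threshold


print(should_use_streaming(100_000))               # True
print(should_use_streaming(99_999))                # False
print(should_use_streaming(100_000, strict=True))  # False
```

With the inclusive comparison, a chart whose `row_limit` equals the threshold is routed to streaming even if it would return far fewer rows.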
   
   ### Bug: generator pushes a new app context over the active request context
   
   `BaseStreamingCSVExportCommand.run()` returns a generator function whose 
body executes inside a **new** app context when Flask iterates it during 
response streaming:
   
   ```python
   def csv_generator() -> Generator[str, None, None]:
       with self._current_app.app_context():   # <-- pushes NEW app context
           with preserve_g_context(captured_g):
               try:
                   yield from self._execute_query_and_stream(
                       sql, database, limit
                   )
               except Exception as e:
                   logger.error("Error in streaming CSV generator: %s", e)
                   yield "__STREAM_ERROR__:Export failed. Please try again in some time.\n"
   ```
   
   `preserve_g_context` restores `g.__dict__` from the original request, but 
the new app context gives Flask-SQLAlchemy a fresh scoped-session scope. 
Inside `_execute_query_and_stream`:
   
   ```python
   with db.session() as session:
       # `database` was loaded in the original request session
       merged_database = session.merge(database)
       with merged_database.get_sqla_engine() as engine:
           with engine.connect() as connection:
               result_proxy = connection.execution_options(
                   stream_results=True
               ).execute(text(sql))
   ```
   
   The combination of (1) a fresh scoped session in the nested app context 
merging an object loaded in the original request session, (2) 
`stream_results=True` passed to the Trino SQLAlchemy driver (which may not 
support server-side cursors in the expected way), and (3) any 
`DB_CONNECTION_MUTATOR` or `get_sqla_engine()` behaviour relying on 
request-scoped state produces an exception that the generator catches and 
returns to the client only as the generic `__STREAM_ERROR__` marker.
   
   **The exception is logged at ERROR level** by the generator (the 
`logger.error` call quoted above records the exception message), but the 
error message that reaches the user provides no debugging information.
   
   ## Additional Observations
   
   - The streaming path **re-runs the Trino query from scratch** even though 
`ChartDataCommand.run()` already fetched the full result set moments earlier. 
For charts at the threshold boundary (e.g., `row_limit == 100000`), data is 
fetched twice.
   - The non-streaming in-memory `CsvResponse` path works correctly for the 
same chart, user, and dataset.
   - `can_export_streaming_csv` appears in the `ab_permission` table of the 
Superset metadata DB but **does not exist anywhere in the 6.1.0 Python 
codebase** and is never checked by any endpoint. This phantom permission causes 
confusion when diagnosing access errors.
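
The first observation suggests that rows already held in memory could be streamed instead of re-queried. A minimal sketch of that idea follows, assuming the result is available as a list of dicts keyed by column name; the function name `stream_csv_from_rows` and that data shape are illustrative assumptions, not Superset's actual result structure.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List


def stream_csv_from_rows(
    columns: List[str],
    rows: Iterable[Dict[str, object]],
    chunk_size: int = 1000,
) -> Iterator[str]:
    """Yield CSV text in chunks from rows that are already in memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    pending = 0
    for row in rows:
        writer.writerow(row)
        pending += 1
        if pending >= chunk_size:
            yield buffer.getvalue()
            # Reset the buffer so each chunk contains only new rows.
            buffer.seek(0)
            buffer.truncate(0)
            pending = 0
    # Flush the header (for empty input) and any remaining rows.
    if buffer.tell():
        yield buffer.getvalue()


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
chunks = list(stream_csv_from_rows(["id", "name"], rows, chunk_size=2))
print(chunks)  # ['id,name\n1,a\n2,b\n', '3,c\n']
```

This avoids the second database round trip, although the full result set still has to fit in memory, so it only helps where the in-memory path already succeeds.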
   
   ## Workaround
   
   Set `CSV_STREAMING_ROW_THRESHOLD` above the maximum practical `row_limit` in 
`superset_config.py`:
   
   ```python
   # Workaround: disable streaming path until validated against all DB drivers
   CSV_STREAMING_ROW_THRESHOLD = 1_000_000
   ```
   
   ## Suggested Fixes
   
   1. **Avoid the nested app context push.** By default Flask tears down the 
request context before a streamed response body is iterated, so the generator 
does not simply inherit it. Wrapping the generator with `stream_with_context` 
would keep the original request context and `g` available during streaming, 
rather than pushing a second app context with its own scoped session.
   
   2. **Avoid re-fetching data.** `_should_use_streaming` receives the full 
`result` dict already loaded by `ChartDataCommand.run()`. The streaming path 
could stream from that in-memory result rather than re-issuing a new database 
query.
   
   3. **Surface the actual exception** in the error marker or at minimum 
include a request/trace ID, rather than a generic message with no debugging 
context.
   
   4. **Fix the threshold comparison**: use `>` instead of `>=` so that a chart 
with `row_limit` exactly equal to `CSV_STREAMING_ROW_THRESHOLD` uses the proven 
in-memory path. Or document that the threshold must be set *above* the maximum 
expected `row_limit`.
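
Suggested fix 3 could look roughly like the sketch below. It is not a patch against Superset: `guarded_stream`, `make_trace_id`, and the marker wording are hypothetical names chosen for illustration. The point is that `logger.exception` records the traceback, while the client receives an ID that can be matched against the logs.

```python
import logging
import uuid
from typing import Callable, Iterable, Iterator

logger = logging.getLogger(__name__)

ERROR_MARKER = "__STREAM_ERROR__"


def guarded_stream(
    source: Iterable[str],
    make_trace_id: Callable[[], str] = lambda: uuid.uuid4().hex,
) -> Iterator[str]:
    """Yield chunks from source; on failure, log and emit a traceable marker."""
    try:
        yield from source
    except Exception:
        trace_id = make_trace_id()
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception(
            "Error in streaming CSV generator (trace_id=%s)", trace_id
        )
        yield f"{ERROR_MARKER}:Export failed (trace_id={trace_id}).\n"


def failing_source() -> Iterator[str]:
    """Simulate a stream that fails after the header has been sent."""
    yield "id,name\n"
    raise RuntimeError("simulated driver failure")


output = list(guarded_stream(failing_source(), make_trace_id=lambda: "abc123"))
print(output[-1])  # __STREAM_ERROR__:Export failed (trace_id=abc123).
```

Exposing only an opaque ID keeps driver and SQL details out of the downloaded file while still giving operators something to search for.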
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

