amaannawab923 opened a new pull request, #35478:
URL: https://github.com/apache/superset/pull/35478

   
   # 🔧 Summary
   This PR introduces **streaming CSV export** functionality for SQL Lab, 
enabling efficient exports of large query result sets with real-time progress 
tracking. Users can now export millions of rows without timeouts or memory 
errors, while seeing live progress updates during the download.
   
   
https://github.com/user-attachments/assets/cc1dba33-1876-483b-979e-68571ff49759
   
   
   ## ๐Ÿ› The Issue
   When exporting large SQL Lab query results or Charts to CSV, users face 
several critical problems:
   
   - **Memory crashes**: Exporting 500K+ rows loads the entire dataset into 
memory at once, often crashing the browser tab
   - **Timeout failures**: Large exports take 2-3 minutes with no feedback, 
frequently timing out before completion
   - **No progress indication**: Users have no idea if the export is working or 
stuck, leading to multiple retry attempts
   - **Unreliable for production use**: Data analysts avoid CSV export for 
large datasets, resorting to manual workarounds
   
   For example, a data analyst trying to export a 500,000-row financial report 
would experience a frozen browser, no progress bar, and likely a timeout error 
after waiting several minutes.
   
   ## 🧠 Root Cause
   The current CSV export loads **all rows into memory at once** before 
generating the file. This means:
   
   - A 500,000-row export needs 500+ MB of memory loaded simultaneously
   - The browser tab is blocked until the entire file is ready
   - Network timeouts kill the request before large exports finish
   - There's no way to show progress since nothing streams to the user until 
completion
   
   The existing architecture simply cannot scale beyond a few thousand rows 
reliably.
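
   To make the memory problem concrete, the buffered pattern described above looks roughly like this (a minimal illustration, not the exact Superset code): the whole result is serialized into one string before the response is even constructed.

```python
import pandas as pd
from flask import Response


def export_csv_buffered(df: pd.DataFrame) -> Response:
    # The entire result set is serialized into a single in-memory string,
    # so a 500,000-row export holds hundreds of MB before any byte is sent.
    payload = df.to_csv(index=False)
    return Response(payload, mimetype="text/csv")
```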
   
   ## ✅ The Fix
   This PR implements a complete streaming CSV export system that:
   
   ### Backend Improvements
   - **New streaming endpoint** (`/api/v1/sqllab/export_streaming/`) that sends 
data in chunks rather than all at once
   - **Server-side cursors** fetch rows from the database in batches of 1,000 
instead of loading everything upfront
   - **Progressive response streaming** sends 64KB chunks to the browser as they're generated, keeping the connection alive (see the sketch after this list)
   - **Smart session management** prevents database connection issues during 
long-running exports
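
   The following is a minimal sketch of the streaming pattern these bullets describe, assuming a DB-API server-side cursor; the helper name and the exact chunk handling are illustrative rather than the code added in this PR.

```python
import csv
import io

from flask import Response, stream_with_context

BATCH_SIZE = 1_000        # rows fetched per round trip via the server-side cursor
CHUNK_SIZE = 64 * 1024    # ~64KB flushed to the client at a time


def stream_csv(cursor, columns):
    """Stream query results as CSV without materializing the full file."""

    def generate():
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(columns)
        while True:
            rows = cursor.fetchmany(BATCH_SIZE)
            if not rows:
                break
            writer.writerows(rows)
            # Flush to the client whenever roughly one chunk has accumulated.
            if buffer.tell() >= CHUNK_SIZE:
                yield buffer.getvalue()
                buffer.seek(0)
                buffer.truncate(0)
        # Emit whatever is left over after the last batch.
        if buffer.tell():
            yield buffer.getvalue()

    return Response(
        stream_with_context(generate()),
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=result.csv"},
    )
```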
   
   ### Frontend Enhancements
   - **Streaming Export Modal** shows real-time progress with row count, file 
size, and download speed
   - **Automatic threshold detection** using the `CSV_STREAMING_ROW_THRESHOLD` config (default: 100,000 rows)
   - **Small exports use the traditional method** (instant download) while large exports automatically stream with progress
   - **Cancel button** lets users abort long-running exports without closing 
the browser tab
   
   ### Configuration
   Administrators can control when streaming activates by setting 
`CSV_STREAMING_ROW_THRESHOLD` in `superset_config.py`. This same threshold is 
used for both Chart and SQL Lab exports, providing consistent behavior across 
the platform.
   
   ## 🧪 How It Works
   
   1. User clicks "Download to CSV" in SQL Lab
   2. Frontend checks if the result set is larger than `CSV_STREAMING_ROW_THRESHOLD` (default: 100,000 rows)
   3. **For small exports**: Works exactly as before - instant download
   4. **For large exports**:
      - Progress modal opens immediately
      - Backend starts streaming CSV data in chunks
   
   The key difference is that data flows continuously from database → backend → browser in manageable chunks, rather than accumulating in memory all at once.
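
   As a rough illustration of that flow, a client can consume the chunked response while keeping a running progress figure. Only the endpoint path comes from this PR; the request payload, authentication, and file name below are assumptions for the sake of the example.

```python
import requests

URL = "https://superset.example.com/api/v1/sqllab/export_streaming/"

bytes_received = 0
# `stream=True` keeps the response as an open stream instead of buffering it.
# NOTE: the payload fields here are hypothetical; see the PR for the actual contract.
with requests.post(URL, json={"client_id": "abc123"}, stream=True) as resp:
    resp.raise_for_status()
    with open("result.csv", "wb") as fh:
        # Chunks are written to disk as they arrive, so memory stays flat and
        # progress can be reported while the export is still in flight.
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            fh.write(chunk)
            bytes_received += len(chunk)
            print(f"downloaded {bytes_received / 1_048_576:.1f} MB", end="\r")
```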
   
   ## ๐Ÿ–ผ๏ธ UX Enhancements
   
   ### After (with streaming):
   - Click "Download to CSV" on large result set
   - **Progress modal appears instantly** showing export has started
   - **Live progress bar** with percentage
   - **No browser freezing, no timeouts, no crashes**
   
   Users now have full visibility and control over large exports, with the 
confidence that multi-million row downloads will complete successfully.
   
   ## 🎯 Impact
   
   ### Memory Usage
   - **Before**: Entire dataset loaded in memory (500K rows = 500+ MB)
   - **After**: Constant 64KB buffer regardless of dataset size
   
   
   ### User Experience
   - **Before**: No feedback, frequent failures, manual workarounds required
   - **After**: Clear progress, reliable completion, professional export 
experience
   
   
   

