michael-s-molina commented on PR #36368:
URL: https://github.com/apache/superset/pull/36368#issuecomment-3835015191
@villebro I put together the section below with AI, focused on the database
overhead problem raised by @mistercrunch. It came back with solid
recommendations. Since no existing task has been migrated to GTF yet, feel
free to tackle these recommendations in this PR, or in a follow-up if you'd
prefer a fresh, focused context.
## Database Overhead Analysis for Global Task Framework
### Tables & Indexes
- **`tasks`** table with 8 indexes (`dedup_key`, `status`, `scope`,
  `task_type`, `uuid`, timestamps, `created_by`)
- **`task_subscribers`** table with 1 index
### Per-Task Database Operations
| Task Type | DB Writes |
| ----------------------------- | --------- |
| Minimal (no progress updates) | ~5 |
| Typical (5 progress updates) | ~10 |
| Heavy (frequent updates) | 15+ |
### Abort Polling Overhead (without Redis)
When Redis pub/sub is unavailable, the framework falls back to database
polling for abort detection. The current implementation spawns **one polling
thread per abortable task**, each executing:
```sql
SELECT * FROM tasks WHERE uuid = ? AND status IN ('ABORTING', 'ABORTED')
```
With the default 10-second polling interval, this results in:
- **10 concurrent tasks** = 60 queries/minute
- **100 concurrent tasks** = 600 queries/minute
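For context, this is roughly what the fallback amounts to. A minimal sketch
where `check_aborting` (the per-task SELECT) is injected as a callable, and
all names are assumptions rather than the actual GTF code:
```python
from __future__ import annotations

import threading
from typing import Callable

# Illustrative sketch only: names are assumptions, not the actual GTF code.
def watch_for_abort(
    task_uuid: str,
    check_aborting: Callable[[str], bool],  # runs the per-task SELECT above
    on_abort: Callable[[], None],
    interval: float = 10.0,
) -> threading.Event:
    """Spawn one polling thread for a single abortable task."""
    stop = threading.Event()

    def poll() -> None:
        # Each task pays for its own thread and its own query every interval.
        while not stop.wait(interval):
            if check_aborting(task_uuid):
                on_abort()
                return

    threading.Thread(target=poll, daemon=True).start()
    return stop  # caller sets this on task completion to stop polling
```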
#### Suggested Optimization: Batched Polling
A centralized coordinator could batch these into a single query:
```sql
SELECT uuid FROM tasks
WHERE uuid IN (...) AND status IN ('ABORTING', 'ABORTED')
```
| Approach | Queries/min (10 tasks) | Threads |
| ------------------- | ---------------------- | ------- |
| Current (per-task) | 60 | 10 |
| Batched coordinator | 6 | 1 |
This would reduce both query volume and thread count by ~90%.
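A rough sketch of what such a coordinator could look like. The class and
method names are assumptions, and the batched SELECT is injected as a
callable rather than tied to any real DAO:
```python
from __future__ import annotations

import threading
from typing import Callable

# Illustrative sketch only: names are assumptions, not existing GTF APIs.
class AbortPollingCoordinator:
    """One shared thread polling abort status for all registered tasks."""

    def __init__(
        self,
        fetch_aborting: Callable[[list[str]], list[str]],  # batched SELECT
        interval: float = 10.0,
    ):
        self._fetch_aborting = fetch_aborting
        self._interval = interval
        self._callbacks: dict[str, Callable[[], None]] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None

    def register(self, task_uuid: str, on_abort: Callable[[], None]) -> None:
        with self._lock:
            self._callbacks[task_uuid] = on_abort
            if self._thread is None:  # start the single thread lazily
                self._thread = threading.Thread(target=self._run, daemon=True)
                self._thread.start()

    def unregister(self, task_uuid: str) -> None:
        with self._lock:
            self._callbacks.pop(task_uuid, None)

    def stop(self) -> None:
        self._stop.set()

    def _run(self) -> None:
        while not self._stop.wait(self._interval):
            with self._lock:
                uuids = list(self._callbacks)
            if not uuids:
                continue
            # One IN (...) query replaces N per-task queries per interval.
            for uuid in self._fetch_aborting(uuids):
                with self._lock:
                    callback = self._callbacks.pop(uuid, None)
                if callback:
                    callback()
```
Tasks register on start and unregister on completion; the coordinator pays
one query per interval regardless of task count.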
### Progress Update Overhead
Each `ctx.update_task()` call hits the database (2 SELECTs + 1 UPDATE; see
the breakdown in the next section). Tasks with tight loops can generate
significant DB load:
```python
for i in range(10_000):
    process_item(i)
    ctx.update_task(progress_percent=i / 100)  # 10,000 calls → 10,000 UPDATEs (plus SELECTs)
```
#### Suggested Optimization: Framework-Level Throttling
The framework should protect itself from write storms rather than relying on
task implementers. `TaskContext.update_task()` could:
1. Track last DB write timestamp
2. Only flush to DB if `min_update_interval` (e.g., 1 second) has elapsed
3. Keep latest values in memory between flushes
4. Always flush on task completion
This allows task code to call `update_task()` freely without worrying about
DB overhead.
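A minimal sketch of the throttling logic, assuming a `_flush_to_db()` hook
that performs the actual write. The names here are illustrative, not the
existing `TaskContext` API:
```python
from __future__ import annotations

import time
from typing import Any

# Illustrative sketch: method and attribute names are assumptions,
# not the existing TaskContext API.
class ThrottledTaskContext:
    def __init__(self, min_update_interval: float = 1.0):
        self._min_update_interval = min_update_interval
        self._last_flush = 0.0
        self._pending: dict[str, Any] = {}

    def update_task(self, **fields: Any) -> None:
        self._pending.update(fields)  # keep latest values in memory
        now = time.monotonic()
        if now - self._last_flush >= self._min_update_interval:
            self._flush()

    def complete(self) -> None:
        self._flush()  # always flush on task completion

    def _flush(self) -> None:
        if self._pending:
            self._flush_to_db(self._pending)
            self._pending.clear()
        self._last_flush = time.monotonic()

    def _flush_to_db(self, fields: dict[str, Any]) -> None:
        """Stand-in for the real write path (e.g. an update command)."""
```
With this in place, the 10,000-iteration loop above issues at most one write
per second plus a final flush, instead of 10,000 writes.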
### UpdateTaskCommand Overhead
Each `update_task()` call currently executes:
1. `validate()` - 1 SELECT to find task
2. `_execute_update()` - 1 SELECT again (re-fetch under lock)
3. `TaskDAO.update()` - 1 UPDATE
**Total: 2 SELECTs + 1 UPDATE per call**
Additionally, the `TaskContext._task` property issues a fresh SELECT on every access.
#### Suggested Optimizations
1. **Cache task state in TaskContext** - Avoid repeated SELECTs during
execution; task runs single-threaded so caching is safe
2. **Eliminate double SELECT** - `validate()` and `_execute_update()` both
fetch the task; could reuse
3. **Skip locking for progress-only updates** - Distributed lock on every
update is overkill; progress updates don't conflict with submit/cancel
4. **Write-behind for progress** - Queue updates in memory, flush
periodically or on completion (combines with throttling)
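A rough sketch of optimizations 1 and 2 combined. `TaskDAO` here is a stub
standing in for the real DAO, and every name is an assumption:
```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

# Illustrative sketch: Task, TaskDAO, and the command shape are stubs
# mirroring the descriptions above, not the actual implementation.

@dataclass
class Task:
    uuid: str
    status: str = "RUNNING"
    properties: dict[str, Any] = field(default_factory=dict)

class TaskDAO:
    _store: dict[str, Task] = {}

    @classmethod
    def find_by_uuid(cls, uuid: str) -> Task | None:
        return cls._store.get(uuid)  # stands in for the SELECT

    @classmethod
    def update(cls, task: Task, updates: dict[str, Any]) -> Task:
        task.properties.update(updates)  # stands in for the UPDATE
        return task

class UpdateTaskCommand:
    """Fetch the task once in validate(); reuse that row in run()."""

    def __init__(self, task_uuid: str, updates: dict[str, Any]):
        self._task_uuid = task_uuid
        self._updates = updates
        self._task: Task | None = None

    def validate(self) -> None:
        self._task = TaskDAO.find_by_uuid(self._task_uuid)  # single SELECT
        if self._task is None:
            raise ValueError(f"Task {self._task_uuid} not found")

    def run(self) -> Task:
        self.validate()
        # Reuse the row fetched in validate() instead of re-fetching it
        # under the lock: 1 SELECT + 1 UPDATE instead of 2 SELECTs + 1 UPDATE.
        return TaskDAO.update(self._task, self._updates)
```
Write-behind for progress (point 4) then composes with the throttled context
sketched earlier: queued values flow through this single-SELECT path whenever
a flush happens.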