andygrove opened a new issue, #3490:
URL: https://github.com/apache/datafusion-comet/issues/3490
## Background
PR #2521 added memory reservation debug logging (`spark.comet.debug.memory`
config and `LoggingPool` wrapper). That PR also contained Python scripts for
parsing and visualizing the memory debug logs, but those scripts were not
merged. This issue tracks adding analysis/visualization scripts as a follow-up.
## Log Format
When `spark.comet.debug.memory=true` is set, the `LoggingPool` produces log
lines like:
```
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
```
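A line format like this can be captured with a single regex; the sketch below is an assumption about how the parser might match it (the group names are illustrative, not from the #2521 code):

```python
import re

# Hypothetical pattern for the log lines above. The consumer name can itself
# contain brackets (e.g. "ExternalSorter[6]"), so the inner group is lazy and
# anchored by the literal "]." that follows it. The "returning Ok/Err" suffix
# is optional because shrink() lines omit it.
LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>\w+)\((?P<size>\d+)\)(?: returning (?P<result>Ok|Err))?"
)

m = LINE_RE.match(
    "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
)
print(m.group("task"), m.group("consumer"), m.group("method"),
      m.group("size"), m.group("result"))
# → 486 ExternalSorter[6] try_grow 256232960 Ok
```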
## Proposed Scripts
### 1. `dev/scripts/mem_debug_to_csv.py` — Parse logs to CSV
Parses the Spark executor/worker log file, filters by task ID, and tracks
cumulative memory allocation per consumer (operator).
Key details from the #2521 implementation:
- Uses regex to parse lines matching `[Task <id>]
MemoryPool[<consumer>].<method>(<size>)`
- Tracks running total per consumer: `grow`/`try_grow` add to allocation,
`shrink` subtracts
- For `try_grow` failures (line contains "Err"), the allocation is **not**
updated but the row is annotated with an `ERR` label
- Outputs CSV with columns: `name, size, label`
- Accepts `--task <id>` to filter to a specific Spark task and `--file
<path>` for the log file
### 2. `dev/scripts/plot_memory_usage.py` — Visualize memory usage
Reads the CSV output and produces a stacked area chart showing memory usage
over time by consumer (operator).
Key details from the #2521 implementation:
- Uses pandas and matplotlib
- Creates a time index from row order (each row = sequential event)
- Pivots data so each consumer is a column, forward-fills missing values
- Renders a stacked area chart (`plt.stackplot`)
- Annotates `try_grow` failures with red vertical dashed lines labeled "ERR"
- Saves chart as PNG (same path as CSV but with `_chart.png` suffix)
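A possible shape for the plotting script, again as a hedged sketch rather than the #2521 code (column names match the CSV above; styling choices and the `plot()` helper are assumptions):

```python
#!/usr/bin/env python3
"""Sketch of a stacked-area chart over the mem_debug_to_csv output."""
import sys
import matplotlib
matplotlib.use("Agg")  # render off-screen; must precede pyplot import
import matplotlib.pyplot as plt
import pandas as pd

def plot(csv_path):
    # keep_default_na=False so an empty label stays "" rather than NaN
    df = pd.read_csv(csv_path, keep_default_na=False)
    df["event"] = range(len(df))       # row order is the time axis
    # one column per consumer; ffill() (not the deprecated
    # fillna(method="ffill")) carries each total forward between events
    wide = df.pivot(index="event", columns="name", values="size")
    wide = wide.ffill().fillna(0)
    plt.stackplot(wide.index, *[wide[c] for c in wide.columns],
                  labels=list(wide.columns))
    # red dashed vertical lines where try_grow failed
    for ev in df.loc[df["label"] == "ERR", "event"]:
        plt.axvline(ev, color="red", linestyle="--", label="ERR")
    plt.legend(loc="upper left")
    plt.xlabel("event")
    plt.ylabel("bytes")
    out = csv_path.rsplit(".", 1)[0] + "_chart.png"
    plt.savefig(out)
    return out

if __name__ == "__main__" and len(sys.argv) > 1:
    print(plot(sys.argv[1]))
```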
## Suggestions from PR #2521 Code Review
The following review feedback should be incorporated:
1. **Use `#!/usr/bin/env python3` shebang** and make scripts executable
(`chmod +x`)
2. **Fix CSV formatting** — use f-strings
(`f"{consumer},{alloc[consumer]}"`) instead of `print(consumer, ",",
alloc[consumer])` to avoid extra spaces around values
3. **Fix ERR label handling** — the original implementation printed two rows
for the same event on `try_grow` failure (one with ERR label, one without). Use
a label variable so only one row is printed per event
4. **Handle first occurrence being `shrink`** — the original code assumed
the first event for a consumer is always `grow`/`try_grow`, but the first event
could be a `shrink`
5. **Fix `--task` argument** — `int(None)` fails with TypeError when
`--task` is not provided; make it optional or a positional arg
6. **Consider making `--file` a positional argument** for simpler CLI usage
7. **Use `pandas.DataFrame.ffill()`** instead of deprecated
`fillna(method='ffill')` (deprecated since pandas 2.1.0)
8. **Consider logging backtraces** — when the backtrace feature is enabled,
it could be useful to log backtraces on every call (not just errors) to trace
precise allocation origins. This was suggested as an optional `trace!`-level
enhancement to the Rust `LoggingPool`
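Item 7 is mechanical to apply; a minimal pandas example of the replacement:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
# deprecated since pandas 2.1.0:
#   s.fillna(method="ffill")
# preferred replacement, identical behavior:
print(s.ffill().tolist())  # → [1.0, 1.0, 1.0, 4.0]
```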
## Example Workflow
```shell
# Step 1: Run Spark with memory debug logging enabled
spark-submit --conf spark.comet.debug.memory=true ...
# Step 2: Parse the log and generate CSV for a specific task
python3 dev/scripts/mem_debug_to_csv.py --task 486 /path/to/executor/log > /tmp/mem.csv
# Step 3: Generate a chart
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv
```
## Reference
- PR #2521: https://github.com/apache/datafusion-comet/pull/2521
- Example charts from #2521 showing stacked memory usage per operator with
ERR annotations for failed `try_grow` calls
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]