andygrove opened a new issue, #3490:
URL: https://github.com/apache/datafusion-comet/issues/3490
## Background
PR #2521 added memory reservation debug logging (`spark.comet.debug.memory`
config and `LoggingPool` wrapper). That PR also contained Python scripts for
parsing and visualizing the memory debug logs, but those scripts were not
merged. This issue tracks adding analysis/visualization scripts as a follow-up.
## Log Format
When `spark.comet.debug.memory=true` is set, the `LoggingPool` produces log
lines like:
```
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok
[Task 486] MemoryPool[ExternalSorter[6]].try_grow(257820416) returning Err
[Task 486] MemoryPool[ExternalSorterMerge[6]].shrink(10485760)
[Task 486] MemoryPool[ExternalSorterMerge[6]].try_grow(68928) returning Ok
```
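A line format like this can be captured with a single regex; the sketch below is an assumption about how the parser might match it (the group names are illustrative, not from the #2521 code):

```python
import re

# Hypothetical pattern for the log lines above. The consumer name can itself
# contain brackets (e.g. "ExternalSorter[6]"), so the inner group is lazy and
# anchored by the literal "]." that follows it. The "returning Ok/Err" suffix
# is optional because shrink() lines omit it.
LINE_RE = re.compile(
    r"\[Task (?P<task>\d+)\] MemoryPool\[(?P<consumer>.+?)\]"
    r"\.(?P<method>\w+)\((?P<size>\d+)\)(?: returning (?P<result>Ok|Err))?"
)

m = LINE_RE.match(
    "[Task 486] MemoryPool[ExternalSorter[6]].try_grow(256232960) returning Ok"
)
print(m.group("task"), m.group("consumer"), m.group("method"),
      m.group("size"), m.group("result"))
# → 486 ExternalSorter[6] try_grow 256232960 Ok
```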
## Proposed Scripts
### 1. `dev/scripts/mem_debug_to_csv.py` — Parse logs to CSV
Parses the Spark executor/worker log file, filters by task ID, and tracks
cumulative memory allocation per consumer (operator).
Key details from the #2521 implementation:
- Uses regex to parse lines matching `[Task <id>]
MemoryPool[<consumer>].<method>(<size>)`
- Tracks running total per consumer: `grow`/`try_grow` add to allocation,
`shrink` subtracts
- For `try_grow` failures (line contains "Err"), the allocation is **not**
updated but the row is annotated with an `ERR` label
- Outputs CSV with columns: `name, size, label`
- Accepts `--task <id>` to filter to a specific Spark task and `--file
<path>` for the log file
### 2. `dev/scripts/plot_memory_usage.py` — Visualize memory usage
Reads the CSV output and produces a stacked area chart showing memory usage
over time by consumer (operator).
Key details from the #2521 implementation:
- Uses pandas and matplotlib
- Creates a time index from row order (each row = sequential event)
- Pivots data so each consumer is a column, forward-fills missing values
- Renders a stacked area chart (`plt.stackplot`)
- Annotates `try_grow` failures with red vertical dashed lines labeled "ERR"
- Saves chart as PNG (same path as CSV but with `_chart.png` suffix)
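A possible shape for the plotting script, again as a hedged sketch rather than the #2521 code (column names match the CSV above; styling choices and the `plot()` helper are assumptions):

```python
#!/usr/bin/env python3
"""Sketch of a stacked-area chart over the mem_debug_to_csv output."""
import sys
import matplotlib
matplotlib.use("Agg")  # render off-screen; must precede pyplot import
import matplotlib.pyplot as plt
import pandas as pd

def plot(csv_path):
    # keep_default_na=False so an empty label stays "" rather than NaN
    df = pd.read_csv(csv_path, keep_default_na=False)
    df["event"] = range(len(df))       # row order is the time axis
    # one column per consumer; ffill() (not the deprecated
    # fillna(method="ffill")) carries each total forward between events
    wide = df.pivot(index="event", columns="name", values="size")
    wide = wide.ffill().fillna(0)
    plt.stackplot(wide.index, *[wide[c] for c in wide.columns],
                  labels=list(wide.columns))
    # red dashed vertical lines where try_grow failed
    for ev in df.loc[df["label"] == "ERR", "event"]:
        plt.axvline(ev, color="red", linestyle="--", label="ERR")
    plt.legend(loc="upper left")
    plt.xlabel("event")
    plt.ylabel("bytes")
    out = csv_path.rsplit(".", 1)[0] + "_chart.png"
    plt.savefig(out)
    return out

if __name__ == "__main__" and len(sys.argv) > 1:
    print(plot(sys.argv[1]))
```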
## Suggestions from PR #2521 Code Review
The following review feedback should be incorporated:
1. **Use `#!/usr/bin/env python3` shebang** and make scripts executable
(`chmod +x`)
2. **Fix CSV formatting** — use f-strings
(`f"{consumer},{alloc[consumer]}"`) instead of `print(consumer, ",",
alloc[consumer])` to avoid extra spaces around values
3. **Fix ERR label handling** — the original implementation printed two rows
for the same event on `try_grow` failure (one with ERR label, one without). Use
a label variable so only one row is printed per event
4. **Handle first occurrence being `shrink`** — the original code assumed
the first event for a consumer is always `grow`/`try_grow`, but the first event
could be a `shrink`
5. **Fix `--task` argument** — `int(None)` fails with TypeError when
`--task` is not provided; make it optional or a positional arg
6. **Consider making `--file` a positional argument** for simpler CLI usage
7. **Use `pandas.DataFrame.ffill()`** instead of deprecated
`fillna(method='ffill')` (deprecated since pandas 2.1.0)
8. **Consider logging backtraces** — when the backtrace feature is enabled,
it could be useful to log backtraces on every call (not just errors) to trace
precise allocation origins. This was suggested as an optional `trace!`-level
enhancement to the Rust `LoggingPool`
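Item 7 is mechanical to apply; a minimal pandas example of the replacement:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
# deprecated since pandas 2.1.0:
#   s.fillna(method="ffill")
# preferred replacement, identical behavior:
print(s.ffill().tolist())  # → [1.0, 1.0, 1.0, 4.0]
```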
## Example Workflow
```shell
# Step 1: Run Spark with memory debug logging enabled
spark-submit --conf spark.comet.debug.memory=true ...
# Step 2: Parse the log and generate CSV for a specific task
python3 dev/scripts/mem_debug_to_csv.py --task 486 /path/to/executor/log > /tmp/mem.csv
# Step 3: Generate a chart
python3 dev/scripts/plot_memory_usage.py /tmp/mem.csv
```
## Reference
- PR #2521: https://github.com/apache/datafusion-comet/pull/2521
- Example charts from #2521 showing stacked memory usage per operator with
ERR annotations for failed `try_grow` calls
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]