dkranchii commented on issue #55609:
URL: https://github.com/apache/spark/issues/55609#issuecomment-4347848868

   @nellaivijay thanks for the report. Before we discuss a fix, could you help 
confirm the bug is reproducible? 
   
   ### The repro doesn't OOM on a default config
   
   
   On a default executor (1 GB heap, `spark.memory.fraction=0.6`, 
`spark.memory.storageFraction=0.5`) the 100k-row example from the report fits 
in storage memory. And `df.cache()` uses `MEMORY_AND_DISK` by default, so even 
if it didn't fit, Spark would spill to disk rather than OOM.
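
For reference, a rough sketch of the arithmetic behind that default config. It assumes the 300 MiB that Spark reserves before applying `spark.memory.fraction`, and it treats the heap as exactly 1024 MiB (the JVM usually reports slightly less than `-Xmx`, so real figures come out a little lower). The helper name is hypothetical.

```python
# Back-of-the-envelope sizing of Spark's unified memory region.
# Assumption: 300 MiB is reserved before spark.memory.fraction is applied.
RESERVED_MIB = 300


def unified_memory_mib(heap_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified_region_mib, storage_region_mib) for a given heap size."""
    usable = heap_mib - RESERVED_MIB
    unified = usable * memory_fraction    # shared by execution and storage
    storage = unified * storage_fraction  # share of storage immune to eviction
    return unified, storage


unified, storage = unified_memory_mib(1024)
print(f"unified region: {unified:.1f} MiB")
print(f"storage region: {storage:.1f} MiB")
```

Storage can borrow from the rest of the unified region while execution is idle, so the cached size shown in the UI's Storage tab is the number to compare against these figures.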
   
   ### Spark already has memory-aware caching
   
   The proposal description suggests Spark "loads cache blocks without 
monitoring memory pressure," but that's not quite accurate:
   
   - `MemoryStore.putIteratorAsValues` does **progressive unrolling** and 
checks memory availability incrementally; when storage memory can't be 
acquired it stops unrolling, and the block is written to disk or left uncached 
depending on the storage level.
   - `UnifiedMemoryManager` + `StorageMemoryPool` track storage memory 
precisely and evict cached blocks (LRU) when execution memory needs space.
   - For remote block fetches, `spark.reducer.maxSizeInFlight`, 
`spark.reducer.maxBlocksInFlightPerAddress`, and 
`spark.network.maxRemoteBlockSizeFetchToMem` already bound how much fetched 
data is in flight or buffered in memory.
   
   A new mechanism would need to show why these existing paths are insufficient 
for a specific workload.
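
To make the progressive-unrolling point concrete, here is a toy model of the idea. It is not Spark's code: the function name, parameters, and default numbers are all illustrative assumptions. It only shows the shape of the technique, which is to buffer records, re-check the estimated size periodically, and request more memory in growing steps, stopping as soon as a request can't be satisfied.

```python
from typing import Iterable, List, Tuple


def unroll(records: Iterable[bytes], pool_free: int,
           initial_reserve: int = 64, check_every: int = 4,
           growth: float = 1.5) -> Tuple[bool, List[bytes], int]:
    """Toy progressive unroll.

    Returns (fully_unrolled, buffered_records, bytes_reserved).
    All parameters are illustrative, not Spark defaults.
    """
    if initial_reserve > pool_free:
        return False, [], 0
    reserved = initial_reserve
    buffered: List[bytes] = []
    used = 0
    for i, rec in enumerate(records, start=1):
        buffered.append(rec)
        used += len(rec)
        # Only re-check every few records to keep the bookkeeping cheap.
        if i % check_every == 0 and used >= reserved:
            wanted = int(used * growth) - reserved
            if wanted > pool_free - reserved:
                # Cannot grow: stop here so the caller can spill or drop.
                return False, buffered, reserved
            reserved += wanted
    return True, buffered, reserved


records = [b"x" * 10] * 20
print(unroll(records, pool_free=1000)[0])  # enough memory: fully unrolled
print(unroll(records, pool_free=100)[0])   # too little: stops part-way
```

The second call gives up part-way through instead of buffering everything first, which is the behaviour the proposal would need to improve on.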
   
   ### Concerns with the proposed design
   
   - Using `java.lang.management.MemoryMXBean` (JVM heap usage) is a poor 
signal for Spark's storage memory. Spark's pools are bookkeeping abstractions 
that don't map 1:1 to heap usage (off-heap, deserialized vs serialized, 
execution vs storage), and heap fluctuates with GC. Decisions based on 
`MemoryMXBean` would be jittery and would conflict with 
`UnifiedMemoryManager`'s accounting.
   - "Automatic cache eviction when memory surge detected" during cache 
population would cause thrashing and break user expectations of cache 
persistence. Spark's existing eviction (only when execution memory needs the 
space) was a deliberate design choice.
   
   ### What would help move this forward
   
   To investigate, could you share:
   
   1. The actual **executor** stack trace from the OOM (not driver).
   2. The executor config you're running with (`spark.executor.memory`, 
`spark.executor.cores`, `spark.memory.fraction`, 
`spark.memory.storageFraction`).
   3. The real data scale where you see the issue — the 100k-row example 
doesn't reproduce on a default config.
   4. If possible, a heap dump (`-XX:+HeapDumpOnOutOfMemoryError`) showing 
what's retaining memory at OOM time.
   
   With that, we can identify which code path is actually under pressure and 
propose a targeted fix. Happy to help dig in once we have a concrete reproducer.
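
If it helps with collecting items 2 and 4, a small sketch of how the `spark-submit` arguments could be assembled. The helper name, the application file, the dump directory, and the config values are placeholders, not recommendations; `spark.executor.extraJavaOptions` and `-XX:HeapDumpPath` are the standard Spark setting and JVM flag for this.

```python
from typing import Dict, List


def spark_submit_args(app: str, conf: Dict[str, str],
                      heap_dump_dir: str) -> List[str]:
    """Build a spark-submit argument list that enables executor heap dumps."""
    dump_opts = (f"-XX:+HeapDumpOnOutOfMemoryError "
                 f"-XX:HeapDumpPath={heap_dump_dir}")
    merged = dict(conf)
    # Keep any JVM options the caller already set and append the dump flags.
    existing = merged.get("spark.executor.extraJavaOptions", "")
    merged["spark.executor.extraJavaOptions"] = f"{existing} {dump_opts}".strip()
    args = ["spark-submit"]
    for key in sorted(merged):
        args += ["--conf", f"{key}={merged[key]}"]
    args.append(app)
    return args


# Placeholder values; substitute the config the job actually runs with.
example_conf = {
    "spark.executor.memory": "1g",
    "spark.executor.cores": "1",
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",
}
print(" ".join(spark_submit_args("repro.py", example_conf, "/tmp/dumps")))
```

Pasting the printed command alongside the stack trace would cover the config question in one go.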


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

