dkranchii commented on issue #55609: URL: https://github.com/apache/spark/issues/55609#issuecomment-4347848868
@nellaivijay thanks for the report. Before we discuss a fix, could you help confirm the bug is reproducible?

### The repro doesn't OOM on a default config

On a default executor (1 GB heap, `spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`) the 100k-row example fits in storage memory. And `df.cache()` uses `MEMORY_AND_DISK` by default, so even if it didn't fit, Spark would spill to disk rather than OOM.

### Spark already has memory-aware caching

The proposal description suggests Spark "loads cache blocks without monitoring memory pressure," but that's not quite accurate:

- `MemoryStore.putIteratorAsValues` does **progressive unrolling** and checks memory availability incrementally; it aborts/spills when storage memory can't be acquired.
- `UnifiedMemoryManager` + `StorageMemoryPool` track storage memory precisely and evict cached blocks (LRU) when execution memory needs space.
- For remote block fetches, `spark.reducer.maxSizeInFlight`, `spark.reducer.maxBlocksInFlightPerAddress`, and `spark.network.maxRemoteBlockSizeFetchToMem` already throttle.

A new mechanism would need to show why these existing paths are insufficient for a specific workload.

### Concerns with the proposed design

- Using `java.lang.management.MemoryMXBean` (JVM heap usage) is a poor signal for Spark's storage memory. Spark's pools are bookkeeping abstractions that don't map 1:1 to heap usage (off-heap, deserialized vs serialized, execution vs storage), and heap fluctuates with GC. Decisions based on `MemoryMXBean` would be jittery and would conflict with `UnifiedMemoryManager`'s accounting.
- "Automatic cache eviction when memory surge detected" during cache population would cause thrashing and break user expectations of cache persistence. Spark's existing eviction (only when execution memory needs the space) was a deliberate design choice.

### What would help move this forward

To investigate, could you share:

1. The actual **executor** stack trace from the OOM (not driver).
2. The executor config you're running with (`spark.executor.memory`, `spark.executor.cores`, `spark.memory.fraction`, `spark.memory.storageFraction`).
3. The real data scale where you see the issue. The 100k-row example doesn't reproduce on a default config.
4. If possible, a heap dump (`-XX:+HeapDumpOnOutOfMemoryError`) showing what's retaining memory at OOM time.

With that, we can identify which code path is actually under pressure and propose a targeted fix. Happy to help dig in once we have a concrete reproducer.
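For reference, the storage-memory budget on the default executor config mentioned above can be estimated with a few lines of arithmetic. This is a rough sketch, not Spark's actual code: it assumes the 300 MB reserved system memory that `UnifiedMemoryManager` subtracts before applying `spark.memory.fraction`, treats the full 1 GiB as the visible heap (the JVM usually reports slightly less than `-Xmx`), and ignores off-heap memory. The function name `unified_memory_regions` is hypothetical.

```python
# Rough estimate of Spark's on-heap unified memory regions.
# Assumption: 300 MiB is reserved before spark.memory.fraction is applied.
MIB = 1024 * 1024
RESERVED_BYTES = 300 * MIB


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Return approximate sizes (in bytes) of the unified, storage and execution regions."""
    usable = heap_bytes - RESERVED_BYTES
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved system memory")
    # Unified region shared by execution and storage.
    unified = int(usable * memory_fraction)
    # Part of the unified region where cached blocks are protected from eviction.
    storage = int(unified * storage_fraction)
    return {"unified": unified, "storage": storage, "execution": unified - storage}


# Default executor from the comment: 1 GiB heap, fraction 0.6, storageFraction 0.5.
regions = unified_memory_regions(1024 * MIB)
for name, size in regions.items():
    print(f"{name}: {size / MIB:.1f} MiB")
```

Under these assumptions the storage region comes out at roughly 217 MiB of a 434 MiB unified region. The boundary is soft, so cached blocks can also borrow unused execution memory, which makes this a conservative lower bound for what `df.cache()` can hold in memory before spilling.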
