nellaivijay opened a new issue, #55609:
URL: https://github.com/apache/spark/issues/55609

   
   Description:
   
   Problem Description:
   
   When caching large DataFrames in Spark, the block fetching mechanism can 
cause excessive memory pressure, leading to:
   
     • Out-of-memory (OOM) errors in production workloads
     • Performance degradation due to memory thrashing
     • Unpredictable memory usage patterns during cache operations
     • Unnecessary disk spilling even when data could fit in memory
   
   Expected Behavior: Spark should monitor memory pressure during block 
fetching operations and adaptively adjust fetching behavior to prevent 
excessive memory pressure while maintaining
   cache performance.
   
   Actual Behavior: Spark's current block fetching mechanism loads cache blocks 
without monitoring memory pressure. When caching large datasets:
   
     • Spark attempts to fetch many blocks simultaneously
     • Memory usage spikes rapidly during block fetching
     • No adaptive mechanism to slow down or batch the fetching
     • Results in memory pressure that can crash applications
   
   Reproduction Steps:
   
   // Create a large DataFrame
   val data = (1 until 100000).map(i => (i, f"Data_{i}", "A" * 1000))
   val df = spark.createDataFrame(data).toDF("id", "data", "description")
   
   // Cache the DataFrame - causes memory pressure
   df.cache()
   df.count() // Materialize the cache
   
   // Perform operations on cached data
   df.filter($"id" > 50000).count()
   
   Environment:
   
     • Spark Version: 3.5.0+
     • Component: BlockManager / Memory Management
     • Deployment: Standalone, YARN, Kubernetes
     • Impact: All environments where large DataFrames are cached
   
   Observed Behavior:
   
     • Memory usage spikes during cache materialization
     • No adaptive mechanism to control memory pressure during block fetching
     • System may run out of memory for large cached datasets
     • Performance degradation due to memory pressure
   
   Proposed Solution: Implement memory-aware progressive block fetching with:
   
     1. Real-time memory pressure monitoring using JVM MemoryManagement APIs
     2. Adaptive batch sizing based on current memory conditions
     3. Progressive loading of cache blocks in manageable batches
     4. Automatic cache eviction when memory surge detected
     5. Configuration options for fine-grained control
   
   Configuration Options:
   
   spark.memory.awareBlockFetching.enabled true
   spark.memory.awareBlockFetching.initialBatchSize 1000
   spark.memory.awareBlockFetching.pressureThreshold 0.7
   spark.memory.awareBlockFetching.autoEviction.enabled true
   
   Benefits:
   
     • Reduces risk of OOM errors during cache operations
     • Provides predictable memory usage patterns
     • Enables graceful degradation under memory pressure
     • Maintains cache performance with adaptive behavior
     • Fully backward compatible (opt-in feature)
   
   Additional Context: This is particularly problematic for:
   
     • Big data applications processing large datasets
     • Machine learning workloads with cached training data
     • ETL pipelines with intermediate caching
     • Any Spark job that caches large DataFrames
   
   The solution uses standard JVM MemoryManagement APIs (no external 
dependencies) and follows Spark's coding standards. It provides a 
production-ready implementation with comprehensive
   testing and configuration options.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to