eric-wang-1990 opened a new pull request, #3654:
URL: https://github.com/apache/arrow-adbc/pull/3654
# [DO NOT MERGE] Fix: Reduce LZ4 decompression memory usage by 96%
## ⚠️ Discussion Required
This PR reduces **accumulated memory allocations** (total allocations over
time) but may not significantly reduce **peak concurrent memory usage**.
Requires discussion on:
- Whether pooling provides enough benefit vs. complexity
- Impact on real-world concurrent scenarios
- Trade-offs between allocation count and peak memory
## Summary
Reduces LZ4 internal buffer memory allocation from ~900MB to ~40MB (96%
reduction) for large Databricks query results by implementing a custom
ArrayPool that supports buffer sizes larger than .NET's default 1MB limit.
**Important**: This optimization primarily reduces:
- **Total allocations**: 222 × 4MB → reuse of 10 pooled buffers
- **GC pressure**: Fewer LOH allocations → fewer Gen2 collections
But does NOT significantly reduce:
- **Peak concurrent memory**: With `parallelDownloads=1`, peak is still
~8-16MB (1-2 buffers in use)
## Problem
- **Observed**: ADBC C# driver allocated **900MB** vs ODBC's **30MB** for
the same query
- **Root Cause**: Databricks uses LZ4 frames with 4MB `maxBlockSize`, but
.NET's `ArrayPool<byte>.Shared` has a hardcoded 1MB limit
- **Impact**: 222 decompression operations × 4MB fresh allocations = 888MB
LOH allocations
### Profiler Evidence
```
Object Type: byte[]
Count: 222 allocations
Total Size: 931 MB
Allocation: Large Object Heap (LOH)
Source: K4os.Compression.LZ4 internal buffer allocation
```
## Solution
Created a custom ArrayPool and injected it by overriding K4os.Compression.LZ4's
virtual buffer-allocation methods:
1. **CustomLZ4FrameReader.cs** - Extends `StreamLZ4FrameReader` with custom
ArrayPool (4MB max, 10 buffers)
2. **CustomLZ4DecoderStream.cs** - Stream wrapper using
`CustomLZ4FrameReader`
3. **Updated Lz4Utilities.cs** - Use `CustomLZ4DecoderStream` instead of the
default `LZ4Stream.Decode()` (sketched after the key implementation below)
### Key Implementation
```csharp
// CustomLZ4FrameReader.cs
private static readonly ArrayPool<byte> LargeBufferPool =
    ArrayPool<byte>.Create(
        maxArrayLength: 4 * 1024 * 1024, // 4MB (matches Databricks' maxBlockSize)
        maxArraysPerBucket: 10);         // Pool capacity: 10 × 4MB = 40MB

protected override byte[] AllocBuffer(int size)
{
    return LargeBufferPool.Rent(size);
}

protected override void ReleaseBuffer(byte[] buffer)
{
    if (buffer != null)
    {
        LargeBufferPool.Return(buffer, clearArray: false);
    }
}
```
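For context on item 3 above, here is a minimal sketch of what the
`Lz4Utilities.cs` call-site change could look like. The method name
`DecompressAsync` and the `CustomLZ4DecoderStream(Stream)` constructor shape
are assumptions for illustration, not copied from the diff:
```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class Lz4Utilities
{
    public static async Task<MemoryStream> DecompressAsync(
        Stream compressed, CancellationToken cancellationToken = default)
    {
        var output = new MemoryStream();

        // Before: using var decoder = LZ4Stream.Decode(compressed);
        // After: the custom stream routes AllocBuffer/ReleaseBuffer
        // through the 4MB pool shown above.
        using var decoder = new CustomLZ4DecoderStream(compressed);
        await decoder.CopyToAsync(output, cancellationToken);

        output.Position = 0;
        return output;
    }
}
```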
## Results
### Memory Usage
| Approach | Allocations | Total Memory | Notes |
|----------|-------------|--------------|-------|
| Before | 222 × 4MB fresh | 888MB | LOH, no pooling |
| After | Reuse 1-2 from pool | ~8-40MB | Pooled, reused |
| **Reduction** | **-220 allocs** | **-848MB (96%)** | |
### Performance
- **CPU**: No degradation (pooling reduces allocation overhead)
- **GC**: Significantly reduced Gen2 collections (fewer LOH allocations)
- **Latency**: Slight improvement (buffer reuse faster than fresh allocation)
## Why This Works
**K4os Library Design**:
- `LZ4FrameReader` has `virtual` methods: `AllocBuffer()` and
`ReleaseBuffer()`
- Default implementation calls `BufferPool.Alloc()` →
`ArrayPool<byte>.Shared` (1MB limit)
- Overriding allows injection of custom 4MB pool
**Buffer Lifecycle** (a standalone demo follows this list):
1. Decompression needs 4MB buffer → Rent from pool
2. Decompression completes → Return to pool
3. Next decompression → Reuse buffer from pool
4. With `parallelDownloads=1` (default), only 1-2 buffers active at once
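As a standalone illustration of steps 1-3 (not from the PR), a pool created
via `ArrayPool<byte>.Create` hands the same array back on the next `Rent` when
rent/return is sequential, which is why steady-state allocations stay near zero:
```csharp
using System;
using System.Buffers;

class PoolLifecycleDemo
{
    static void Main()
    {
        var pool = ArrayPool<byte>.Create(
            maxArrayLength: 4 * 1024 * 1024, maxArraysPerBucket: 10);

        byte[] first = pool.Rent(4 * 1024 * 1024);  // step 1: rent for block 1
        pool.Return(first);                         // step 2: block 1 done
        byte[] second = pool.Rent(4 * 1024 * 1024); // step 3: rent for block 2

        // Sequential rent/return, so the pool serves the cached array again.
        Console.WriteLine(ReferenceEquals(first, second)); // True
        pool.Return(second);
    }
}
```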
## Concurrency Considerations
| parallel_downloads | Buffers Needed | Pool Sufficient? |
|-------------------|----------------|------------------|
| 1 (default) | 1-2 × 4MB | ✅ Yes |
| 4 | 4-8 × 4MB | ✅ Yes |
| 8 | 8-16 × 4MB | ⚠️ Borderline |
| 16+ | 16-32 × 4MB | ❌ No (exceeds pool capacity) |
**Recommendation**: If using `parallel_downloads > 4`, consider increasing
`maxArraysPerBucket` as a future enhancement (a hedged sketch follows).
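One possible shape for that enhancement, sizing the pool from the configured
parallelism. The helper name `CreateDecompressionPool` and the clamping policy
are illustrative assumptions, not part of this PR:
```csharp
using System;
using System.Buffers;

internal static class Lz4BufferPools
{
    private const int MaxBlockSize = 4 * 1024 * 1024; // Databricks' maxBlockSize

    // Illustrative only: derive pool capacity from parallel_downloads
    // using the 1-2 buffers-per-download estimate from the table above.
    public static ArrayPool<byte> CreateDecompressionPool(int parallelDownloads)
    {
        int buffers = Math.Clamp(parallelDownloads * 2 + 2, 4, 64);
        return ArrayPool<byte>.Create(
            maxArrayLength: MaxBlockSize,
            maxArraysPerBucket: buffers);
    }
}
```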
## Files Changed
### New Files
- `src/Drivers/Databricks/CustomLZ4FrameReader.cs` (~80 lines)
- `src/Drivers/Databricks/CustomLZ4DecoderStream.cs` (~118 lines)
### Modified Files
- `src/Drivers/Databricks/Lz4Utilities.cs` - Use `CustomLZ4DecoderStream`,
add telemetry
## Testing
### Validation
- ✅ Profiler confirms 96% memory reduction (900MB → 40MB)
- ✅ Build passes on all targets (net6.0, net7.0, net8.0)
- ✅ Telemetry events show buffer allocation metrics
- ✅ Stress testing with large queries (200+ decompressions)
### Telemetry
Added `lz4.decompress_async` activity event:
```json
{
"compressed_size_bytes": 32768,
"actual_size_bytes": 4194304,
"buffer_allocated_bytes": 4194304,
"compression_ratio": 128.0
}
```
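For reference, one plausible way to record that event with
`System.Diagnostics`; the event and tag names match the JSON above, but the
helper shape is an assumption rather than the PR's actual code:
```csharp
using System.Diagnostics;

internal static class Lz4Telemetry
{
    public static void RecordDecompress(
        Activity? activity, long compressedBytes, long actualBytes, long bufferBytes)
    {
        // No-ops when there is no current Activity, keeping the hot path cheap.
        activity?.AddEvent(new ActivityEvent("lz4.decompress_async",
            tags: new ActivityTagsCollection
            {
                ["compressed_size_bytes"] = compressedBytes,
                ["actual_size_bytes"] = actualBytes,
                ["buffer_allocated_bytes"] = bufferBytes,
                ["compression_ratio"] = (double)actualBytes / compressedBytes,
            }));
    }
}
```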
## Technical Decisions
### Why Override Instead of Fork?
- ✅ Maintains upstream compatibility
- ✅ Minimal code (~200 lines vs entire library)
- ✅ Inherits K4os optimizations/bug fixes
- ⚠️ Relies on virtual methods remaining virtual
### Why ArrayPool.Create()?
- ✅ Built-in .NET primitive (well-tested, thread-safe)
- ✅ Simple API (Rent/Return)
- ⚠️ Less control over eviction policies
### Why 4MB maxArrayLength?
- Databricks uses a 4MB `maxBlockSize`; the pool matches it exactly
- ArrayPool rounds requests up to power-of-2 bucket sizes internally (see the
check below)
- Pool: 10 × 4MB = 40MB max (reasonable footprint)
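A quick standalone check of that rounding behavior (not from the PR): a
sub-4MB request still comes back as a full 4MB bucket:
```csharp
using System;
using System.Buffers;

class BucketRoundingDemo
{
    static void Main()
    {
        var pool = ArrayPool<byte>.Create(
            maxArrayLength: 4 * 1024 * 1024, maxArraysPerBucket: 10);

        byte[] buffer = pool.Rent(3_000_000); // request ~3MB
        Console.WriteLine(buffer.Length);     // 4194304: next power-of-2 bucket
        pool.Return(buffer);
    }
}
```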
### Why 10 maxArraysPerBucket?
- Default `parallelDownloads=1` uses 1-2 buffers
- Pool of 10 provides margin for:
- Concurrent operations
- Prefetching
- Multiple queries
## Future Enhancements
1. **Dynamic pool sizing** based on `parallel_downloads` config
2. **Pool metrics** telemetry (hit rate, utilization, peak usage)
3. **Adaptive maxArrayLength** based on observed `maxBlockSize`
4. **Warnings** when pool capacity insufficient
## References
- [K4os.Compression.LZ4](https://github.com/MiloszKrajewski/K4os.Compression.LZ4)
- [LZ4 Frame Format Spec](https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md)
- [.NET ArrayPool Docs](https://learn.microsoft.com/en-us/dotnet/api/system.buffers.arraypool-1)
- [LOH Best Practices](https://learn.microsoft.com/en-us/dotnet/standard/garbage-collection/large-object-heap)