This is an automated email from the ASF dual-hosted git repository.
curth pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-adbc.git
The following commit(s) were added to refs/heads/main by this push:
new 8a861bb72 feat(csharp/src/Drivers/Databricks): consolidate LZ4 decompression logic and improve resource disposal (#3649)
8a861bb72 is described below
commit 8a861bb7251084691bb098e9d5526c6c755da118
Author: eric-wang-1990 <[email protected]>
AuthorDate: Wed Oct 29 13:42:23 2025 -0700
feat(csharp/src/Drivers/Databricks): consolidate LZ4 decompression logic and improve resource disposal (#3649)
## Summary
This PR consolidates LZ4 decompression code paths and ensures proper
resource cleanup across both CloudFetch and non-CloudFetch readers in
the Databricks C# driver.
## Changes
- **Lz4Utilities.cs**
- Add configurable buffer size parameter to `DecompressLz4()` (see the
caller sketch after this list)
- Add async variant `DecompressLz4Async()` for CloudFetch pipeline
- Add proper `using` statements for MemoryStream disposal
- Add default buffer size constant (80KB)
- **CloudFetchDownloader.cs**
- Update to use shared `Lz4Utilities.DecompressLz4Async()`
- Reduce code duplication (~12 lines consolidated)
- Improve telemetry with compression ratio calculation
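For orientation, here is a minimal caller-side sketch of the consolidated
synchronous path. It assumes code living inside the driver assembly where
the internal `Lz4Utilities` class is visible; the wrapper class and method
names are hypothetical, and only the `DecompressLz4` calls come from this
change (see the patch below):

```csharp
using System;
using Apache.Arrow.Adbc.Drivers.Databricks;

internal static class Lz4SyncUsageSketch
{
    // Hypothetical helper showing the non-CloudFetch (synchronous) path.
    internal static ReadOnlyMemory<byte> DecompressBatch(byte[] compressedBatch)
    {
        // The parameterless overload uses the default 80KB (81920-byte) copy
        // buffer; an explicit size can be passed via the second overload,
        // e.g. Lz4Utilities.DecompressLz4(compressedBatch, 131072).
        return Lz4Utilities.DecompressLz4(compressedBatch);
    }
}
```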
## Benefits
- **Code Quality**: Both code paths now share the same decompression
implementation, reducing duplication
- **Resource Management**: Explicit MemoryStream disposal improves
memory hygiene (though GC would handle cleanup)
- **Maintainability**: Single source of truth for LZ4 decompression
logic
- **Consistency**: Same error handling and telemetry patterns across
both paths
## Technical Details
- Default buffer size remains 80KB (81920 bytes) - no behavioral changes
- Async version returns `(byte[] buffer, int length)` tuple for
efficient MemoryStream wrapping in CloudFetch (see the usage sketch
after this list)
- Buffer remains valid after MemoryStream disposal because the returned
array reference keeps it alive; disposing the stream does not free its
backing buffer
- Maintains cancellation token support in async path
- No performance impact - purely refactoring and cleanup
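As a usage illustration, a hedged sketch of how the CloudFetch path consumes
the async variant and wraps the returned tuple in a read-only MemoryStream,
mirroring the CloudFetchDownloader change in the diff below (the surrounding
class and method are hypothetical):

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Apache.Arrow.Adbc.Drivers.Databricks;

internal static class CloudFetchUsageSketch
{
    // Hypothetical helper mirroring the CloudFetchDownloader change:
    // decompress a downloaded file and expose it as a read-only stream
    // without copying the decompressed bytes.
    internal static async Task<MemoryStream> DecompressToStreamAsync(
        byte[] fileData, CancellationToken cancellationToken)
    {
        // The async variant returns the internal buffer plus its valid
        // length; the array reference handed back to the caller keeps the
        // buffer alive after the utility's temporary stream is disposed.
        var (buffer, length) = await Lz4Utilities.DecompressLz4Async(
            fileData, cancellationToken).ConfigureAwait(false);

        // Wrap the buffer directly: writable false because the data is not
        // modified, publiclyVisible true so GetBuffer() stays accessible.
        return new MemoryStream(buffer, 0, length, writable: false, publiclyVisible: true);
    }
}
```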
## Testing
- Existing unit tests pass
- No functional changes to decompression logic
- Telemetry output remains consistent
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <[email protected]>
---
csharp/src/Drivers/Databricks/Lz4Utilities.cs | 90 ++++++++++++++++++++--
.../Reader/CloudFetch/CloudFetchDownloader.cs | 19 +++--
2 files changed, 94 insertions(+), 15 deletions(-)
diff --git a/csharp/src/Drivers/Databricks/Lz4Utilities.cs b/csharp/src/Drivers/Databricks/Lz4Utilities.cs
index 890f0b77b..127a890c3 100644
--- a/csharp/src/Drivers/Databricks/Lz4Utilities.cs
+++ b/csharp/src/Drivers/Databricks/Lz4Utilities.cs
@@ -17,6 +17,8 @@
using System;
using System.IO;
+using System.Threading;
+using System.Threading.Tasks;
using K4os.Compression.LZ4.Streams;
namespace Apache.Arrow.Adbc.Drivers.Databricks
@@ -26,6 +28,11 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks
/// </summary>
internal static class Lz4Utilities
{
+ /// <summary>
+ /// Default buffer size for LZ4 decompression operations (80KB).
+ /// </summary>
+ private const int DefaultBufferSize = 81920;
+
/// <summary>
/// Decompresses LZ4 compressed data into memory.
/// </summary>
@@ -33,19 +40,86 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks
/// <returns>A ReadOnlyMemory containing the decompressed data.</returns>
/// <exception cref="AdbcException">Thrown when decompression fails.</exception>
public static ReadOnlyMemory<byte> DecompressLz4(byte[] compressedData)
+ {
+ return DecompressLz4(compressedData, DefaultBufferSize);
+ }
+
+ /// <summary>
+ /// Decompresses LZ4 compressed data into memory with a specified buffer size.
+ /// </summary>
+ /// <param name="compressedData">The compressed data bytes.</param>
+ /// <param name="bufferSize">The buffer size to use for decompression operations.</param>
+ /// <returns>A ReadOnlyMemory containing the decompressed data.</returns>
+ /// <exception cref="AdbcException">Thrown when decompression fails.</exception>
+ public static ReadOnlyMemory<byte> DecompressLz4(byte[] compressedData, int bufferSize)
+ {
+ try
+ {
+ using (var outputStream = new MemoryStream())
+ {
+ using (var inputStream = new MemoryStream(compressedData))
+ using (var decompressor = LZ4Stream.Decode(inputStream))
+ {
+ decompressor.CopyTo(outputStream, bufferSize);
+ }
+
+ // Get the underlying buffer and its valid length without copying
+ // The buffer remains valid after MemoryStream disposal since we hold a reference to it
+ byte[] buffer = outputStream.GetBuffer();
+ return new ReadOnlyMemory<byte>(buffer, 0, (int)outputStream.Length);
+ }
+ }
+ catch (Exception ex)
+ {
+ throw new AdbcException($"Failed to decompress LZ4 data: {ex.Message}", ex);
+ }
+ }
+
+ /// <summary>
+ /// Asynchronously decompresses LZ4 compressed data into memory.
+ /// Returns the buffer and length as a tuple for efficient wrapping in a MemoryStream.
+ /// </summary>
+ /// <param name="compressedData">The compressed data bytes.</param>
+ /// <param name="cancellationToken">Cancellation token for the async operation.</param>
+ /// <returns>A tuple containing the decompressed buffer and its valid length.</returns>
+ /// <exception cref="AdbcException">Thrown when decompression fails.</exception>
+ public static Task<(byte[] buffer, int length)> DecompressLz4Async(
+ byte[] compressedData,
+ CancellationToken cancellationToken = default)
+ {
+ return DecompressLz4Async(compressedData, DefaultBufferSize, cancellationToken);
+ }
+
+ /// <summary>
+ /// Asynchronously decompresses LZ4 compressed data into memory with a specified buffer size.
+ /// Returns the buffer and length as a tuple for efficient wrapping in a MemoryStream.
+ /// </summary>
+ /// <param name="compressedData">The compressed data bytes.</param>
+ /// <param name="bufferSize">The buffer size to use for decompression operations.</param>
+ /// <param name="cancellationToken">Cancellation token for the async operation.</param>
+ /// <returns>A tuple containing the decompressed buffer and its valid length.</returns>
+ /// <exception cref="AdbcException">Thrown when decompression fails.</exception>
+ public static async Task<(byte[] buffer, int length)> DecompressLz4Async(
+ byte[] compressedData,
+ int bufferSize,
+ CancellationToken cancellationToken = default)
{
try
{
- var outputStream = new MemoryStream();
- using (var inputStream = new MemoryStream(compressedData))
- using (var decompressor = LZ4Stream.Decode(inputStream))
+ using (var outputStream = new MemoryStream())
{
- decompressor.CopyTo(outputStream);
+ using (var inputStream = new MemoryStream(compressedData))
+ using (var decompressor = LZ4Stream.Decode(inputStream))
+ {
+ await decompressor.CopyToAsync(outputStream, bufferSize, cancellationToken).ConfigureAwait(false);
+ }
+
+ // Get the underlying buffer and its valid length without copying
+ // The buffer remains valid after MemoryStream disposal since we hold a reference to it
+ byte[] buffer = outputStream.GetBuffer();
+ int length = (int)outputStream.Length;
+ return (buffer, length);
}
- // Get the underlying buffer and its valid length without copying
- return new ReadOnlyMemory<byte>(outputStream.GetBuffer(), 0, (int)outputStream.Length);
- // Note: We're not disposing the outputStream here because we're returning its buffer.
- // The memory will be reclaimed when the ReadOnlyMemory is no longer referenced.
}
catch (Exception ex)
{
diff --git a/csharp/src/Drivers/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs b/csharp/src/Drivers/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs
index f4f3e5d01..8fee61d92 100644
--- a/csharp/src/Drivers/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs
+++ b/csharp/src/Drivers/Databricks/Reader/CloudFetch/CloudFetchDownloader.cs
@@ -492,15 +492,20 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks.Reader.CloudFetch
try
{
var decompressStopwatch = Stopwatch.StartNew();
- dataStream = new MemoryStream();
- using (var inputStream = new MemoryStream(fileData))
- using (var decompressor = LZ4Stream.Decode(inputStream))
- {
- await decompressor.CopyToAsync(dataStream, 81920, cancellationToken).ConfigureAwait(false);
- }
+
+ // Use shared Lz4Utilities for decompression (consolidates logic with non-CloudFetch path)
+ var (buffer, length) = await Lz4Utilities.DecompressLz4Async(
+ fileData,
+ cancellationToken).ConfigureAwait(false);
+
+ // Create the dataStream from the decompressed buffer
+ dataStream = new MemoryStream(buffer, 0, length, writable: false, publiclyVisible: true);
dataStream.Position = 0;
decompressStopwatch.Stop();
+ // Calculate throughput metrics
+ double compressionRatio = (double)dataStream.Length / actualSize;
+
activity?.AddEvent("cloudfetch.decompression_complete", [
new("offset", downloadResult.Link.StartRowOffset),
new("sanitized_url", sanitizedUrl),
@@ -509,7 +514,7 @@ namespace Apache.Arrow.Adbc.Drivers.Databricks.Reader.CloudFetch
new("compressed_size_kb", actualSize / 1024.0),
new("decompressed_size_bytes", dataStream.Length),
new("decompressed_size_kb", dataStream.Length /
1024.0),
- new("compression_ratio", (double)dataStream.Length
/ actualSize)
+ new("compression_ratio", compressionRatio)
]);
actualSize = dataStream.Length;