eric-wang-1990 commented on code in PR #2669:
URL: https://github.com/apache/arrow-adbc/pull/2669#discussion_r2027735213
##########
csharp/src/Drivers/Apache/Spark/SparkDatabricksReader.cs:
##########
@@ -79,6 +91,49 @@ public SparkDatabricksReader(HiveServer2Statement statement, Schema schema)
         }
     }

+    private async Task ProcessFetchedBatchesAsync(CancellationToken cancellationToken)
+    {
+        var batch = this.batches![this.index];
+
+        // Ensure batch data exists
+        if (batch.Batch == null || batch.Batch.Length == 0)
+        {
+            this.index++;
+            return;
+        }
+
+        try
+        {
+            byte[] dataToUse = batch.Batch;
+
+            // If LZ4 compression is enabled, try to decompress the data
+            if (isLz4Compressed)
+            {
+                try
+                {
+                    var dataStream = await Lz4Utilities.DecompressLz4Async(batch.Batch, cancellationToken);
+                    dataToUse = dataStream.ToArray();
+                    dataStream.Dispose();
+                }
+                catch (Exception ex)
+                {
+                    // If decompression fails, use the original data
+                    System.Diagnostics.Debug.WriteLine($"Failed to decompress LZ4 data: {ex.Message}");
+                }
+            }
+
+            // Always use ChunkStream which ensures proper schema handling
+            this.reader = new ArrowStreamReader(new ChunkStream(this.schema, dataToUse));
+        }
+        catch (Exception ex)
+        {
+            // Log any errors and skip this batch
+            System.Diagnostics.Debug.WriteLine($"Error processing batch: {ex.Message}");

Review Comment:
   Yeah, I was planning to throw; somehow this part got omitted. We do not want partial data and should definitely throw here.
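A minimal sketch of the rework the review comment describes: instead of logging and falling back to (possibly corrupt) data, both catch paths propagate the failure so the reader never yields partial results. The exception type (`InvalidOperationException`) is an illustrative assumption, not the actual type used in the PR; field and helper names (`batches`, `isLz4Compressed`, `Lz4Utilities.DecompressLz4Async`, `ChunkStream`) are taken from the diff above.

```csharp
private async Task ProcessFetchedBatchesAsync(CancellationToken cancellationToken)
{
    var batch = this.batches![this.index];

    // A genuinely empty batch is not an error; skip it.
    if (batch.Batch == null || batch.Batch.Length == 0)
    {
        this.index++;
        return;
    }

    byte[] dataToUse = batch.Batch;

    if (isLz4Compressed)
    {
        try
        {
            using var dataStream = await Lz4Utilities.DecompressLz4Async(batch.Batch, cancellationToken);
            dataToUse = dataStream.ToArray();
        }
        catch (Exception ex)
        {
            // Corrupt compressed data: surface the failure rather than
            // silently reading the undecompressed bytes as Arrow data.
            throw new InvalidOperationException("Failed to decompress LZ4 batch", ex);
        }
    }

    // ChunkStream prepends the schema so ArrowStreamReader can parse the bytes.
    this.reader = new ArrowStreamReader(new ChunkStream(this.schema, dataToUse));
}
```

With no outer catch-and-continue, any failure while constructing the reader also propagates to the caller, which matches the "definitely throw here" intent.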