CurtHagenlocher commented on code in PR #3140:
URL: https://github.com/apache/arrow-adbc/pull/3140#discussion_r2207856121
########## csharp/src/Drivers/Databricks/DatabricksSchemaParser.cs: ##########
@@ -16,15 +16,39 @@
  */
 using System;
+using System.IO;
 using Apache.Arrow.Adbc.Drivers.Apache;
 using Apache.Arrow.Adbc.Drivers.Apache.Hive2;
 using Apache.Arrow.Types;
 using Apache.Hive.Service.Rpc.Thrift;
+using Apache.Arrow.Ipc;
 namespace Apache.Arrow.Adbc.Drivers.Databricks
 {
     internal class DatabricksSchemaParser : SchemaParser

Review Comment:
   It's become clear that we need to expose lower-level functions from the C# Arrow library to allow schemas and data to be loaded independently of each other. This is probably the best option given the current APIs, but this approach, the `ChunkStream` approach used by `DatabricksReader`, and the `ReadRowsStream` used in the BigQuery driver are all less efficient than they could be if that lower-level functionality existed. TL;DR: this is correct.
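For context on the workaround being discussed: there is no lower-level public entry point in the C# Arrow library today, so the schema bytes end up being wrapped in a reader that was built for full record-batch streams. A minimal sketch of that pattern follows, assuming `serializedSchema` holds an IPC stream whose only message is the schema and that the reader's `Schema` property decodes that message on access; the class and method names are illustrative, not part of this PR.

using System;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal static class SchemaIpcSketch
{
    // Decodes just the schema from an Arrow IPC stream by standing up a full
    // ArrowStreamReader. It works, but the reader carries record-batch state
    // that is never used here, which is the inefficiency described above.
    internal static Schema ReadSchemaOnly(ReadOnlyMemory<byte> serializedSchema)
    {
        using var reader = new ArrowStreamReader(serializedSchema);
        // Assumption: touching Schema reads the schema message without
        // requiring a record batch to be read first.
        return reader.Schema;
    }
}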
########## csharp/test/Drivers/Databricks/E2E/StatementTests.cs: ##########
@@ -1279,5 +1279,80 @@ public async Task OlderDBRVersion_ShouldSetSchemaViaUseStatement()
             Assert.True(foundSchemas.Count == 1, "Should have exactly one schema");
         }
+        [SkippableTheory]
+        [InlineData("cast(-6 as decimal(3, 1))", "-6.0", 3, 1, "Negative decimal with scale")]
+        [InlineData("cast(0 as decimal(1, 0))", "0", 1, 0, "Zero decimal")]
+        [InlineData("cast(123 as decimal(3, 0))", "123", 3, 0, "Positive integer decimal")]
+        [InlineData("cast(123456789.123456789 as decimal(18, 9))", "123456789.123456789", 18, 9, "High precision decimal")]
+        [InlineData("cast(-123456789.123456789 as decimal(18, 9))", "-123456789.123456789", 18, 9, "High precision negative decimal")]
+        public async Task CanExecuteDecimalQuery(string sqlExpression, string expectedValueString, int expectedPrecision, int expectedScale, string testDescription)
+        {
+            decimal expectedValue = decimal.Parse(expectedValueString);
+            // This tests the bug where older DBR versions return decimal values as strings when UseArrowNativeTypes is false
+            // To repro issue, run this with dbr < 10.0
+            OutputHelper?.WriteLine($"Testing: {testDescription}");
+            OutputHelper?.WriteLine($"SQL Expression: {sqlExpression}");
+            OutputHelper?.WriteLine($"Expected Value: {expectedValue}");
+            OutputHelper?.WriteLine($"Expected Precision: {expectedPrecision}, Scale: {expectedScale}");
+
+            using AdbcConnection connection = NewConnection();
+            using var statement = connection.CreateStatement();
+
+            // Use the provided SQL expression
+            statement.SqlQuery = $"SELECT {sqlExpression} as A";
+            QueryResult result = statement.ExecuteQuery();
+
+            Assert.NotNull(result.Stream);
+
+            // Verify the schema
+            var schema = result.Stream.Schema;
+            Assert.Single(schema.FieldsList);
+
+            var field = schema.GetFieldByName("A");
+            Assert.NotNull(field);
+
+            OutputHelper?.WriteLine($"Decimal field type: {field.DataType.GetType().Name}");
+            OutputHelper?.WriteLine($"Decimal field type ID: {field.DataType.TypeId}");
+
+            // Read the actual data
+            var batch = await result.Stream.ReadNextRecordBatchAsync();
+            Assert.NotNull(batch);
+            Assert.Equal(1, batch.Length);
+
+            if (field.DataType is Decimal128Type decimalType)
+            {
+                // For newer DBR versions with UseArrowNativeTypes enabled, decimal is returned as Decimal128Type
+                Assert.Equal(expectedPrecision, decimalType.Precision);
+                Assert.Equal(expectedScale, decimalType.Scale);
+
+                var col0 = batch.Column(0) as Decimal128Array;
+                Assert.NotNull(col0);
+                Assert.Equal(1, col0.Length);
+
+                var sqlDecimal = col0.GetSqlDecimal(0);
+                Assert.NotNull(sqlDecimal);
+                Assert.Equal(expectedValue, sqlDecimal.Value);
+
+                OutputHelper?.WriteLine($"Decimal value: {sqlDecimal.Value} (precision: {decimalType.Precision}, scale: {decimalType.Scale})");
+            }
+            else if (field.DataType is StringType)

Review Comment:
   I think it's a problem for a decimal table column to be returned as an Arrow string. We'll need to apply a conversion after reading the data. It's possible that when testing this scenario specifically with Power BI, things will work, because the ADBC connector itself will (at least sometimes) convert the data to match the expected schema. However, that is somewhat inefficient and is clearly not the right behavior for the driver in non-Power BI contexts. Broadly speaking, I think the right behavior here is for the driver to look at both the Thrift schema and the Arrow schema and come up with a "final" schema as well as a collection of transformers to be applied to the individual columns. So if the Thrift schema says "decimal" and the Arrow schema says "string", then the final schema should say "decimal" and there should be a function to convert the string array into a decimal array. (A rough illustration of such a transformer is sketched at the end of this message.)

########## csharp/src/Drivers/Apache/Spark/SparkStatement.cs: ##########
@@ -32,28 +32,9 @@ internal SparkStatement(SparkConnection connection)
         protected override void SetStatementProperties(TExecuteStatementReq statement)
         {

Review Comment:
   So (at least as of today) the OSS Spark implementation can't return results in Arrow format, only as Thrift?
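Returning to the decimal-as-string comment on StatementTests.cs above: the following is a minimal sketch of the kind of per-column transformer described there, under the assumption that the target precision and scale come from the Thrift schema. The class and method names are illustrative only and not part of this PR, and a production version would also need to handle decimals wider than System.Decimal can represent.

using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalColumnTransformer
{
    // Converts a StringArray (what older DBR versions return for decimal columns)
    // into a Decimal128Array matching the decimal(precision, scale) type reported
    // by the Thrift schema.
    internal static Decimal128Array StringToDecimal128(StringArray source, Decimal128Type targetType)
    {
        var builder = new Decimal128Array.Builder(targetType);
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                // Server-rendered decimals use '.' as the separator, so parse with the invariant culture.
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}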