CurtHagenlocher commented on code in PR #3140:
URL: https://github.com/apache/arrow-adbc/pull/3140#discussion_r2207856121
########## csharp/src/Drivers/Databricks/DatabricksSchemaParser.cs: ##########
@@ -16,15 +16,39 @@
  */
 using System;
+using System.IO;
 using Apache.Arrow.Adbc.Drivers.Apache;
 using Apache.Arrow.Adbc.Drivers.Apache.Hive2;
 using Apache.Arrow.Types;
 using Apache.Hive.Service.Rpc.Thrift;
+using Apache.Arrow.Ipc;
 namespace Apache.Arrow.Adbc.Drivers.Databricks
 {
     internal class DatabricksSchemaParser : SchemaParser

Review Comment:
   It's become clear that we need to expose lower-level functions from the C# Arrow library to allow schemas and data to be loaded independently of each other. This is probably the best option given the current APIs, but this approach, the `ChunkStream` approach used by `DatabricksReader`, and the `ReadRowsStream` used in the BigQuery driver are all less efficient than they could be if that lower-level functionality existed. TL;DR: this is correct.
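For context on the workaround being discussed: there is no lower-level public entry point in the C# Arrow library today, so the schema bytes end up being wrapped in a reader that was built for full record-batch streams. A minimal sketch of that pattern follows, assuming `serializedSchema` holds an IPC stream whose only message is the schema and that the reader's `Schema` property decodes that message on access; the class and method names are illustrative, not part of this PR.

using System;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal static class SchemaIpcSketch
{
    // Decodes just the schema from an Arrow IPC stream by standing up a full
    // ArrowStreamReader. It works, but the reader carries record-batch state
    // that is never used here, which is the inefficiency described above.
    internal static Schema ReadSchemaOnly(ReadOnlyMemory<byte> serializedSchema)
    {
        using var reader = new ArrowStreamReader(serializedSchema);
        // Assumption: touching Schema reads the schema message without
        // requiring a record batch to be read first.
        return reader.Schema;
    }
}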
########## csharp/test/Drivers/Databricks/E2E/StatementTests.cs: ##########
@@ -1279,5 +1279,80 @@ public async Task OlderDBRVersion_ShouldSetSchemaViaUseStatement()
             Assert.True(foundSchemas.Count == 1, "Should have exactly one schema");
         }
+        [SkippableTheory]
+        [InlineData("cast(-6 as decimal(3, 1))", "-6.0", 3, 1, "Negative decimal with scale")]
+        [InlineData("cast(0 as decimal(1, 0))", "0", 1, 0, "Zero decimal")]
+        [InlineData("cast(123 as decimal(3, 0))", "123", 3, 0, "Positive integer decimal")]
+        [InlineData("cast(123456789.123456789 as decimal(18, 9))", "123456789.123456789", 18, 9, "High precision decimal")]
+        [InlineData("cast(-123456789.123456789 as decimal(18, 9))", "-123456789.123456789", 18, 9, "High precision negative decimal")]
+        public async Task CanExecuteDecimalQuery(string sqlExpression, string expectedValueString, int expectedPrecision, int expectedScale, string testDescription)
+        {
+            decimal expectedValue = decimal.Parse(expectedValueString);
+            // This tests the bug where older DBR versions return decimal values as strings when UseArrowNativeTypes is false
+            // To repro issue, run this with dbr < 10.0
+            OutputHelper?.WriteLine($"Testing: {testDescription}");
+            OutputHelper?.WriteLine($"SQL Expression: {sqlExpression}");
+            OutputHelper?.WriteLine($"Expected Value: {expectedValue}");
+            OutputHelper?.WriteLine($"Expected Precision: {expectedPrecision}, Scale: {expectedScale}");
+
+            using AdbcConnection connection = NewConnection();
+            using var statement = connection.CreateStatement();
+
+            // Use the provided SQL expression
+            statement.SqlQuery = $"SELECT {sqlExpression} as A";
+            QueryResult result = statement.ExecuteQuery();
+
+            Assert.NotNull(result.Stream);
+
+            // Verify the schema
+            var schema = result.Stream.Schema;
+            Assert.Single(schema.FieldsList);
+
+            var field = schema.GetFieldByName("A");
+            Assert.NotNull(field);
+
+            OutputHelper?.WriteLine($"Decimal field type: {field.DataType.GetType().Name}");
+            OutputHelper?.WriteLine($"Decimal field type ID: {field.DataType.TypeId}");
+
+            // Read the actual data
+            var batch = await result.Stream.ReadNextRecordBatchAsync();
+            Assert.NotNull(batch);
+            Assert.Equal(1, batch.Length);
+
+            if (field.DataType is Decimal128Type decimalType)
+            {
+                // For newer DBR versions with UseArrowNativeTypes enabled, decimal is returned as Decimal128Type
+                Assert.Equal(expectedPrecision, decimalType.Precision);
+                Assert.Equal(expectedScale, decimalType.Scale);
+
+                var col0 = batch.Column(0) as Decimal128Array;
+                Assert.NotNull(col0);
+                Assert.Equal(1, col0.Length);
+
+                var sqlDecimal = col0.GetSqlDecimal(0);
+                Assert.NotNull(sqlDecimal);
+                Assert.Equal(expectedValue, sqlDecimal.Value);
+
+                OutputHelper?.WriteLine($"Decimal value: {sqlDecimal.Value} (precision: {decimalType.Precision}, scale: {decimalType.Scale})");
+            }
+            else if (field.DataType is StringType)

Review Comment:
   I think it's a problem for a decimal table column to be returned as an Arrow string. We'll need to apply a conversion after reading the data. It's possible that when testing this scenario specifically with Power BI, things will work, because the ADBC connector itself will (at least sometimes) convert the data to match the expected schema. However, that is somewhat inefficient and is clearly not the right behavior for the driver in non-Power BI contexts. Broadly speaking, I think the right behavior here is for the driver to look at both the Thrift schema and the Arrow schema and come up with a "final" schema as well as a collection of transformers to be applied to the individual columns. So if the Thrift schema says "decimal" and the Arrow schema says "string", then the final schema should say "decimal" and there should be a function to convert the string array into a decimal array. (A rough illustration of such a transformer is sketched at the end of this message.)

########## csharp/src/Drivers/Apache/Spark/SparkStatement.cs: ##########
@@ -32,28 +32,9 @@ internal SparkStatement(SparkConnection connection)
         protected override void SetStatementProperties(TExecuteStatementReq statement)
         {

Review Comment:
   So (at least as of today) the OSS Spark implementation can't return results in Arrow format, only as Thrift?
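Returning to the decimal-as-string comment on StatementTests.cs above: the following is a minimal sketch of the kind of per-column transformer described there, under the assumption that the target precision and scale come from the Thrift schema. The class and method names are illustrative only and not part of this PR, and a production version would also need to handle decimals wider than System.Decimal can represent.

using System.Globalization;
using Apache.Arrow;
using Apache.Arrow.Types;

internal static class DecimalColumnTransformer
{
    // Converts a StringArray (what older DBR versions return for decimal columns)
    // into a Decimal128Array matching the decimal(precision, scale) type reported
    // by the Thrift schema.
    internal static Decimal128Array StringToDecimal128(StringArray source, Decimal128Type targetType)
    {
        var builder = new Decimal128Array.Builder(targetType);
        for (int i = 0; i < source.Length; i++)
        {
            if (source.IsNull(i))
            {
                builder.AppendNull();
            }
            else
            {
                // Server-rendered decimals use '.' as the separator, so parse with the invariant culture.
                builder.Append(decimal.Parse(source.GetString(i), CultureInfo.InvariantCulture));
            }
        }
        return builder.Build();
    }
}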