CurtHagenlocher commented on code in PR #2018:
URL: https://github.com/apache/arrow-adbc/pull/2018#discussion_r1742365974


##########
csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TBinaryColumn.cs:
##########
@@ -109,8 +110,12 @@ public TBinaryColumn DeepCopy()
                     await transport.ReadExactlyAsync(tmp.AsMemory(0, size), cancellationToken);
                     values.Append(tmp.AsMemory(0, size).Span);
                   }
+#if NET6_0_OR_GREATER

Review Comment:
   Why does this require .NET 6.0 or greater?



##########
csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TBinaryColumn.cs:
##########
@@ -83,14 +84,14 @@ public TBinaryColumn DeepCopy()
 
                   values = new ArrowBuffer.Builder<byte>();
                   int offset = 0;
-                  offsetBuffer = new byte[(length + 1) * 4];
+                  offsetBuffer = new byte[(length + 1) * IntSize];
                   var memory = offsetBuffer.AsMemory();
                   var typedMemory = Unsafe.As<Memory<byte>, Memory<int>>(ref memory).Slice(0, length + 1);
 
                   for(int _i197 = 0; _i197 < length; ++_i197)
                   {
                     //typedMemory.Span[_i197] = offset;
-                    StreamExtensions.WriteInt32LittleEndian(offset, memory.Span, _i197 * 4);
+                    StreamExtensions.WriteInt32LittleEndian(offset, memory.Span, _i197 * IntSize);

Review Comment:
   For .NET 6+ we could use e.g. `BinaryPrimitives.WriteInt32LittleEndian`, which I suspect would be an efficiency gain. This could be done as a follow-up and/or performance-tested.
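   
   A minimal sketch (not the PR's actual code) of what that substitution might look like; the `IntSize` constant and the offset-buffer layout are assumptions based on the snippet above:
   
   ```csharp
   using System;
   using System.Buffers.Binary;
   
   static class OffsetBufferSketch
   {
       // Assumed to mirror the PR's IntSize constant.
       private const int IntSize = sizeof(int);
   
       // Builds the little-endian int32 offset buffer for a list of element lengths,
       // using BinaryPrimitives instead of the StreamExtensions helper.
       public static byte[] BuildOffsets(int[] lengths)
       {
           byte[] offsetBuffer = new byte[(lengths.Length + 1) * IntSize];
           int offset = 0;
           for (int i = 0; i < lengths.Length; i++)
           {
               BinaryPrimitives.WriteInt32LittleEndian(offsetBuffer.AsSpan(i * IntSize), offset);
               offset += lengths[i];
           }
           // The final offset marks the end of the last element.
           BinaryPrimitives.WriteInt32LittleEndian(offsetBuffer.AsSpan(lengths.Length * IntSize), offset);
           return offsetBuffer;
       }
   }
   ```
   
   If the older targets already reference System.Memory, `BinaryPrimitives` may be usable there too, which could make the `#if` unnecessary; worth checking as part of the follow-up.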



##########
csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TBinaryColumn.cs:
##########
@@ -83,14 +84,14 @@ public TBinaryColumn DeepCopy()
 
                   values = new ArrowBuffer.Builder<byte>();
                   int offset = 0;
-                  offsetBuffer = new byte[(length + 1) * 4];
+                  offsetBuffer = new byte[(length + 1) * IntSize];
                   var memory = offsetBuffer.AsMemory();
                   var typedMemory = Unsafe.As<Memory<byte>, Memory<int>>(ref memory).Slice(0, length + 1);

Review Comment:
   We don't seem to be using `typedMemory` any more. Applies to some of the 
other column types as well.



##########
csharp/src/Drivers/Apache/Spark/README.md:
##########
@@ -66,7 +69,34 @@ The following table depicts how the Spark ADBC driver converts a Spark type to a
 | USER_DEFINED         | String     | string |
 | VARCHAR              | String     | string |
 
-\* Complex types are returned as strings<br>
+### Apache Spark over HTTP
+
+| Spark Type           | Arrow Type | C# Type |
+| :---                 | :---:      | :---:   |
+| ARRAY*               | String     | string  |
+| BIGINT               | Int64      | long |
+| BINARY               | Binary     | byte[] |
+| BOOLEAN              | Boolean    | bool |
+| CHAR                 | String     | string |
+| DATE*                | *String*   | *string* |
+| DECIMAL*             | *String*   | *string* |

Review Comment:
   I think the correct thing is to do the conversion locally, yes. It would be very surprising for a user to run a query that produces a decimal or date value and get it back as a string. I understand that we've been avoiding this for structured types, where it's perhaps a little more justifiable, but here it just seems wrong.
   
   One thing we could do to unblock this check-in is to add an option for converting scalar types and fail if the value is not explicitly passed as false. That would let us add the conversion in a subsequent change without breaking consumers.
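   
   A rough sketch of that gate, reusing the connection's properties dictionary; the option key and exception type are placeholders, not the driver's current API:
   
   ```csharp
   using System;
   using System.Collections.Generic;
   
   static class ScalarConversionOptionSketch
   {
       // Hypothetical option key; the real name would follow the driver's existing
       // parameter-naming conventions.
       private const string ConvertScalarTypesOption = "adbc.spark.convert_scalar_types";
   
       // Fails unless the caller explicitly passes "false", so that turning the
       // conversion on later is not a breaking change for existing consumers.
       public static void Validate(IReadOnlyDictionary<string, string> properties)
       {
           if (!properties.TryGetValue(ConvertScalarTypesOption, out string? value) ||
               !string.Equals(value, "false", StringComparison.OrdinalIgnoreCase))
           {
               throw new NotSupportedException(
                   $"DECIMAL/DATE conversion is not implemented yet; pass {ConvertScalarTypesOption}=false to accept string values.");
           }
       }
   }
   ```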



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs:
##########
@@ -27,21 +26,21 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
 {
-    public abstract class HiveServer2Connection : AdbcConnection
+    internal abstract class HiveServer2Connection : AdbcConnection
     {
-        const string userAgent = "AdbcExperimental/0.0";
+        internal const long BatchSizeDefault = 50000;
+        internal const int PollTimeMillisecondsDefault = 500;
+        private const string userAgent = "AdbcExperimental/0.0";
 
         protected TOperationHandle? operationHandle;
-        protected readonly IReadOnlyDictionary<string, string> properties;
-        internal TTransport? transport;
-        internal TCLIService.Client? client;
-        internal TSessionHandle? sessionHandle;
+        internal TTransport? _transport;
+        private TCLIService.Client? _client;
         private readonly Lazy<string> _vendorVersion;
         private readonly Lazy<string> _vendorName;
 
-        internal HiveServer2Connection(IReadOnlyDictionary<string, string> properties)
+        public HiveServer2Connection(IReadOnlyDictionary<string, string> properties)

Review Comment:
   I don't feel strongly, but as a practical matter, if we ever wanted to make the class public to expose Hive-specific functionality, we'd just need to reverse them all again.



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs:
##########
@@ -27,21 +26,21 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
 {
-    public abstract class HiveServer2Connection : AdbcConnection
+    internal abstract class HiveServer2Connection : AdbcConnection

Review Comment:
   If for some reason there were driver-specific functionality that we wanted to expose, then the class would need to be public. That said, nothing prevents this from being changed later, and defaulting to exposing only the types we need is probably best.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
