CurtHagenlocher commented on code in PR #2018:
URL: https://github.com/apache/arrow-adbc/pull/2018#discussion_r1733122750


##########
csharp/src/Drivers/Apache/Spark/README.md:
##########
@@ -24,23 +24,26 @@
 Properties should be passed in the call to `SparkDriver.Open`,
 but can also be passed in the call to `AdbcDatabase.Connect`.
 
-| Property            | Description | Default |
-| :---                | :---        | :---    |
-| `adbc.spark.host`   | Host name for the data source. Do no include scheme or 
port number. Example: `sparkserver.region.cloudapp.azure.com` |  |
-| `adbc.spark.port`   | The port number the data source is listen on for new 
connections. | `443` |
-| `adbc.spark.path`   | The URI path on the data source server. Example: 
`sql/protocolv1/o/0123456789123456/01234-0123456-source` | |
-| `adbc.spark.token`  | For token-based authentication, the token to be 
authenticated on the data source. Example: `abcdef0123456789` | |
-<!-- Add these properties when basic authentication is available.
-| `adbc.spark.scheme` | The HTTP or HTTPS scheme to use. Allowed values: 
`http`, `https`. | `https` - when port is 443 or empty, `http`, otherwise. |
-| `auth_type`         | An indicator of the intended type of authentication. 
Allowed values: `basic`, `token`. This property is optional. The authentication 
type can be inferred from `token`, `username`, and `password`. If a `token` 
value is provided, token authentication is used. Otherwise, if both `username` 
and `password` values are provided, basic authentication is used. | |
-| `username`          | The user name used for basic authentication | |
-| `password`          | The password for the user name used for basic 
authentication. | |
--->
+| Property               | Description | Default |
+| :---                   | :---        | :---    |
+| `adbc.spark.type`      | (Required) Indicates the Spark server type. One of 
`databricks`, `http` (future: `standard`, `hdinsight`) | |
+| `adbc.spark.auth_type` | An indicator of the intended type of 
authentication. Allowed values: `none`, `username_only`,_`basic`, and `token`. 
This property is optional. The authentication type can be inferred from 
`token`, `username`, and `password`. If a `token` value is provided, token 
authentication is used. Otherwise, if both `username` and `password` values are 
provided, basic authentication is used. | |

Review Comment:
   nit: underscore before `basic` looks like a typo



##########
csharp/src/Drivers/Apache/Spark/README.md:
##########
@@ -24,23 +24,26 @@
 Properties should be passed in the call to `SparkDriver.Open`,
 but can also be passed in the call to `AdbcDatabase.Connect`.
 
-| Property            | Description | Default |
-| :---                | :---        | :---    |
-| `adbc.spark.host`   | Host name for the data source. Do no include scheme or 
port number. Example: `sparkserver.region.cloudapp.azure.com` |  |
-| `adbc.spark.port`   | The port number the data source is listen on for new 
connections. | `443` |
-| `adbc.spark.path`   | The URI path on the data source server. Example: 
`sql/protocolv1/o/0123456789123456/01234-0123456-source` | |
-| `adbc.spark.token`  | For token-based authentication, the token to be 
authenticated on the data source. Example: `abcdef0123456789` | |
-<!-- Add these properties when basic authentication is available.
-| `adbc.spark.scheme` | The HTTP or HTTPS scheme to use. Allowed values: 
`http`, `https`. | `https` - when port is 443 or empty, `http`, otherwise. |
-| `auth_type`         | An indicator of the intended type of authentication. 
Allowed values: `basic`, `token`. This property is optional. The authentication 
type can be inferred from `token`, `username`, and `password`. If a `token` 
value is provided, token authentication is used. Otherwise, if both `username` 
and `password` values are provided, basic authentication is used. | |
-| `username`          | The user name used for basic authentication | |
-| `password`          | The password for the user name used for basic 
authentication. | |
--->
+| Property               | Description | Default |
+| :---                   | :---        | :---    |
+| `adbc.spark.type`      | (Required) Indicates the Spark server type. One of 
`databricks`, `http` (future: `standard`, `hdinsight`) | |
+| `adbc.spark.auth_type` | An indicator of the intended type of 
authentication. Allowed values: `none`, `username_only`,_`basic`, and `token`. 
This property is optional. The authentication type can be inferred from 
`token`, `username`, and `password`. If a `token` value is provided, token 
authentication is used. Otherwise, if both `username` and `password` values are 
provided, basic authentication is used. | |
+| `adbc.spark.host`      | Host name for the data source. Do no include scheme 
or port number. Example: `sparkserver.region.cloudapp.azure.com` |  |

Review Comment:
   nit: typo: `Do no` should be `Do not`



##########
csharp/src/Drivers/Apache/Impala/ImpalaConnection.cs:
##########
@@ -26,26 +27,31 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Impala
 {
-    public class ImpalaConnection : HiveServer2Connection
+    internal class ImpalaConnection : HiveServer2Connection
     {
         internal ImpalaConnection(IReadOnlyDictionary<string, string> 
properties)
             : base(properties)
         {
         }
 
-        protected override ValueTask<TProtocol> CreateProtocolAsync()
+        protected override Task<TTransport> CreateTransportAsync()
         {
-            string hostName = properties["HostName"];
+            string hostName = Properties["HostName"];
             string? tmp;
             int port = 21050; // default?

Review Comment:
   Consider lifting to a named constant



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Reader.cs:
##########
@@ -0,0 +1,87 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+using System;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+using Apache.Arrow.Ipc;
+using Apache.Hive.Service.Rpc.Thrift;
+using Thrift.Protocol;
+using Thrift.Transport.Client;
+using Thrift;
+
+namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
+{
+    internal class HiveServer2Reader : IArrowArrayStream
+    {
+        HiveServer2Statement? _statement;
+        private readonly long _batchSize;
+        //int _counter;

Review Comment:
   remove?



##########
csharp/src/Drivers/Apache/Spark/README.md:
##########
@@ -24,23 +24,26 @@
 Properties should be passed in the call to `SparkDriver.Open`,
 but can also be passed in the call to `AdbcDatabase.Connect`.
 
-| Property            | Description | Default |
-| :---                | :---        | :---    |
-| `adbc.spark.host`   | Host name for the data source. Do no include scheme or 
port number. Example: `sparkserver.region.cloudapp.azure.com` |  |
-| `adbc.spark.port`   | The port number the data source is listen on for new 
connections. | `443` |
-| `adbc.spark.path`   | The URI path on the data source server. Example: 
`sql/protocolv1/o/0123456789123456/01234-0123456-source` | |
-| `adbc.spark.token`  | For token-based authentication, the token to be 
authenticated on the data source. Example: `abcdef0123456789` | |
-<!-- Add these properties when basic authentication is available.
-| `adbc.spark.scheme` | The HTTP or HTTPS scheme to use. Allowed values: 
`http`, `https`. | `https` - when port is 443 or empty, `http`, otherwise. |
-| `auth_type`         | An indicator of the intended type of authentication. 
Allowed values: `basic`, `token`. This property is optional. The authentication 
type can be inferred from `token`, `username`, and `password`. If a `token` 
value is provided, token authentication is used. Otherwise, if both `username` 
and `password` values are provided, basic authentication is used. | |
-| `username`          | The user name used for basic authentication | |
-| `password`          | The password for the user name used for basic 
authentication. | |
--->
+| Property               | Description | Default |
+| :---                   | :---        | :---    |
+| `adbc.spark.type`      | (Required) Indicates the Spark server type. One of 
`databricks`, `http` (future: `standard`, `hdinsight`) | |
+| `adbc.spark.auth_type` | An indicator of the intended type of 
authentication. Allowed values: `none`, `username_only`,_`basic`, and `token`. 
This property is optional. The authentication type can be inferred from 
`token`, `username`, and `password`. If a `token` value is provided, token 
authentication is used. Otherwise, if both `username` and `password` values are 
provided, basic authentication is used. | |
+| `adbc.spark.host`      | Host name for the data source. Do no include scheme 
or port number. Example: `sparkserver.region.cloudapp.azure.com` |  |
+| `adbc.spark.port`      | The port number the data source is listen on for 
new connections. | `443` |

Review Comment:
   nit: grammar: `is listen on` should be `is listening on` or `listens on`



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs:
##########
@@ -27,21 +26,21 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
 {
-    public abstract class HiveServer2Connection : AdbcConnection
+    internal abstract class HiveServer2Connection : AdbcConnection
     {
-        const string userAgent = "AdbcExperimental/0.0";
+        internal const long BatchSizeDefault = 50000;
+        internal const int PollTimeMillisecondsDefault = 500;
+        private const string userAgent = "AdbcExperimental/0.0";
 
         protected TOperationHandle? operationHandle;
-        protected readonly IReadOnlyDictionary<string, string> properties;
-        internal TTransport? transport;
-        internal TCLIService.Client? client;
-        internal TSessionHandle? sessionHandle;
+        internal TTransport? _transport;
+        private TCLIService.Client? _client;
         private readonly Lazy<string> _vendorVersion;
         private readonly Lazy<string> _vendorName;
 
-        internal HiveServer2Connection(IReadOnlyDictionary<string, string> 
properties)
+        public HiveServer2Connection(IReadOnlyDictionary<string, string> 
properties)

Review Comment:
   Is there any reason to prefer public over internal? Of course, if the class 
does end up being public then a bunch of these would need to go back to being 
internal.



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs:
##########
@@ -27,21 +26,21 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
 {
-    public abstract class HiveServer2Connection : AdbcConnection
+    internal abstract class HiveServer2Connection : AdbcConnection

Review Comment:
   FlightSqlConnection, BigQueryConnection, etc. are public; why should the 
Hive/Impala/Spark classes be internal?
   
   (I'm not convinced either way; just arguing for consistency. 
ImportedAdbcConnection is a different category of thing in that it doesn't 
represent a connection to a specific type of database and isn't likely to have 
any db-specific members -- but even there, I'm not entirely convinced that it 
should be different.)



##########
csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TDoubleColumn.cs:
##########
@@ -78,14 +79,19 @@ public TDoubleColumn DeepCopy()
                   var _list178 = await 
iprot.ReadListBeginAsync(cancellationToken);
                   length = _list178.Count;
 
-                  buffer = new byte[length * 8];
+                  buffer = new byte[length * DoubleSize];
                   var memory = buffer.AsMemory();
                   var typedMemory = Unsafe.As<Memory<byte>, Memory<long>>(ref 
memory).Slice(0, length);
                   iprot.Transport.CheckReadBytesAvailable(buffer.Length);
                   await transport.ReadExactlyAsync(memory, cancellationToken);
                   for (int _i179 = 0; _i179 < length; ++_i179)
                   {
+#if NET6_0_OR_GREATER
                     typedMemory.Span[_i179] = 
BinaryPrimitives.ReverseEndianness(typedMemory.Span[_i179]);
+#else
+                    long source = 
BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(buffer, _i179 * 
DoubleSize));
+                    BitConverter.GetBytes(source).CopyTo(buffer, _i179 * 
DoubleSize);

Review Comment:
   This is pretty expensive because it needs to allocate an array. Consider 
doing something with unsafe code.



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Reader.cs:
##########
@@ -0,0 +1,87 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+using System;
+using System.Linq;
+using System.Threading;
+using System.Threading.Tasks;
+using Apache.Arrow.Ipc;
+using Apache.Hive.Service.Rpc.Thrift;
+using Thrift.Protocol;
+using Thrift.Transport.Client;
+using Thrift;
+
+namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
+{
+    internal class HiveServer2Reader : IArrowArrayStream
+    {
+        HiveServer2Statement? _statement;

Review Comment:
   nit: private



##########
csharp/src/Drivers/Apache/Spark/README.md:
##########
@@ -66,7 +69,34 @@ The following table depicts how the Spark ADBC driver 
converts a Spark type to a
 | USER_DEFINED         | String     | string |
 | VARCHAR              | String     | string |
 
-\* Complex types are returned as strings<br>
+### Apache Spark over HTTP
+
+| Spark Type           | Arrow Type | C# Type |
+| :---                 | :---:      | :---:   |
+| ARRAY*               | String     | string  |
+| BIGINT               | Int64      | long |
+| BINARY               | Binary     | byte[] |
+| BOOLEAN              | Boolean    | bool |
+| CHAR                 | String     | string |
+| DATE*                | *String*   | *string* |
+| DECIMAL*             | *String*   | *string* |

Review Comment:
   Is this a fundamental limitation or an implementation issue? Are we getting 
back enough information to do the conversion locally when the data is returned 
in a Thrift vs Arrow format?



##########
csharp/src/Drivers/Apache/Thrift/BitmapUtilities.cs:
##########
@@ -0,0 +1,41 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*    http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+using static Apache.Arrow.ArrowBuffer;
+
+namespace Apache.Arrow.Adbc.Drivers.Apache.Thrift
+{
+    internal static class BitmapUtilities
+    {
+        /// <summary>
+        /// Gets the "validity" bitmap buffer from a 'nulls' bitmap.
+        /// </summary>
+        /// <param name="nulls">The bitmap of rows where the value is a null 
value (i.e., "invalid")</param>
+        /// <param name="nullCount">Returns the number of bits set in the 
bitmap.</param>
+        /// <returns>A <see cref="ArrowBuffer"/> bitmap of "valid" rows (i.e., 
not null values).</returns>
+        /// <remarks>Inverts the bits in the incoming bitmap to reverse the 
null to valid indicators.</remarks>
+        internal static ArrowBuffer GetValidityBitmapBuffer(byte[] nulls, out 
int nullCount)
+        {
+            nullCount = BitUtility.CountBits(nulls);

Review Comment:
   This is both incorrect and (probably) slow.
   
   BitUtility.CountBits returns the number of bits that are set, but we 
presumably need to toggle all the bits in the range, whether or not they are 
set. The only way to do this correctly is to pass in the number of elements in 
the array. All of the callers should have this information.
   
   A likely-faster approach is then to do bytewise flipping until we get for 
(count / 8) bytes and then bitwise flipping for the remaining (count % 8) bits. 
We can almost certainly do better by flipping the bits in larger chunks, but at 
the expense of more complicated logic.



##########
csharp/src/Drivers/Apache/Thrift/Service/Rpc/Thrift/TDoubleColumn.cs:
##########
@@ -78,14 +79,19 @@ public TDoubleColumn DeepCopy()
                   var _list178 = await 
iprot.ReadListBeginAsync(cancellationToken);
                   length = _list178.Count;
 
-                  buffer = new byte[length * 8];
+                  buffer = new byte[length * DoubleSize];
                   var memory = buffer.AsMemory();
                   var typedMemory = Unsafe.As<Memory<byte>, Memory<long>>(ref 
memory).Slice(0, length);
                   iprot.Transport.CheckReadBytesAvailable(buffer.Length);
                   await transport.ReadExactlyAsync(memory, cancellationToken);
                   for (int _i179 = 0; _i179 < length; ++_i179)
                   {
+#if NET6_0_OR_GREATER
                     typedMemory.Span[_i179] = 
BinaryPrimitives.ReverseEndianness(typedMemory.Span[_i179]);
+#else
+                    long source = 
BinaryPrimitives.ReverseEndianness(BitConverter.ToInt64(buffer, _i179 * 
DoubleSize));
+                    BitConverter.GetBytes(source).CopyTo(buffer, _i179 * 
DoubleSize);

Review Comment:
   Alternatively, just swap the bytes manually. We already have a non-optimal 
implementation of ReverseEndianI32AtOffset in StreamExtensions.cs. It would be 
trivial to create a similar ReverseEndianI64AtOffset and use it for both 
TDoubleColumn and T64Column.



##########
csharp/src/Drivers/Apache/Hive2/HiveServer2Connection.cs:
##########
@@ -27,21 +26,21 @@
 
 namespace Apache.Arrow.Adbc.Drivers.Apache.Hive2
 {
-    public abstract class HiveServer2Connection : AdbcConnection
+    internal abstract class HiveServer2Connection : AdbcConnection
     {
-        const string userAgent = "AdbcExperimental/0.0";
+        internal const long BatchSizeDefault = 50000;
+        internal const int PollTimeMillisecondsDefault = 500;
+        private const string userAgent = "AdbcExperimental/0.0";

Review Comment:
   appears unused



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to