jadewang-db commented on code in PR #3576:
URL: https://github.com/apache/arrow-adbc/pull/3576#discussion_r2449353111


##########
csharp/src/Drivers/Databricks/statement-execution-api-design.md:
##########
@@ -0,0 +1,2656 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Databricks Statement Execution API Integration Design
+
+## Executive Summary
+
+This document outlines the design for adding Databricks Statement Execution API support as an alternative to the current Thrift-based protocol in the Databricks ADBC driver.
+
+**Key Benefits**:
+- **Simpler Protocol**: Standard REST/JSON vs complex Thrift binary protocol
+- **Code Reuse**: Leverage existing CloudFetch pipeline with minimal refactoring
+- **Backward Compatible**: Existing Thrift implementation continues to work
+
+**Implementation Scope**:
+- **New**: REST API client, statement executor, API models, readers
+- **Modified**: Minimal CloudFetch interface refactoring for protocol independence
+- **Reused**: Authentication, tracing, retry logic, download pipeline, memory management
+
+## Overview
+
+### Complete Architecture Overview
+
+```mermaid
+graph TB
+    subgraph "Client Layer"
+        App[ADBC Application]
+    end
+
+    subgraph "ADBC Driver"
+        App --> DC[DatabricksConnection]
+        DC --> Cfg{Protocol Config}
+
+        Cfg -->|thrift| ThriftImpl[Thrift Implementation]
+        Cfg -->|rest| RestImpl[REST Implementation]
+
+        subgraph "Thrift Path (Existing)"
+            ThriftImpl --> TSess[Session Management]
+            ThriftImpl --> TStmt[Thrift Statement]
+            TStmt --> TFetch[CloudFetchResultFetcher<br/>Incremental fetching]
+        end
+
+        subgraph "REST Path (New)"
+            RestImpl --> RStmt[StatementExecutionStatement]
+            RStmt --> RFetch[StatementExecutionResultFetcher<br/>Manifest-based]
+        end
+
+        subgraph "Shared CloudFetch Pipeline"
+            TFetch --> Queue[Download Queue]
+            RFetch --> Queue
+            Queue --> DM[CloudFetchDownloadManager]
+            DM --> Down[CloudFetchDownloader]
+            DM --> Mem[MemoryBufferManager]
+            DM --> Reader[CloudFetchReader<br/>REUSED!]
+        end
+
+        Reader --> Arrow[Arrow Record Batches]
+    end
+
+    subgraph "Databricks Platform"
+        ThriftImpl --> HS2[HiveServer2 Thrift]
+        RestImpl --> SEAPI[Statement Execution API]
+        Down --> Storage[Cloud Storage<br/>S3/Azure/GCS]
+    end
+
+    style ThriftImpl fill:#ffe6e6
+    style RestImpl fill:#ccffcc
+    style DM fill:#ccccff
+    style Down fill:#ccccff
+    style Mem fill:#ccccff
+```
+
+## Background
+
+### Current Implementation (Thrift Protocol)
+- Uses Apache Hive Server 2 (HS2) Thrift protocol over HTTP
+- Inherits from `SparkHttpConnection` and `SparkStatement`
+- Supports CloudFetch for large result sets via Thrift's `DownloadResult` capability
+- Direct results for small result sets via Thrift's `GetDirectResults`
+- Complex HTTP handler chain: tracing → retry → OAuth → token exchange
+
+### Statement Execution API
+- RESTful HTTP API using JSON/Arrow formats
+- Endpoints:
+  - **Session Management**:
+    - `POST /api/2.0/sql/sessions` - Create session
+    - `DELETE /api/2.0/sql/sessions/{session_id}` - Delete session
+  - **Statement Execution**:
+    - `POST /api/2.0/sql/statements` - Execute statement
+    - `GET /api/2.0/sql/statements/{statement_id}` - Get statement status/results
+    - `GET /api/2.0/sql/statements/{statement_id}/result/chunks/{chunk_index}` - Get result chunk
+    - `POST /api/2.0/sql/statements/{statement_id}/cancel` - Cancel statement
+    - `DELETE /api/2.0/sql/statements/{statement_id}` - Close statement
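+
+As a rough sketch of the request shapes these endpoints expect (Python for brevity — the driver itself is C#; the body field names follow the public Statement Execution API, while the warehouse and statement IDs below are placeholders):
+
```python
import json

BASE = "/api/2.0/sql/statements"

def execute_statement_request(warehouse_id: str, sql: str) -> dict:
    """Build the body for `POST /api/2.0/sql/statements`."""
    return {
        "warehouse_id": warehouse_id,
        "statement": sql,
        # EXTERNAL_LINKS returns presigned cloud-storage URLs, which is what
        # lets the driver reuse the existing CloudFetch download pipeline.
        "disposition": "EXTERNAL_LINKS",
        "format": "ARROW_STREAM",
        "wait_timeout": "10s",
    }

def chunk_url(statement_id: str, chunk_index: int) -> str:
    """URL for fetching one result chunk of a finished statement."""
    return f"{BASE}/{statement_id}/result/chunks/{chunk_index}"

body = execute_statement_request("wh-placeholder", "SELECT 1")
print(json.dumps(body, indent=2))
print(chunk_url("stmt-placeholder", 0))
```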
+
+### Key Advantages of Statement Execution API
+1. **Simpler Protocol**: Standard REST/JSON vs complex Thrift binary protocol
+2. **Better Performance**: Optimized for large result sets with presigned S3/Azure URLs
+3. **Modern Authentication**: Built for OAuth 2.0 and service principals
+4. **Flexible Disposition**: INLINE (≤25 MiB), EXTERNAL_LINKS (≤100 GiB), or INLINE_OR_EXTERNAL_LINKS (hybrid)
+5. **Session Support**: Explicit session management with session-level configuration
+
+## Design Goals
+
+1. **Backward Compatibility**: Existing Thrift-based code continues to work
+2. **Configuration-Driven**: Users choose protocol via connection parameters
+3. **Code Reuse**: Leverage existing CloudFetch prefetch pipeline for EXTERNAL_LINKS
+4. **Minimal Duplication**: Share authentication, tracing, retry logic
+5. **ADBC Compliance**: Maintain full ADBC API compatibility
+
+## Architecture
+
+### High-Level Components
+
+```mermaid
+graph TB
+    DC[DatabricksConnection]
+
+    DC --> PS[Protocol Selection]
+    DC --> Common[Common Components:<br/>Authentication, HTTP Client,<br/>Tracing, Retry]
+
+    PS --> Thrift[Thrift Path<br/>Existing]
+    PS --> REST[Statement Execution API Path<br/>New]
+
+    Thrift --> THC[SparkHttpConnection]
+    Thrift --> TClient[TCLIService Client]
+    Thrift --> TStmt[DatabricksStatement<br/>Thrift]
+    Thrift --> TCF[CloudFetch<br/>Thrift-based]
+
+    REST --> RConn[StatementExecutionConnection]

Review Comment:
   Both the statement and the connection will have different implementations between Thrift and SEA, e.g., how to execute a query and how to retrieve metadata.
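
   The split described here could look roughly like the sketch below (Python for brevity — the driver is C#; all class and method names are hypothetical, not the actual driver types). Note that SEA has no dedicated metadata endpoints, so a likely approach is issuing SHOW/INFORMATION_SCHEMA queries through the statement path:

```python
from abc import ABC, abstractmethod

class StatementBase(ABC):
    @abstractmethod
    def execute_query(self, sql: str) -> str: ...

class ConnectionBase(ABC):
    @abstractmethod
    def create_statement(self) -> StatementBase: ...
    @abstractmethod
    def get_metadata(self, kind: str) -> str: ...

class ThriftStatement(StatementBase):
    def execute_query(self, sql: str) -> str:
        return f"TCLIService.ExecuteStatement: {sql}"   # Thrift RPC

class SeaStatement(StatementBase):
    def execute_query(self, sql: str) -> str:
        return f"POST /api/2.0/sql/statements: {sql}"   # REST call

class ThriftConnection(ConnectionBase):
    def create_statement(self) -> StatementBase:
        return ThriftStatement()
    def get_metadata(self, kind: str) -> str:
        return f"TCLIService.Get{kind}"                 # e.g. GetTables, GetColumns

class SeaConnection(ConnectionBase):
    def create_statement(self) -> StatementBase:
        return SeaStatement()
    def get_metadata(self, kind: str) -> str:
        # No metadata RPCs in SEA; route through a SQL statement instead.
        return self.create_statement().execute_query(f"SHOW {kind}")
```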



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
