eric-wang-1990 commented on code in PR #3636: URL: https://github.com/apache/arrow-adbc/pull/3636#discussion_r2471741837
########## csharp/src/Drivers/Databricks/Telemetry/telemetry-activity-based-design.md: ########## @@ -0,0 +1,1242 @@ +<!-- + + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + +--> + +# Databricks ADBC Driver: Activity-Based Telemetry Design + +## Executive Summary + +This document outlines an **Activity-based telemetry design** that leverages the existing Activity/ActivitySource infrastructure in the Databricks ADBC driver. Instead of creating a parallel telemetry system, we extend the current tracing infrastructure to collect metrics and export them to Databricks telemetry service. + +**Key Objectives:** +- Reuse existing Activity instrumentation points +- Add metrics collection without duplicating code +- Export aggregated metrics to Databricks service +- Maintain server-side feature flag control +- Preserve backward compatibility with OpenTelemetry + +**Design Principles:** +- **Build on existing infrastructure**: Leverage ActivitySource/ActivityListener +- **Single instrumentation point**: Don't duplicate tracing and metrics +- **Non-blocking**: All operations async and non-blocking +- **Privacy-first**: No PII or query data collected +- **Server-controlled**: Feature flag support for enable/disable + +--- + +## Table of Contents + +1. [Background & Motivation](#1-background--motivation) +2. [Architecture Overview](#2-architecture-overview) +3. [Core Components](#3-core-components) +4. [Data Collection](#4-data-collection) +5. [Export Mechanism](#5-export-mechanism) +6. [Configuration](#6-configuration) +7. [Privacy & Compliance](#7-privacy--compliance) +8. [Error Handling](#8-error-handling) +9. [Testing Strategy](#9-testing-strategy) +10. [Alternatives Considered](#10-alternatives-considered) +11. [Implementation Checklist](#11-implementation-checklist) +12. [Open Questions](#12-open-questions) +13. [References](#13-references) + +--- + +## 1. Background & Motivation + +### 1.1 Current State + +The Databricks ADBC driver already has: +- ✅ **ActivitySource**: `DatabricksAdbcActivitySource` +- ✅ **Activity instrumentation**: Connection, statement execution, result fetching +- ✅ **W3C Trace Context**: Distributed tracing support +- ✅ **ActivityTrace utility**: Helper for creating activities + +### 1.2 Design Opportunity + +The driver already has comprehensive Activity instrumentation for distributed tracing. This presents an opportunity to: +- Leverage existing Activity infrastructure for both tracing and metrics +- Avoid duplicate instrumentation points in the driver code +- Use a single data model (Activity) for both observability concerns +- Maintain automatic correlation between traces and metrics +- Reduce overall system complexity and maintenance burden + +### 1.3 The Approach + +**Extend Activity infrastructure** with metrics collection: +- ✅ Single instrumentation point (Activity) +- ✅ Custom ActivityListener for metrics aggregation +- ✅ Export aggregated data to Databricks service +- ✅ Reuse Activity context, correlation, and timing +- ✅ Seamless integration with OpenTelemetry ecosystem + +--- + +## 2. Architecture Overview + +### 2.1 High-Level Architecture + +```mermaid +graph TB + A[Driver Operations] -->|Activity.Start/Stop| B[ActivitySource] + B -->|Activity Events| C[DatabricksActivityListener] + C -->|Aggregate Metrics| D[MetricsAggregator] + D -->|Batch & Buffer| E[DatabricksTelemetryExporter] + E -->|HTTP POST| F[Databricks Service] + F --> G[Lumberjack] + + H[Feature Flag Service] -.->|Enable/Disable| C + + style C fill:#e1f5fe + style D fill:#e1f5fe + style E fill:#e1f5fe +``` + +**Key Components:** +1. **ActivitySource** (existing): Emits activities for all operations +2. **DatabricksActivityListener** (new): Listens to activities, extracts metrics +3. **MetricsAggregator** (new): Aggregates by statement, batches events +4. **DatabricksTelemetryExporter** (new): Exports to Databricks service + +### 2.2 Activity Flow + +```mermaid +sequenceDiagram + participant App as Application + participant Driver as DatabricksConnection + participant AS as ActivitySource + participant AL as ActivityListener + participant MA as MetricsAggregator + participant Ex as TelemetryExporter + participant Service as Databricks Service + + App->>Driver: ExecuteQueryAsync() + Driver->>AS: StartActivity("ExecuteQuery") + AS->>AL: ActivityStarted(activity) + + Driver->>Driver: Execute operation + Driver->>AS: activity.SetTag("result_format", "CloudFetch") + Driver->>AS: activity.AddEvent("ChunkDownload", tags) + + AS->>AL: ActivityStopped(activity) + AL->>MA: ProcessActivity(activity) + MA->>MA: Aggregate by statement_id + + alt Batch threshold reached + MA->>Ex: Flush(batch) + Ex->>Service: POST /telemetry-ext + end +``` + +--- + +## 3. Core Components + +### 3.1 DatabricksActivityListener + +**Purpose**: Listen to Activity events and extract metrics for Databricks telemetry. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.DatabricksActivityListener` + +#### Interface + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// <summary> + /// Custom ActivityListener that aggregates metrics from Activity events + /// and exports them to Databricks telemetry service. + /// </summary> + public sealed class DatabricksActivityListener : IDisposable + { + public DatabricksActivityListener( + DatabricksConnection connection, + ITelemetryExporter exporter, + TelemetryConfiguration config); + + // Start listening to activities + public void Start(); + + // Stop listening and flush pending metrics + public Task StopAsync(); + + public void Dispose(); + } +} +``` + +#### Activity Listener Configuration + +```csharp +// Internal setup +private ActivityListener CreateListener() +{ + return new ActivityListener + { + ShouldListenTo = source => + source.Name == "Databricks.Adbc.Driver", + + ActivityStarted = OnActivityStarted, + ActivityStopped = OnActivityStopped, + + Sample = (ref ActivityCreationOptions<ActivityContext> options) => + _config.Enabled ? ActivitySamplingResult.AllDataAndRecorded + : ActivitySamplingResult.None + }; +} +``` + +#### Contracts + +**Activity Filtering**: +- Only listen to `"Databricks.Adbc.Driver"` ActivitySource +- Respects feature flag via `Sample` callback + +**Metric Extraction**: +- Extract metrics from Activity tags +- Aggregate by `statement_id` tag +- Aggregate by `session_id` tag + +**Non-Blocking**: +- All processing async +- Never blocks Activity completion +- Failures logged but don't propagate + +--- + +### 3.2 MetricsAggregator + +**Purpose**: Aggregate Activity data into metrics suitable for Databricks telemetry. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.MetricsAggregator` + +#### Interface + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + /// <summary> + /// Aggregates metrics from activities by statement and session. + /// </summary> + internal sealed class MetricsAggregator : IDisposable + { + public MetricsAggregator( + ITelemetryExporter exporter, + TelemetryConfiguration config); + + // Process completed activity + public void ProcessActivity(Activity activity); + + // Mark statement complete and emit aggregated metrics + public void CompleteStatement(string statementId); + + // Flush all pending metrics + public Task FlushAsync(CancellationToken ct = default); + + public void Dispose(); + } +} +``` + +#### Aggregation Logic + +```mermaid +flowchart TD + A[Activity Stopped] --> B{Determine EventType} + B -->|Connection.Open*| C[Map to ConnectionOpen] + B -->|Statement.*| D[Map to StatementExecution] + B -->|error.type tag present| E[Map to Error] + + C --> F[Emit Connection Event Immediately] + D --> G[Aggregate by statement_id] + E --> H[Emit Error Event Immediately] + + G --> I{Statement Complete?} + I -->|Yes| J[Emit Aggregated Statement Event] + I -->|No| K[Continue Buffering] + + J --> L{Batch Size Reached?} + L -->|Yes| M[Flush Batch to Exporter] + L -->|No| K +``` + +**Key Behaviors:** +- **Connection events**: Emitted immediately (no aggregation needed) +- **Statement events**: Aggregated by `statement_id` until statement completes +- **Error events**: Emitted immediately +- **Child activities** (CloudFetch.Download, etc.): Metrics rolled up to parent statement activity + +#### Contracts + +**Statement Aggregation**: +- Activities with same `statement_id` tag aggregated together +- Aggregation includes: execution latency, chunk downloads, poll count +- Emitted when statement marked complete + +**Connection-Level Events**: +- Connection.Open emitted immediately +- Driver configuration collected once per connection + +**Error Handling**: +- Activity errors (tags with `error.type`) captured +- Never throws exceptions + +--- + +### 3.3 DatabricksTelemetryExporter + +**Purpose**: Export aggregated metrics to Databricks telemetry service. + +**Location**: `Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.DatabricksTelemetryExporter` + +#### Interface + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry +{ + public interface ITelemetryExporter + { + /// <summary> + /// Export metrics to Databricks service. Never throws. + /// </summary> + Task ExportAsync( + IReadOnlyList<TelemetryMetric> metrics, + CancellationToken ct = default); + } + + internal sealed class DatabricksTelemetryExporter : ITelemetryExporter + { + public DatabricksTelemetryExporter( + HttpClient httpClient, + DatabricksConnection connection, + TelemetryConfiguration config); + + public Task ExportAsync( + IReadOnlyList<TelemetryMetric> metrics, + CancellationToken ct = default); + } +} +``` + +**Same implementation as original design**: Circuit breaker, retry logic, endpoints. + +--- + +## 4. Data Collection + +### 4.1 Tag Definition System + +To ensure maintainability and explicit control over what data is collected and exported, all Activity tags are defined in a centralized tag definition system. + +#### Tag Definition Structure + +**Location**: `Telemetry/TagDefinitions/` + +``` +Telemetry/ +└── TagDefinitions/ + ├── TelemetryTag.cs # Tag metadata and annotations + ├── TelemetryEvent.cs # Event definitions with associated tags + ├── ConnectionOpenEvent.cs # Connection event tag definitions + ├── StatementExecutionEvent.cs # Statement event tag definitions + └── ErrorEvent.cs # Error event tag definitions +``` + +#### TelemetryTag Annotation + +**File**: `TagDefinitions/TelemetryTag.cs` + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions +{ + /// <summary> + /// Defines export scope for telemetry tags. + /// </summary> + [Flags] + internal enum TagExportScope + { + None = 0, + ExportLocal = 1, // Export to local diagnostics (file listener, etc.) + ExportDatabricks = 2, // Export to Databricks telemetry service + ExportAll = ExportLocal | ExportDatabricks + } + + /// <summary> + /// Attribute to annotate Activity tag definitions. + /// </summary> + [AttributeUsage(AttributeTargets.Field, AllowMultiple = false)] + internal sealed class TelemetryTagAttribute : Attribute + { + public string TagName { get; } + public TagExportScope ExportScope { get; set; } + public string? Description { get; set; } + public bool Required { get; set; } + + public TelemetryTagAttribute(string tagName) + { + TagName = tagName; + ExportScope = TagExportScope.ExportAll; + } + } +} +``` + +#### Event Tag Definitions + +**File**: `TagDefinitions/ConnectionOpenEvent.cs` + +```csharp +namespace Apache.Arrow.Adbc.Drivers.Databricks.Telemetry.TagDefinitions +{ + /// <summary> + /// Tag definitions for Connection.Open events. + /// </summary> + internal static class ConnectionOpenEvent Review Comment: Talked offline, there should not be too many event type name, let's just stick to this for now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
