This is an automated email from the ASF dual-hosted git repository.

beto pushed a commit to branch explorable
in repository https://gitbox.apache.org/repos/asf/superset.git
commit 922698c91bac3c2f5eb46d795d37ac907715841a
Author: Beto Dealmeida <[email protected]>
AuthorDate: Wed Oct 15 11:39:59 2025 -0400

    Add docs
---
 chart-request-flow.md          |  359 +++++++++++
 query-object-documentation.md  | 1299 ++++++++++++++++++++++++++++++++++++++++
 semantic-layer-architecture.md | 1264 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 2922 insertions(+)

diff --git a/chart-request-flow.md b/chart-request-flow.md
new file mode 100644
index 0000000000..cc1edd27d6
--- /dev/null
+++ b/chart-request-flow.md

# Chart Data Request Flow in Apache Superset

This document traces the complete path of a chart data request through the Superset backend, from API endpoint to database query and back.

## Overview

When a client requests chart data (e.g., loading a histogram chart), the request flows through multiple layers:

1. API Endpoint
2. Schema Validation/Parsing
3. Command Pattern (Business Logic)
4. Query Context Processing
5. Database Execution
6. Post-Processing
7. Response Formatting

## Detailed Flow

### 1. Entry Point: API Endpoint

**File**: `superset/charts/data/api.py:187`

**Endpoint**: `POST /api/v1/chart/data`

The request hits the `ChartDataRestApi.data()` method, which:

- Parses the JSON body from the request
- Creates a `QueryContext` object from the form data via `ChartDataQueryContextSchema`
- Creates a `ChartDataCommand` to execute the query
- Validates and executes the command

```python
def data(self) -> Response:
    json_body = request.json
    query_context = self._create_query_context_from_form(json_body)
    command = ChartDataCommand(query_context)
    command.validate()
    return self._get_data_response(command, ...)
```

### 2. Schema Layer: Request Parsing

**File**: `superset/charts/schemas.py:1384`

`ChartDataQueryContextSchema.load()` deserializes the request into:

**QueryContext object** (the main container):
- datasource: Database table/query info
- queries: List of query objects
- result_format: JSON/CSV/XLSX
- result_type: FULL/SAMPLES/QUERY/etc.
- force: Whether to bypass cache

**List of QueryObject instances** (one per query in the request):
- columns: Columns to select (e.g., `["age"]`)
- metrics: Aggregations to compute
- filters: WHERE clause filters
- post_processing: Client-side transformations (e.g., histogram with bins=25)

### 3. Command Pattern: Business Logic

**File**: `superset/commands/chart/data/get_data_command.py:39`

`ChartDataCommand.run()` orchestrates the execution:

```python
def run(self, **kwargs: Any) -> dict[str, Any]:
    payload = self._query_context.get_payload(
        cache_query_context=cache_query_context,
        force_cached=force_cached,
    )

    for query in payload["queries"]:
        if query.get("error"):
            raise ChartDataQueryFailedError(query["error"])

    return {
        "query_context": self._query_context,
        "queries": payload["queries"],
    }
```

### 4. Query Context Processor: Core Execution

**File**: `superset/common/query_context_processor.py:1052`

`QueryContextProcessor.get_payload()`:
- Iterates through each `QueryObject` in `query_context.queries`
- For each query, calls `get_query_results()`, which routes based on result_type:
  - `FULL` → `_get_full()` → `get_df_payload()`
  - `SAMPLES` → `_get_samples()`
  - `QUERY` → `_get_query()`

**File**: `superset/common/query_context_processor.py:128`

`QueryContextProcessor.get_df_payload()`:

1. **Generate cache key** from the query object
2. **Check cache** using `QueryCacheManager`
3. **If cache miss**:
   - Validate columns exist in the datasource
   - Call `get_query_result(query_obj)` to execute SQL
   - Get annotation data if needed
   - Cache the result with the appropriate timeout
4. **Return payload** with DataFrame and metadata

```python
def get_df_payload(self, query_obj, force_cached=False):
    cache_key = self.query_cache_key(query_obj)
    timeout = self.get_cache_timeout()
    cache = QueryCacheManager.get(key=cache_key, ...)

    if not cache.is_loaded:
        query_result = self.get_query_result(query_obj)
        annotation_data = self.get_annotation_data(query_obj)
        cache.set_query_result(...)

    return {
        "cache_key": cache_key,
        "df": cache.df,
        "query": cache.query,
        "is_cached": cache.is_cached,
        ...
    }
```

### 5. Database Query Execution

**File**: `superset/common/query_context_processor.py:267`

`QueryContextProcessor.get_query_result()`:

```python
def get_query_result(self, query_object: QueryObject) -> QueryResult:
    # Execute SQL query on the datasource
    result = query_context.datasource.query(query_object.to_dict())
    df = result.df

    # Normalize timestamps to pandas datetime format
    if not df.empty:
        df = self.normalize_df(df, query_object)

    # Handle time offset comparisons if specified
    if query_object.time_offsets:
        time_offsets = self.processing_time_offsets(df, query_object)
        df = time_offsets["df"]

    # Apply post-processing operations
    df = query_object.exec_post_processing(df)

    result.df = df
    return result
```

The `datasource.query()` call goes to the database connector (e.g., `SqlaTable.query()`), which:
- Converts the QueryObject dict to SQL using SQLAlchemy
- Executes the query via the database engine
- Returns a `QueryResult` with a pandas DataFrame

### 6. Post-Processing

**File**: `superset/common/query_object.py:484`

`QueryObject.exec_post_processing()`:
- Applies operations from the `post_processing` list in sequence
- Each operation is a pandas transformation (e.g., pivot, aggregate, histogram)
- Uses functions from `superset.utils.pandas_postprocessing`

Example for histogram:

```python
def exec_post_processing(self, df: DataFrame) -> DataFrame:
    for post_process in self.post_processing:
        operation = post_process.get("operation")  # "histogram"
        options = post_process.get("options", {})  # {column: "age", bins: 25}
        df = getattr(pandas_postprocessing, operation)(df, **options)
    return df
```

### 7. Response Formatting

**File**: `superset/charts/data/api.py:346`

`ChartDataRestApi._send_chart_response()`:
- Takes the result dict from the command
- Formats based on `result_format`:
  - **JSON**: Converts the DataFrame to a list of dicts
  - **CSV**: Converts to a CSV string
  - **XLSX**: Converts to Excel binary
- Returns a Flask Response with appropriate headers

```python
def _send_chart_response(self, result, form_data=None, datasource=None):
    result_format = result["query_context"].result_format

    if result_format == ChartDataResultFormat.JSON:
        queries = result["queries"]
        response_data = json.dumps(
            {"result": queries},
            default=json.json_int_dttm_ser,
            ignore_nan=True,
        )
        resp = make_response(response_data, 200)
        resp.headers["Content-Type"] = "application/json; charset=utf-8"
        return resp
```

## Key Objects and Data Structures

### QueryContext

**File**: `superset/common/query_context.py:41`

The main container for a chart data request.

```python
{
    datasource: BaseDatasource,            # Dataset (e.g., id=19, type="table")
    queries: list[QueryObject],            # List of queries to execute
    result_type: ChartDataResultType,      # "full", "samples", "query", etc.
    result_format: ChartDataResultFormat,  # "json", "csv", "xlsx"
    force: bool,                           # Bypass cache flag
    form_data: dict,                       # Original form_data from client
    custom_cache_timeout: int | None       # Override cache timeout
}
```

### QueryObject

**File**: `superset/common/query_object.py:79`

Represents a single database query.

```python
{
    columns: list[Column],          # Columns to select ["age"]
    metrics: list[Metric] | None,   # Aggregations to compute
    filters: list[FilterClause],    # WHERE clause filters
    extras: dict[str, Any],         # Additional query options
    post_processing: list[dict],    # Client-side transformations
    row_limit: int | None,          # LIMIT clause
    row_offset: int,                # OFFSET clause
    order_desc: bool,               # Sort direction
    time_range: str | None,         # Time filter range
    granularity: str | None,        # Temporal grouping column
    annotation_layers: list[dict],  # Annotations to overlay
    from_dttm: datetime | None,     # Computed time range start
    to_dttm: datetime | None        # Computed time range end
}
```

### QueryResult

**File**: `superset/models/helpers.py`

Returned from `datasource.query()`.

```python
{
    df: pd.DataFrame,     # The data from the database
    query: str,           # Executed SQL query
    from_dttm: datetime,  # Time range start
    to_dttm: datetime,    # Time range end
    error: str | None,    # Error message if failed
    status: QueryStatus   # success, failed, etc.
}
```

## Example Request Flow

For a histogram chart request like:

```bash
curl 'https://example.com/api/v1/chart/data' \
  -H 'content-type: application/json' \
  --data-raw '{
    "datasource":{"id":19,"type":"table"},
    "queries":[{
      "columns":["age"],
      "filters":[{
        "col":"time_start",
        "op":"TEMPORAL_RANGE",
        "val":"No filter"
      }],
      "row_limit":10000,
      "post_processing":[{
        "operation":"histogram",
        "options":{"column":"age","bins":25}
      }]
    }],
    "result_format":"json",
    "result_type":"full"
  }'
```

### Flow Summary

```
Client Request (curl)
  ↓
ChartDataRestApi.data()
  ↓ (parses JSON)
ChartDataQueryContextSchema.load()
  ↓ (creates objects)
QueryContext + [QueryObject]
  ↓
ChartDataCommand.run()
  ↓
QueryContextProcessor.get_payload()
  ↓ (for each QueryObject)
get_query_results() → _get_full()
  ↓
get_df_payload()
  ├→ Check Cache (QueryCacheManager)
  └→ get_query_result()
      ├→ datasource.query() → Build SQL → Execute → pandas DataFrame
      ├→ normalize_df() → Timestamp normalization
      └→ exec_post_processing() → Apply histogram operation
  ↓
Return payload {df, query, metadata}
  ↓
_send_chart_response()
  ↓ (format as JSON)
Flask Response → Client
```

## Architecture Patterns

The codebase follows a clean separation of concerns:

1. **API Layer** (`superset/charts/data/api.py`): Handles HTTP requests/responses
2. **Schema Layer** (`superset/charts/schemas.py`): Validates and deserializes input
3. **Command Layer** (`superset/commands/`): Orchestrates business logic
4. **Query Context/Processor** (`superset/common/`): Manages execution and caching
5. **Query Object**: Represents individual database queries
6. **Datasource Layer** (`superset/connectors/`): Database abstraction and SQL generation

### Key Benefits

- **Caching**: Results cached at multiple levels (query result, query context)
- **Security**: Access control enforced via `raise_for_access()`
- **Flexibility**: Supports multiple result types and formats
- **Post-processing**: Client-side transformations without re-querying the database
- **Time Comparison**: Built-in support for time offset queries
- **Annotations**: Overlay additional data layers on charts

## Caching Strategy

**File**: `superset/common/utils/query_cache_manager.py`

Cache keys are generated from:
- Query object (columns, metrics, filters, etc.)
- Datasource UID
- RLS (Row Level Security) rules
- User context (if per-user caching is enabled)
- Time range (using relative time strings, not absolute timestamps)

This ensures that:
- The same query returns cached results
- Different users see appropriate cached data
- Time-relative queries (e.g., "Last 7 days") cache correctly

diff --git a/query-object-documentation.md b/query-object-documentation.md
new file mode 100644
index 0000000000..8769a0efa6
--- /dev/null
+++ b/query-object-documentation.md

# QueryObject Documentation

**File**: `superset/common/query_object.py`

The `QueryObject` class represents a single database query specification constructed on the client. It describes *what* data to fetch (columns, metrics, filters) without specifying *how* to fetch it. The datasource connector is responsible for translating the QueryObject into the appropriate query language (SQL, GraphQL, etc.).
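To make the *what*-vs-*how* split concrete, here is a minimal illustrative sketch. This is not Superset code: the field names follow this document, but `to_sql`, the `sales` table, and the raw `"COUNT(*)"` metric string are hypothetical simplifications standing in for a connector's translation step.

```python
# Hypothetical sketch: a QueryObject-style spec is purely declarative.
# Field names follow this document; everything else is illustrative.
query_spec = {
    "columns": ["country"],
    "metrics": ["COUNT(*)"],  # simplified: real metrics are names or adhoc dicts
    "filter": [{"col": "year", "op": ">=", "val": 2020}],
    "row_limit": 100,
}

def to_sql(spec: dict) -> str:
    """Toy translation of a declarative spec into SQL (the connector's job)."""
    select = ", ".join(spec["columns"] + spec.get("metrics", []))
    where = " AND ".join(
        f"{f['col']} {f['op']} {f['val']}" for f in spec.get("filter", [])
    )
    sql = f"SELECT {select} FROM sales"  # "sales" is a placeholder table
    if where:
        sql += f" WHERE {where}"
    if spec.get("metrics") and spec["columns"]:
        sql += " GROUP BY " + ", ".join(spec["columns"])  # metrics imply GROUP BY
    if spec.get("row_limit"):
        sql += f" LIMIT {spec['row_limit']}"
    return sql

print(to_sql(query_spec))
```

The point is only the division of labor: the spec never mentions SQL, and a different connector could translate the same spec into another query language.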
## Table of Contents

- [Core Query Specification](#core-query-specification)
- [Data Selection](#data-selection)
- [Filtering](#filtering)
- [Aggregation & Metrics](#aggregation--metrics)
- [Time/Temporal](#timetemporal)
- [Sorting & Limiting](#sorting--limiting)
- [Series Limiting (Timeseries)](#series-limiting-timeseries)
- [Post-Processing](#post-processing)
- [Annotations](#annotations)
- [Query Execution Control](#query-execution-control)
- [Deprecated Fields](#deprecated-fields)

---

## Core Query Specification

### `datasource`

**Type**: `BaseDatasource | None`

**Description**: Reference to the datasource (dataset, table, or query) from which to fetch data. This is the data source object itself, not just an identifier.

**Default**: `None`

**Usage**: Set automatically by the QueryContext when loading from the schema. The datasource provides metadata about available columns and metrics, and handles the actual query execution.

**Example**:
```python
# Automatically set during query context creation
query_object.datasource  # <SqlaTable 'public.sales_data'>
```

---

### `columns`

**Type**: `list[Column]`

**Column Type**: `Union[AdhocColumn, str]`

**Description**: List of dimensions (non-aggregated columns) to select in the query. These become GROUP BY columns if metrics are specified, or simple SELECT columns otherwise.

**Default**: `[]` (empty list)

**Formats Supported**:
1. **String reference**: `"country"` - references a physical column
2. **Adhoc column** (dict):
   ```python
   {
       "label": "upper_name",
       "sqlExpression": "UPPER(name)",
       "hasCustomLabel": True
   }
   ```

**Usage Notes**:
- Formula annotations don't count as columns (they are filtered out)
- Used as the default for `series_columns` in timeseries queries
- Referenced in GROUP BY clauses when metrics are present

**Related**:
- `column_names` property: Returns the list of column labels as strings
- `series_columns`: Subset of columns used for series limiting

**Example**:
```python
query_object.columns = ["country", "state", "city"]

# Or with adhoc columns:
query_object.columns = [
    "country",
    {
        "label": "year",
        "sqlExpression": "EXTRACT(YEAR FROM created_at)",
        "hasCustomLabel": True
    }
]
```

---

### `metrics`

**Type**: `list[Metric] | None`

**Metric Type**: `Union[AdhocMetric, str]`

**Description**: List of aggregate expressions to compute. These become the SELECT clause aggregations in SQL queries.

**Default**: `None`

**Formats Supported**:
1. **String reference**: `"count"` - references a predefined metric
2. **Legacy format**: `{"label": "count"}` - references a predefined metric (converted to string)
3. **Adhoc SIMPLE metric**:
   ```python
   {
       "expressionType": "SIMPLE",
       "aggregate": "SUM",
       "column": {"column_name": "revenue"},
       "label": "Total Revenue",
       "hasCustomLabel": True
   }
   ```
4. **Adhoc SQL metric**:
   ```python
   {
       "expressionType": "SQL",
       "sqlExpression": "SUM(price * quantity)",
       "label": "Total Sales",
       "hasCustomLabel": True
   }
   ```

**Usage Notes**:
- When metrics are specified, queries automatically include GROUP BY
- When `None` or empty, no aggregation is performed
- The legacy format `{"label": "..."}` is automatically converted to a string

**Related**:
- `metric_names` property: Returns the list of metric labels as strings
- `is_rowcount`: Alternative to metrics for counting rows

**Example**:
```python
# Simple metric references
query_object.metrics = ["count", "sum__revenue"]

# Adhoc metrics
query_object.metrics = [
    {
        "expressionType": "SIMPLE",
        "aggregate": "AVG",
        "column": {"column_name": "price"},
        "label": "Average Price"
    },
    {
        "expressionType": "SQL",
        "sqlExpression": "SUM(CASE WHEN status='completed' THEN 1 ELSE 0 END)",
        "label": "Completed Orders"
    }
]
```

---

## Data Selection

### `is_timeseries`

**Type**: `bool`

**Description**: Indicates whether this query is a timeseries query (data points over time). Affects how results are processed and displayed.

**Default**: Automatically determined - `True` if `DTTM_ALIAS` (the special time column constant) is in `columns`; otherwise it can be explicitly set

**Usage Notes**:
- Influences `series_columns` initialization
- Affects post-processing behavior in some visualizations
- Used to determine whether series limiting should apply

**Example**:
```python
# Automatically set to True for timeseries
query_object.columns = ["__timestamp", "country"]
query_object.is_timeseries  # True

# Explicitly set
query_object.is_timeseries = True
```

---

### `is_rowcount`

**Type**: `bool`

**Description**: When `True`, returns only the total row count instead of the actual data. Used for preview operations or checking data volume.
**Default**: `False`

**Usage Notes**:
- Mutually exclusive with normal metric aggregation
- The result contains just a count, no actual rows
- Useful for pagination and data size checks

**Example**:
```python
query_object.is_rowcount = True
# Query returns: {"count": 15234}
```

---

### `result_type`

**Type**: `ChartDataResultType | None`

**Values**: `"full"`, `"samples"`, `"query"`, `"results"`, `"post_processed"`, `"columns"`, `"timegrains"`, `"drill_detail"`

**Description**: Specifies what type of result to return. Controls which processing pipeline is used.

**Default**: `None` (inherits from the QueryContext if not specified)

**Types**:
- `"full"`: Complete query execution with all data
- `"samples"`: Sample rows from the datasource (no metrics, limited rows)
- `"query"`: Return the query string without executing it
- `"results"`: Like `"full"` but with minimal metadata
- `"post_processed"`: Full results for client-side post-processing
- `"columns"`: Return column metadata only
- `"timegrains"`: Return available time granularities
- `"drill_detail"`: Return drill-through detail rows

**Example**:
```python
query_object.result_type = ChartDataResultType.SAMPLES
# Returns sample rows without aggregation
```

---

## Filtering

### `filter`

**Type**: `list[QueryObjectFilterClause]`

**FilterClause Type**:
```python
{
    "col": Union[str, AdhocColumn],  # Column to filter
    "op": str,                       # Operator (e.g., "==", "IN", "LIKE")
    "val": FilterValues | None,      # Value(s) to compare
    "grain": str | None,             # Time grain for temporal filters
    "isExtra": bool | None           # Added by dashboard filters
}
```

**Description**: List of filter conditions to apply to the query. These become WHERE clause conditions in SQL.
**Default**: `[]` (empty list)

**Supported Operators**: `==`, `!=`, `>`, `<`, `>=`, `<=`, `IN`, `NOT IN`, `LIKE`, `ILIKE`, `REGEX`, `TEMPORAL_RANGE`, `IS NULL`, `IS NOT NULL`, and more (see the `FilterOperator` enum)

**Usage Notes**:
- Filters are ANDed together
- `isExtra: True` indicates the filter was added by dashboard/native filters
- Temporal filters (`TEMPORAL_RANGE`) require `grain` to be set
- Filters go through Jinja template processing if the datasource supports it
- Sanitized against SQL injection before execution

**Example**:
```python
query_object.filter = [
    {
        "col": "country",
        "op": "IN",
        "val": ["USA", "Canada", "Mexico"]
    },
    {
        "col": "revenue",
        "op": ">=",
        "val": 1000
    },
    {
        "col": "created_at",
        "op": "TEMPORAL_RANGE",
        "val": "Last 30 days",
        "grain": "P1D"
    }
]
```

---

### `extras`

**Type**: `dict[str, Any]`

**Description**: Additional query parameters and modifiers. This is an extensible dictionary for extra query options.

**Default**: `{}` (empty dict)

**Common Keys**:
- `"where"`: Raw SQL WHERE clause (added via AND)
- `"having"`: Raw SQL HAVING clause for aggregate filters
- `"time_grain_sqla"`: Time granularity (e.g., `"P1D"` for 1 day)
- `"relative_start"`: Start reference point (`"today"` or `"now"`)
- `"relative_end"`: End reference point (`"today"` or `"now"`)
- `"instant_time_comparison_range"`: For advanced time comparison features

**Usage Notes**:
- WHERE and HAVING clauses support Jinja templates
- All SQL clauses are sanitized for security
- `time_grain_sqla` controls the temporal aggregation level
- **Warning**: Direct SQL clauses (`where`, `having`) bypass some security layers

**Example**:
```python
query_object.extras = {
    "where": "status = 'active' AND archived = false",
    "having": "SUM(revenue) > 10000",
    "time_grain_sqla": "P1W",  # Weekly granularity
    "relative_start": "now"
}
```

---

### `apply_fetch_values_predicate`

**Type**: `bool`
**Description**: When `True`, applies additional WHERE clause predicates defined in the datasource configuration for fetching filter values.

**Default**: `False`

**Usage Notes**:
- Used primarily when loading filter options
- Applies datasource-specific filter predicates
- Helps limit the domain of filter values

**Example**:
```python
query_object.apply_fetch_values_predicate = True
# Applies any predicates configured on the datasource
```

---

### `applied_time_extras`

**Type**: `dict[str, str]`

**Description**: Mapping of temporal extras that have been applied to the query. Used for tracking which time filters were actually used.

**Default**: `{}` (empty dict)

**Usage Notes**:
- Populated during query execution
- Used for displaying which time filters are active
- Keys are typically time column names
- Values are human-readable descriptions

**Example**:
```python
query_object.applied_time_extras = {
    "__time_range": "1 year ago : now",
    "__time_grain": "P1D"
}
```

---

## Aggregation & Metrics

### `groupby` (DEPRECATED)

**Status**: ⚠️ **DEPRECATED** - Use `columns` instead

**Type**: N/A (automatically renamed to `columns`)

**Description**: Legacy field name for grouping columns. Automatically converted to `columns` during initialization.

**Migration**: Replace `groupby` with `columns` in all new code.

---

## Time/Temporal

### `granularity`

**Type**: `str | None`

**Description**: Name of the temporal column to use for time-based operations (filtering, grouping). This is the primary time dimension for the query.
**Default**: `None`

**Usage Notes**:
- Used for temporal filtering and aggregation
- Essential for timeseries queries
- References a datetime column in the datasource
- Used in time range filters

**Related**:
- `extras["time_grain_sqla"]`: Controls temporal aggregation granularity
- `time_range`: The time range to filter by
- `granularity_sqla`: Deprecated alias for `granularity`

**Example**:
```python
query_object.granularity = "order_date"
# All time operations will use the order_date column
```

---

### `granularity_sqla` (DEPRECATED)

**Status**: ⚠️ **DEPRECATED** - Use `granularity` instead

**Type**: N/A (automatically renamed to `granularity`)

**Description**: Legacy SQL-specific field name for the temporal column. Automatically converted during initialization.

**Migration**: Replace `granularity_sqla` with `granularity` in all new code.

---

### `time_range`

**Type**: `str | None`

**Description**: Human-readable time range specification for filtering temporal data. Supports both relative and absolute formats.
**Default**: `None`

**Supported Formats**:
- **Relative**: `"Last 7 days"`, `"Last week"`, `"Last month"`, `"Last quarter"`, `"Last year"`
- **Relative with number**: `"Last 30 days"`, `"Last 6 months"`, `"Next 2 weeks"`
- **Absolute**: `"2023-01-01 : 2023-12-31"` (ISO 8601)
- **Mixed**: `"2023-01-01 : now"`, `"1 year ago : now"`
- **No filter**: `"No filter"`

**Usage Notes**:
- Parsed into the `from_dttm` and `to_dttm` datetime objects
- Relative times are resolved at query execution time
- Used for cache key generation (not the parsed datetime values)
- Supports parsedatetime syntax

**Related**:
- `from_dttm`: Computed start datetime
- `to_dttm`: Computed end datetime
- `granularity`: Column to filter on

**Example**:
```python
query_object.time_range = "Last 30 days"
# Computed at runtime: from_dttm = now() - 30 days, to_dttm = now()

query_object.time_range = "2023-01-01 : 2023-06-30"
# Explicit range
```

---

### `from_dttm`

**Type**: `datetime | None`

**Description**: Computed start datetime for the time range filter. Automatically calculated from `time_range`.

**Default**: `None`

**Usage Notes**:
- Set automatically during query processing
- Not included in the cache key (`time_range` is used instead)
- May be overridden for time offset queries

**Example**:
```python
query_object.time_range = "Last 7 days"
# After processing:
query_object.from_dttm  # datetime(2024, 1, 15, 0, 0, 0)
```

---

### `to_dttm`

**Type**: `datetime | None`

**Description**: Computed end datetime for the time range filter. Automatically calculated from `time_range`.
**Default**: `None`

**Usage Notes**:
- Set automatically during query processing
- Not included in the cache key (`time_range` is used instead)
- May be overridden for time offset queries

**Example**:
```python
query_object.time_range = "Last 7 days"
# After processing:
query_object.to_dttm  # datetime(2024, 1, 22, 23, 59, 59)
```

---

### `inner_from_dttm`

**Type**: `datetime | None`

**Description**: Inner time range start for nested temporal operations. Used when applying time filters to queries that don't have time as a dimension.

**Default**: `None`

**Usage Notes**:
- Used in advanced time comparison scenarios
- Typically the same as `from_dttm` for simple queries
- May differ when time offsets are applied

**Example**:
```python
# Set during time offset processing
query_object.inner_from_dttm = datetime(2024, 1, 1)
```

---

### `inner_to_dttm`

**Type**: `datetime | None`

**Description**: Inner time range end for nested temporal operations. Used when applying time filters to queries that don't have time as a dimension.

**Default**: `None`

**Usage Notes**:
- Used in advanced time comparison scenarios
- Typically the same as `to_dttm` for simple queries
- May differ when time offsets are applied

**Example**:
```python
# Set during time offset processing
query_object.inner_to_dttm = datetime(2024, 1, 31)
```

---

### `time_shift`

**Type**: `str | None`

**Description**: Shifts the entire time range by a specified offset. Used for comparing data across different time periods.
**Default**: `None`

**Supported Formats**: Any parsedatetime-compatible string (e.g., `"1 week ago"`, `"3 months ago"`, `"1 year ago"`)

**Usage Notes**:
- Applied to both `from_dttm` and `to_dttm`
- Different from `time_offsets` (which creates separate queries)
- Affects the main query's time range

**Example**:
```python
query_object.time_range = "Last 7 days"
query_object.time_shift = "1 week ago"
# Shifts the entire 7-day window back by 1 week
```

---

### `time_offsets`

**Type**: `list[str]`

**Description**: List of time offsets for creating comparison queries. Each offset generates an additional query with shifted time ranges, enabling time-over-time comparisons.

**Default**: `[]` (empty list)

**Supported Formats**:
1. **Relative**: `"1 week ago"`, `"1 year ago"`, `"3 months ago"`
2. **Date range** (with feature flag): `"2023-01-01 : 2023-01-31"`
3. **Special**: `"inherit"` (uses the time range span)
4. **Custom date**: `"2023-06-15"` (compares to this specific date)

**Usage Notes**:
- Creates a separate query for each offset
- Results are joined with the main query results
- Offset metrics are renamed (e.g., `revenue` → `revenue__1 week ago`)
- Requires the `DATE_RANGE_TIMESHIFTS_ENABLED` feature flag for the date range format
- Used for year-over-year and month-over-month comparisons

**Related**:
- A time grain is required for proper joining of offset results
- `series_limit` applies to the main query, not the offset queries

**Example**:
```python
query_object.time_offsets = ["1 week ago", "1 year ago"]
# Generates 3 queries total:
# 1. Main query (this week)
# 2. Same query shifted back 1 week
# 3. Same query shifted back 1 year
# Results joined on time + other dimensions

# Or with a date range (requires feature flag):
query_object.time_offsets = ["2023-01-01 : 2023-01-31"]
# Compares the current period to January 2023
```

---

## Sorting & Limiting

### `orderby`

**Type**: `list[OrderBy]`

**OrderBy Type**: `tuple[Union[Metric, Column], bool]`

**Description**: List of ordering specifications. Each tuple contains a column/metric and a boolean indicating ascending order.

**Default**: `[]` (empty list)

**Format**: `[(column_or_metric, is_ascending), ...]`

**Usage Notes**:
- Boolean `True` = ascending order
- Boolean `False` = descending order
- Can order by both columns and metrics
- Applied after aggregation

**Related**:
- `order_desc`: Default sort direction (deprecated in favor of explicit `orderby`)

**Example**:
```python
query_object.orderby = [
    ("revenue", False),  # Order by revenue descending
    ("country", True)    # Then by country ascending
]
```

---

### `order_desc`

**Type**: `bool`

**Description**: Default sort direction when `orderby` is not specified. Primarily affects how series are ordered.

**Default**: `True` (descending)

**Usage Notes**:
- Less flexible than `orderby`
- Prefer explicit `orderby` for complex sorting
- Used mainly for backward compatibility

**Example**:
```python
query_object.order_desc = False
# Results sorted in ascending order
```

---

### `row_limit`

**Type**: `int | None`

**Description**: Maximum number of rows to return from the query. Acts as a SQL LIMIT clause.
**Default**: `None` (uses the system default)

**Range**: `>= 0` (0 means no limit)

**Usage Notes**:
- Applied after all filtering and aggregation
- Different from `series_limit` (which limits the number of timeseries)
- The system default comes from `config["ROW_LIMIT"]`
- Can be overridden per query

**Related**:
- `row_offset`: Works with `row_limit` for pagination
- `series_limit`: For limiting the timeseries/series count

**Example**:
```python
query_object.row_limit = 1000
# Returns at most 1000 rows
```

---

### `row_offset`

**Type**: `int`

**Description**: Number of rows to skip before returning results. Acts as a SQL OFFSET clause for pagination.

**Default**: `0` (no offset)

**Range**: `>= 0`

**Usage Notes**:
- Used with `row_limit` for pagination
- Applied after ordering
- Useful for infinite scroll or paginated tables

**Example**:
```python
# Page 3 of results (20 per page)
query_object.row_limit = 20
query_object.row_offset = 40  # Skip the first 40 rows
```

---

## Series Limiting (Timeseries)

### `series_columns`

**Type**: `list[Column]`

**Description**: Subset of `columns` to use when limiting the number of series in timeseries queries. Defines which dimensions create distinct series.
**Default**: Automatically initialized based on context:
- If explicitly provided, uses those columns
- If `is_timeseries=True` and `metrics` exist, uses all `columns`
- Otherwise, an empty list

**Usage Notes**:
- All `series_columns` must be present in `columns`
- Used with `series_limit` to control the series count
- Validated during query validation
- Creates one series per unique combination of `series_columns` values

**Related**:
- `series_limit`: Maximum number of series
- `series_limit_metric`: Metric to use for ranking series
- `group_others_when_limit_reached`: Whether to group the remaining series

**Example**:
```python
query_object.columns = ["country", "product", "date"]
query_object.series_columns = ["country", "product"]
# Creates one series per (country, product) combination
```

---

### `series_limit`

**Type**: `int`

**Description**: Maximum number of series to return in a timeseries query. Series are ranked by `series_limit_metric` and the top N are kept.

**Default**: `0` (no limit)

**Usage Notes**:
- Only applies to timeseries queries
- Requires `series_columns` and `series_limit_metric` to be set
- Top series are selected by the specified metric
- See `group_others_when_limit_reached` for handling excluded series

**Related**:
- `series_columns`: Dimensions that define series
- `series_limit_metric`: Metric used for ranking
- `timeseries_limit`: Deprecated alias

**Example**:
```python
query_object.series_limit = 10
query_object.series_limit_metric = "revenue"
query_object.series_columns = ["country"]
# Returns the top 10 countries by revenue
```

---

### `series_limit_metric`

**Type**: `Metric | None`

**Description**: The metric to use for ranking series when `series_limit` is applied. Determines which series are kept.
+ +**Default**: `None` + +**Usage Notes**: +- Required when `series_limit` is set +- Must be one of the metrics in the query +- Series are ranked in descending order by this metric +- Can be a string reference or adhoc metric + +**Related**: +- `series_limit`: Number of series to keep +- `timeseries_limit_metric`: Deprecated alias + +**Example**: +```python +query_object.series_limit_metric = "sum__revenue" +# Ranks series by total revenue +``` + +--- + +### `group_others_when_limit_reached` + +**Type**: `bool` + +**Description**: When `True` and series limit is reached, groups all remaining series into an "Others" category. Prevents incomplete data visualization. + +**Default**: `False` + +**Usage Notes**: +- Only relevant when `series_limit` is set +- Aggregates metrics for excluded series +- Helps show complete totals while limiting series count +- "Others" category appears as a separate series + +**Example**: +```python +query_object.series_limit = 5 +query_object.series_limit_metric = "count" +query_object.group_others_when_limit_reached = True +# Shows top 5 series + "Others" category with remaining aggregated +``` + +--- + +### `timeseries_limit` (DEPRECATED) + +**Status**: ⚠️ **DEPRECATED** - Use `series_limit` instead + +**Type**: N/A (automatically renamed to `series_limit`) + +**Description**: Legacy field name for series limit. Automatically converted during initialization. + +**Migration**: Replace `timeseries_limit` with `series_limit` in all new code. + +--- + +### `timeseries_limit_metric` (DEPRECATED) + +**Status**: ⚠️ **DEPRECATED** - Use `series_limit_metric` instead + +**Type**: N/A (automatically renamed to `series_limit_metric`) + +**Description**: Legacy field name for series limit metric. Automatically converted during initialization. + +**Migration**: Replace `timeseries_limit_metric` with `series_limit_metric` in all new code. 
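Since both legacy names are simply renamed during initialization, the migration amounts to a key rewrite over the constructor's keyword arguments. The sketch below is illustrative only — the old-to-new mapping comes from the fields documented above, but the helper name and exact mechanics are assumptions, not Superset's actual implementation:

```python
# Illustrative sketch only: Superset performs this renaming internally
# during QueryObject initialization; this helper mimics that behavior.
DEPRECATED_FIELDS = {
    "timeseries_limit": "series_limit",
    "timeseries_limit_metric": "series_limit_metric",
}


def migrate_deprecated_fields(kwargs: dict) -> dict:
    """Return a copy of kwargs with legacy field names renamed."""
    migrated = dict(kwargs)
    for old_name, new_name in DEPRECATED_FIELDS.items():
        if old_name in migrated:
            value = migrated.pop(old_name)
            # If both the old and new names are supplied, the new name wins.
            migrated.setdefault(new_name, value)
    return migrated
```

Under this scheme, passing `timeseries_limit=10` behaves exactly like passing `series_limit=10`, which is why new code should use the new names directly.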
+ +--- + +## Post-Processing + +### `post_processing` + +**Type**: `list[dict[str, Any]]` + +**Description**: Ordered list of post-processing operations to apply to the query results. These transformations run on the DataFrame after SQL execution. + +**Default**: `[]` (empty list) + +**Operation Format**: +```python +{ + "operation": str, # Operation name (from pandas_postprocessing module) + "options": dict # Operation-specific parameters +} +``` + +**Available Operations**: +- `aggregate`: Group and aggregate data +- `pivot`: Pivot table transformation +- `rolling`: Rolling window calculations +- `sort`: Sort data +- `select`: Select/rename columns +- `contribution`: Calculate contribution percentages +- `prophet`: Time series forecasting +- `boxplot`: Statistical boxplot calculations +- `histogram`: Create histogram bins +- `geohash_decode`: Decode geohash to lat/lon +- `geohash_encode`: Encode lat/lon to geohash +- `geodetic_parse`: Parse geodetic coordinates +- And more... + +**Usage Notes**: +- Operations applied in sequence +- Each operation receives output of previous operation +- All operations are from `superset.utils.pandas_postprocessing` +- Validation ensures operation names are valid +- Useful for client-side transformations without re-querying + +**Example**: +```python +query_object.post_processing = [ + { + "operation": "pivot", + "options": { + "index": ["country"], + "columns": ["product"], + "aggregates": { + "revenue": {"operator": "sum"} + } + } + }, + { + "operation": "sort", + "options": { + "columns": {"revenue": False} # Descending + } + } +] +``` + +**Histogram Example**: +```python +query_object.post_processing = [ + { + "operation": "histogram", + "options": { + "column": "age", + "bins": 25 + } + } +] +``` + +--- + +## Annotations + +### `annotation_layers` + +**Type**: `list[dict[str, Any]]` + +**Description**: List of annotation layers to overlay on the chart. 
Annotations add contextual information like events, ranges, or reference data. + +**Default**: `[]` (empty list, formula annotations filtered out) + +**Layer Structure**: +```python +{ + "annotationType": str, # "FORMULA", "NATIVE", "line", "table" + "name": str, # Layer name + "value": Any, # Layer-specific value (ID, formula, etc.) + "show": bool, # Whether to show the layer + "sourceType": str, # "NATIVE", "line", "table", etc. + "color": str, # Layer color + "opacity": str, # "opacityLow", "opacityMedium", "opacityHigh" + "style": str, # "solid", "dashed", "dotted", "longDashed" + "width": float, # Line width + "showMarkers": bool, # Show markers on line annotations + "showLabel": bool, # Always show label + "hideLine": bool, # Hide line (show markers only) + "timeColumn": str, # Column with timestamps + "intervalEndColumn": str, # For interval annotations + "titleColumn": str, # Column for titles + "descriptionColumns": list, # Columns for descriptions + "overrides": dict # Override query properties +} +``` + +**Annotation Types**: +- `"FORMULA"`: Simple formula overlays (e.g., constant line) - **filtered out, don't affect query** +- `"NATIVE"`: Native Superset annotations stored in DB +- `"line"` / `"table"`: Annotations from other charts/queries + +**Usage Notes**: +- Formula annotations are filtered out (don't affect payload) +- Other annotations trigger additional data fetches +- Annotations can override time range, granularity, etc. 
+- Used primarily for time series visualizations + +**Example**: +```python +query_object.annotation_layers = [ + { + "annotationType": "NATIVE", + "name": "Important Events", + "value": 1, # annotation_layer_id + "show": True, + "sourceType": "NATIVE", + "color": "#ff0000" + }, + { + "annotationType": "line", + "name": "Baseline", + "value": 42, # chart_id + "show": True, + "sourceType": "line", + "style": "dashed", + "overrides": { + "time_range": "Last year" + } + } +] +``` + +--- + +## Query Execution Control + +### `url_params` + +**Type**: `dict[str, str]` (keys and values are strings) + +**Description**: Optional query parameters passed from dashboard or Explore view URLs. Used for dynamic filtering and Jinja template variables. + +**Default**: Not set (from kwargs) + +**Usage Notes**: +- Available in Jinja templates via `url_param` function +- Typically set by dashboard filters or URL parameters +- Keys and values are always strings +- Can be used for dynamic query customization + +**Example**: +```python +query_object.url_params = { + "country_filter": "USA", + "min_date": "2024-01-01" +} + +# In Jinja template: +# WHERE country = '{{ url_param("country_filter") }}' +``` + +--- + +## Deprecated Fields + +### `where` (DEPRECATED) + +**Status**: ⚠️ **DEPRECATED** - Use `extras["where"]` instead + +**Type**: Field moved to extras during initialization + +**Description**: Raw SQL WHERE clause. Automatically moved to `extras["where"]`. + +**Migration**: Use `extras["where"]` directly instead of passing `where` parameter. + +--- + +### `having` (DEPRECATED) + +**Status**: ⚠️ **DEPRECATED** - Use `extras["having"]` instead + +**Type**: Field moved to extras during initialization + +**Description**: Raw SQL HAVING clause. Automatically moved to `extras["having"]`. + +**Migration**: Use `extras["having"]` directly instead of passing `having` parameter. 
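The `where`/`having` relocation described above can be sketched the same way: the raw SQL clauses are lifted into the `extras` dictionary during initialization. The function name and the precedence choice (an explicit `extras` entry winning over the deprecated top-level field) are assumptions for illustration, not Superset's actual code:

```python
def move_sql_clauses_to_extras(kwargs: dict) -> dict:
    """Return a copy of kwargs with raw where/having moved into extras."""
    migrated = dict(kwargs)
    extras = dict(migrated.get("extras") or {})
    for clause in ("where", "having"):
        if clause in migrated:
            # Assumed precedence: an explicit extras entry takes priority
            # over the deprecated top-level field.
            extras.setdefault(clause, migrated.pop(clause))
    migrated["extras"] = extras
    return migrated
```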


---

## Properties

### `metric_names`

**Type**: `list[str]` (read-only property)

**Description**: Returns the metric labels as strings. Converts adhoc metrics to their labels.

**Usage**: Access metric names without dealing with adhoc metric dictionaries.

**Example**:
```python
query_object.metrics = ["count", {"expressionType": "SQL", "label": "revenue", ...}]
query_object.metric_names  # ["count", "revenue"]
```

---

### `column_names`

**Type**: `list[str]` (read-only property)

**Description**: Returns the column labels as strings. Converts adhoc columns to their labels.

**Usage**: Access column names without dealing with adhoc column dictionaries.

**Example**:
```python
query_object.columns = ["country", {"label": "year", "sqlExpression": "...", ...}]
query_object.column_names  # ["country", "year"]
```

---

## Methods

### `validate(raise_exceptions: bool = True) -> QueryObjectValidationError | None`

**Description**: Validates the query object for correctness. Checks for duplicate labels, missing series columns, and invalid time offsets, and sanitizes filters.

**Parameters**:
- `raise_exceptions`: If `True`, raises an exception on validation error. If `False`, returns the error object.

**Validates**:
1. No missing series columns
2. No duplicate column/metric labels
3. Valid time offset configurations
4. Sanitizes WHERE/HAVING clauses to guard against SQL injection

**Returns**: `None` if valid, or `QueryObjectValidationError` if `raise_exceptions=False`

**Example**:
```python
try:
    query_object.validate()
except QueryObjectValidationError as e:
    print(f"Validation failed: {e.message}")
```

---

### `to_dict() -> dict[str, Any]`

**Description**: Serializes the QueryObject to a dictionary. Used for passing to datasource connectors and caching.

**Returns**: Dictionary representation of the query object

**Usage**: Called internally when executing queries on datasources.
+ +**Example**: +```python +query_dict = query_object.to_dict() +# { +# "columns": ["country"], +# "metrics": ["count"], +# "filters": [...], +# ... +# } +``` + +--- + +### `cache_key(**extra: Any) -> str` + +**Description**: Generates a unique cache key for this query. Uses MD5/SHA hash of query parameters. + +**Parameters**: +- `**extra`: Additional key-value pairs to include in cache key + +**Key Components**: +- All query parameters from `to_dict()` +- `time_range` (not `from_dttm`/`to_dttm` - for relative time caching) +- `datasource.uid` +- `result_type` +- `post_processing` +- `time_offsets` +- `annotation_layers` (if present) +- User impersonation key (if enabled) + +**Returns**: MD5/SHA hash string + +**Example**: +```python +cache_key = query_object.cache_key( + time_offset="1 week ago", + time_grain="P1D" +) +# "a3f5c8e9d..." +``` + +--- + +### `exec_post_processing(df: DataFrame) -> DataFrame` + +**Description**: Applies all post-processing operations to a DataFrame in sequence. 
+ +**Parameters**: +- `df`: Pandas DataFrame from query execution + +**Returns**: Transformed DataFrame + +**Raises**: `InvalidPostProcessingError` if operation is invalid + +**Example**: +```python +df = datasource.query(query_object.to_dict()).df +processed_df = query_object.exec_post_processing(df) +``` + +--- + +## Type Definitions + +### `Column` + +**Type**: `Union[AdhocColumn, str]` + +**AdhocColumn Structure**: +```python +{ + "label": str, # Column label + "sqlExpression": str, # SQL expression + "hasCustomLabel": bool, # Whether label is custom + "columnType": str, # "BASE_AXIS" or "SERIES" + "timeGrain": str # Time grain if temporal +} +``` + +--- + +### `Metric` + +**Type**: `Union[AdhocMetric, str]` + +**AdhocMetric Structure**: +```python +{ + "expressionType": "SIMPLE" | "SQL", + "label": str, + "hasCustomLabel": bool, + + # For SIMPLE: + "aggregate": "SUM" | "AVG" | "COUNT" | "MIN" | "MAX" | "COUNT_DISTINCT", + "column": AdhocMetricColumn, + + # For SQL: + "sqlExpression": str +} +``` + +--- + +### `OrderBy` + +**Type**: `tuple[Union[Metric, Column], bool]` + +**Format**: `(column_or_metric, is_ascending)` + +**Example**: `("revenue", False)` means "ORDER BY revenue DESC" + +--- + +## Common Usage Patterns + +### Basic Query + +```python +query_object = QueryObject( + columns=["country", "city"], + metrics=["count"], + row_limit=100 +) +``` + +### Timeseries Query + +```python +query_object = QueryObject( + columns=["__timestamp", "country"], + metrics=["sum__revenue"], + granularity="order_date", + time_range="Last 30 days", + extras={"time_grain_sqla": "P1D"}, + is_timeseries=True +) +``` + +### Time Comparison Query + +```python +query_object = QueryObject( + columns=["country"], + metrics=["sum__revenue"], + time_range="Last 7 days", + time_offsets=["1 week ago", "1 year ago"], + granularity="order_date" +) +``` + +### Post-Processed Query + +```python +query_object = QueryObject( + columns=["age"], + row_limit=10000, + post_processing=[ + { + 
"operation": "histogram", + "options": {"column": "age", "bins": 25} + } + ] +) +``` + +### Filtered and Sorted Query + +```python +query_object = QueryObject( + columns=["country", "product"], + metrics=["sum__revenue", "count"], + filter=[ + {"col": "status", "op": "==", "val": "completed"}, + {"col": "revenue", "op": ">=", "val": 100} + ], + orderby=[("sum__revenue", False)], # Descending + row_limit=50 +) +``` diff --git a/semantic-layer-architecture.md b/semantic-layer-architecture.md new file mode 100644 index 0000000000..7725f475a5 --- /dev/null +++ b/semantic-layer-architecture.md @@ -0,0 +1,1264 @@ +# Semantic Layer Architecture for Superset + +## Overview + +This document outlines the architecture for introducing semantic layers to Superset as siblings to databases. Semantic layers will contain "explorables" (semantic views) that can be explored to create charts, similar to how datasets/queries work today. + +## Goals + +1. Make semantic layers first-class citizens alongside databases +2. Define a common protocol (`Explorable`) for both datasets and semantic views +3. Minimize code changes by bifurcating at the right abstraction level +4. Maintain backward compatibility +5. Enable future extensibility for other data source types + +## Recommended Bifurcation Point + +### Primary: The `datasource.query()` Interface + +**File**: `superset/common/query_context_processor.py:267-279` + +This is the critical junction where we should bifurcate: + +```python +def get_query_result(self, query_object: QueryObject) -> QueryResult: + query = "" + if isinstance(query_context.datasource, Query): + result = query_context.datasource.exc_query(query_object.to_dict()) + else: + result = query_context.datasource.query(query_object.to_dict()) # ← KEY LINE + + df = result.df + # ... normalization and post-processing + return result +``` + +### Why This Is The Perfect Bifurcation Point + +1. 
**Clean interface**: The contract is simple: `query(query_obj: dict) -> QueryResult` +2. **Connector-agnostic**: The `QueryObject` describes *what* to fetch (columns, metrics, filters), not *how* +3. **Everything downstream is already polymorphic**: Post-processing, caching, and formatting all work with DataFrames +4. **Everything upstream is declarative**: Schema parsing and validation don't care about the data source type +5. **Minimal code changes**: Most of the codebase already treats datasources polymorphically + +## Implementation Strategy + +### 1. Define the `Explorable` Protocol + +Create a new file: `superset/explorables/base.py` + +```python +""" +Base protocol for explorable data sources in Superset. + +An "explorable" is any data source that can be explored to create charts, +including SQL datasets, saved queries, and semantic layer views. +""" +from __future__ import annotations + +from typing import Any, Protocol, runtime_checkable + +from superset.models.helpers import QueryResult + + +@runtime_checkable +class Explorable(Protocol): + """ + Protocol for objects that can be explored to create charts. + + This protocol is implemented by: + - BaseDatasource (SQL datasets and queries) + - SemanticView (semantic layer views) + - Future: GraphQL endpoints, REST APIs, etc. + """ + + # ========================================================================= + # Core Query Interface + # ========================================================================= + + def query(self, query_obj: dict[str, Any]) -> QueryResult: + """ + Execute a query and return results as a DataFrame. + + :param query_obj: Dictionary describing the query (columns, metrics, filters, etc.) + :return: QueryResult containing DataFrame and metadata + """ + ... + + def get_query_str(self, query_obj: dict[str, Any]) -> str: + """ + Get the query string without executing. + + Used for display in the UI and debugging. 
+ + :param query_obj: Dictionary describing the query + :return: String representation of the query (SQL, GraphQL, etc.) + """ + ... + + # ========================================================================= + # Metadata Interface + # ========================================================================= + + @property + def uid(self) -> str: + """ + Unique identifier for this explorable. + + Used for caching and security checks. + Format: "{type}_{id}" (e.g., "table_123", "semantic_view_456") + """ + ... + + @property + def type(self) -> str: + """ + Type discriminator for this explorable. + + Examples: 'table', 'query', 'semantic_view', 'cube_view' + """ + ... + + @property + def column_names(self) -> list[str]: + """ + List of available column names. + + Used for validation and autocomplete. + """ + ... + + @property + def columns(self) -> list[Any]: + """ + List of column metadata objects. + + Each object should have at minimum: + - column_name: str + - type: str (data type) + - is_dttm: bool (whether it's a datetime column) + """ + ... + + def get_column(self, column_name: str) -> Any: + """ + Get metadata for a specific column. + + :param column_name: Name of the column + :return: Column metadata object or None if not found + """ + ... + + @property + def data(self) -> dict[str, Any]: + """ + Additional metadata about this explorable. + + May include: + - verbose_map: Dict mapping column names to verbose names + - description: Human-readable description + - etc. + """ + ... + + # ========================================================================= + # Caching Interface + # ========================================================================= + + @property + def cache_timeout(self) -> int | None: + """ + Default cache timeout in seconds. + + Returns None to use system default. + """ + ... + + def get_extra_cache_keys(self, query_obj: dict[str, Any]) -> list[Any]: + """ + Additional cache key components specific to this explorable. 
+ + Used to ensure cache invalidation when the explorable's + underlying data or configuration changes. + + :param query_obj: The query being executed + :return: List of additional values to include in cache key + """ + ... + + @property + def changed_on(self) -> Any: + """ + Last modification timestamp. + + Used for cache invalidation. + """ + ... + + # ========================================================================= + # Time/Date Handling + # ========================================================================= + + @property + def offset(self) -> int: + """ + Timezone offset for datetime columns. + + Used to normalize datetime values to user's timezone. + Returns 0 for UTC. + """ + ... + + # ========================================================================= + # Security Interface + # ========================================================================= + + @property + def perm(self) -> str: + """ + Permission string for this explorable. + + Used by security manager to check access. + Format depends on type (e.g., "[database].[schema].[table]") + """ + ... + + @property + def schema_perm(self) -> str | None: + """ + Schema-level permission string. + + Optional; used for schema-level access control. + """ + ... +``` + +### 2. Key Files to Update + +#### A. QueryContext Type Annotation + +**File**: `superset/common/query_context.py:41` + +```python +from __future__ import annotations + +from superset.explorables.base import Explorable + +class QueryContext: + """ + The query context contains the query object and additional fields necessary + to retrieve the data payload for a given viz. 
+ """ + + cache_type: ClassVar[str] = "df" + enforce_numerical_metrics: ClassVar[bool] = True + + datasource: Explorable # ← Changed from BaseDatasource + slice_: Slice | None = None + queries: list[QueryObject] + form_data: dict[str, Any] | None + result_type: ChartDataResultType + result_format: ChartDataResultFormat + force: bool + custom_cache_timeout: int | None + + cache_values: dict[str, Any] + + _processor: QueryContextProcessor + + def __init__( + self, + *, + datasource: Explorable, # ← Changed from BaseDatasource + queries: list[QueryObject], + slice_: Slice | None, + form_data: dict[str, Any] | None, + result_type: ChartDataResultType, + result_format: ChartDataResultFormat, + force: bool = False, + custom_cache_timeout: int | None = None, + cache_values: dict[str, Any], + ) -> None: + self.datasource = datasource + self.slice_ = slice_ + self.result_type = result_type + self.result_format = result_format + self.queries = queries + self.form_data = form_data + self.force = force + self.custom_cache_timeout = custom_cache_timeout + self.cache_values = cache_values + self._processor = QueryContextProcessor(self) +``` + +#### B. QueryContextProcessor Type Annotation + +**File**: `superset/common/query_context_processor.py:112` + +```python +from superset.explorables.base import Explorable + +class QueryContextProcessor: + """ + The query context contains the query object and additional fields necessary + to retrieve the data payload for a given viz. + """ + + _query_context: QueryContext + _qc_datasource: Explorable # ← Changed from BaseDatasource + + def __init__(self, query_context: QueryContext): + self._query_context = query_context + self._qc_datasource = query_context.datasource +``` + +#### C. 
ChartDataQueryContextSchema + +**File**: `superset/charts/schemas.py:1384` + +This is where you'll load the appropriate explorable based on datasource type: + +```python +from superset.common.query_context_factory import QueryContextFactory +from superset.explorables.loaders import load_explorable + +class ChartDataQueryContextSchema(Schema): + query_context_factory: QueryContextFactory | None = None + datasource = fields.Nested(ChartDataDatasourceSchema) + queries = fields.List(fields.Nested(ChartDataQueryObjectSchema)) + custom_cache_timeout = fields.Integer( + metadata={"description": "Override the default cache timeout"}, + required=False, + allow_none=True, + ) + + force = fields.Boolean( + metadata={ + "description": "Should the queries be forced to load from the source. " + "Default: `false`" + }, + allow_none=True, + ) + + result_type = fields.Enum(ChartDataResultType, by_value=True) + result_format = fields.Enum(ChartDataResultFormat, by_value=True) + + form_data = fields.Raw(allow_none=True, required=False) + + @post_load + def make_query_context(self, data: dict[str, Any], **kwargs: Any) -> QueryContext: + # Load the appropriate explorable based on datasource type + datasource_info = data.get("datasource", {}) + explorable = load_explorable(datasource_info) + + data["datasource"] = explorable + query_context = self.get_query_context_factory().create(**data) + return query_context + + def get_query_context_factory(self) -> QueryContextFactory: + if self.query_context_factory is None: + from superset.common.query_context_factory import QueryContextFactory + self.query_context_factory = QueryContextFactory() + return self.query_context_factory +``` + +#### D. Explorable Loader + +Create a new file: `superset/explorables/loaders.py` + +```python +""" +Utilities for loading explorables from datasource specifications. 
+""" +from __future__ import annotations + +from typing import Any + +from flask_babel import gettext as _ + +from superset.daos.exceptions import DatasourceNotFound +from superset.explorables.base import Explorable +from superset.utils.core import DatasourceType + + +def load_explorable(datasource_spec: dict[str, Any]) -> Explorable: + """ + Load an explorable from a datasource specification. + + :param datasource_spec: Dictionary with 'id' and 'type' keys + :return: An Explorable instance (BaseDatasource or SemanticView) + :raises DatasourceNotFound: If the explorable doesn't exist + """ + datasource_id = datasource_spec.get("id") + datasource_type = datasource_spec.get("type") + + if not datasource_id or not datasource_type: + raise DatasourceNotFound( + _("Datasource specification must include 'id' and 'type'") + ) + + # Handle semantic views + if datasource_type == "semantic_view": + return _load_semantic_view(datasource_id) + + # Handle traditional datasets/queries + if datasource_type in (DatasourceType.TABLE.value, DatasourceType.QUERY.value): + return _load_dataset(datasource_id, datasource_type) + + raise DatasourceNotFound( + _("Unknown datasource type: %(type)s", type=datasource_type) + ) + + +def _load_dataset(datasource_id: int | str, datasource_type: str) -> Explorable: + """ + Load a traditional SQL dataset or query. + + Uses existing Superset logic. 
+ """ + from superset.connectors.sqla.models import SqlaTable + from superset.daos.datasource import DatasourceDAO + from superset.models.sql_lab import Query + + # Convert string ID to int if needed + if isinstance(datasource_id, str): + # Handle UUID format if needed + try: + datasource_id = int(datasource_id) + except ValueError: + # Might be a UUID, let the DAO handle it + pass + + datasource = DatasourceDAO.get_datasource( + datasource_type=datasource_type, + datasource_id=datasource_id, + ) + + if not datasource: + raise DatasourceNotFound( + _( + "Datasource %(type)s:%(id)s not found", + type=datasource_type, + id=datasource_id, + ) + ) + + return datasource + + +def _load_semantic_view(view_id: int | str) -> Explorable: + """ + Load a semantic layer view. + + :param view_id: ID of the semantic view + :return: SemanticView instance + """ + from superset.semantic_layers.dao import SemanticViewDAO + + # Convert string ID to int if needed + if isinstance(view_id, str): + try: + view_id = int(view_id) + except ValueError: + # Might be a UUID + pass + + view = SemanticViewDAO.find_by_id(view_id) + + if not view: + raise DatasourceNotFound( + _("Semantic view %(id)s not found", id=view_id) + ) + + return view +``` + +### 3. Semantic Layer Implementation + +#### A. Database Models + +Create a new file: `superset/semantic_layers/models.py` + +```python +""" +SQLAlchemy models for semantic layers. + +Semantic layers are siblings to databases, providing access to +pre-defined metrics and dimensions from external semantic layer tools +like Cube.js, dbt metrics, Looker, etc. 
+""" +from __future__ import annotations + +import json +import logging +from typing import Any + +import pandas as pd +from flask_appbuilder import Model +from sqlalchemy import Column, ForeignKey, Integer, String, Text +from sqlalchemy.orm import relationship + +from superset.explorables.base import Explorable +from superset.models.helpers import AuditMixinNullable, ImportExportMixin, QueryResult + +logger = logging.getLogger(__name__) + + +class SemanticLayer(Model, AuditMixinNullable, ImportExportMixin): + """ + Connection to a semantic layer (sibling to Database). + + Examples: + - Cube.js deployment + - dbt Cloud with metrics + - Looker instance + - AtScale cube + """ + + __tablename__ = "semantic_layers" + + id = Column(Integer, primary_key=True) + name = Column(String(255), nullable=False, unique=True) + description = Column(Text) + + # Type of semantic layer + connection_type = Column(String(50), nullable=False) + # Options: 'cube', 'dbt', 'looker', 'atscale', etc. + + # JSON configuration for connecting to the semantic layer + connection_params = Column(Text, nullable=False) + # Example for Cube.js: + # { + # "url": "https://my-cube.cloud", + # "api_secret": "...", + # "api_token": "..." + # } + + # Cache configuration + cache_timeout = Column(Integer) + + # Relationships + views = relationship( + "SemanticView", + back_populates="semantic_layer", + cascade="all, delete-orphan", + ) + + def __repr__(self) -> str: + return f"<SemanticLayer {self.name}>" + + @property + def connection_config(self) -> dict[str, Any]: + """Parse connection params as dictionary.""" + try: + return json.loads(self.connection_params or "{}") + except json.JSONDecodeError: + logger.error(f"Invalid JSON in connection_params for {self.name}") + return {} + + def get_client(self) -> Any: + """ + Get a client for this semantic layer. + + Returns the appropriate client based on connection_type: + - 'cube' -> CubeClient + - 'dbt' -> DbtClient + - etc. 
+ """ + from superset.semantic_layers.clients import get_client_for_type + + return get_client_for_type( + connection_type=self.connection_type, + config=self.connection_config, + ) + + +class SemanticView(Model, AuditMixinNullable, ImportExportMixin, Explorable): + """ + A view (cube, metric group, etc.) in a semantic layer. + + This is the semantic layer equivalent of a Dataset. + It can be explored to create charts. + """ + + __tablename__ = "semantic_views" + + id = Column(Integer, primary_key=True) + name = Column(String(255), nullable=False) + description = Column(Text) + + # Foreign key to parent semantic layer + semantic_layer_id = Column(Integer, ForeignKey("semantic_layers.id"), nullable=False) + + # External identifier in the semantic layer system + # (e.g., cube name in Cube.js, model name in dbt) + external_id = Column(String(500), nullable=False) + + # Cached metadata from semantic layer + # Updated periodically via background job + metadata_cache = Column(Text) + # Example structure: + # { + # "dimensions": [ + # {"name": "user_id", "type": "number", "title": "User ID"}, + # {"name": "created_at", "type": "time", "title": "Created At"} + # ], + # "measures": [ + # {"name": "count", "type": "count", "title": "Count"}, + # {"name": "total_revenue", "type": "sum", "title": "Total Revenue"} + # ] + # } + + # Cache configuration (overrides semantic layer default) + cache_timeout = Column(Integer) + + # Relationships + semantic_layer = relationship("SemanticLayer", back_populates="views") + + def __repr__(self) -> str: + return f"<SemanticView {self.name}>" + + # ========================================================================= + # Explorable Protocol Implementation + # ========================================================================= + + def query(self, query_obj: dict[str, Any]) -> QueryResult: + """ + Execute a query against the semantic layer and return results. + + This method: + 1. 
Translates the Superset QueryObject to semantic layer query format + 2. Executes the query via the semantic layer client + 3. Converts the response to a pandas DataFrame + 4. Returns a QueryResult + """ + from superset.semantic_layers.translator import translate_query + + # Get the semantic layer client + client = self.semantic_layer.get_client() + + # Translate Superset query to semantic layer query + semantic_query = translate_query( + query_obj=query_obj, + view=self, + connection_type=self.semantic_layer.connection_type, + ) + + # Execute the query + try: + response = client.execute_query( + view_id=self.external_id, + query=semantic_query, + ) + + # Convert to DataFrame + df = self._response_to_dataframe(response) + + # Get query string for display + query_str = client.get_query_string( + view_id=self.external_id, + query=semantic_query, + ) + + result = QueryResult( + df=df, + query=query_str, + from_dttm=query_obj.get("from_dttm"), + to_dttm=query_obj.get("to_dttm"), + ) + + return result + + except Exception as ex: + logger.exception(f"Error querying semantic view {self.name}") + + # Return error result + result = QueryResult( + df=pd.DataFrame(), + query="", + error=str(ex), + ) + return result + + def get_query_str(self, query_obj: dict[str, Any]) -> str: + """Get the query string without executing.""" + from superset.semantic_layers.translator import translate_query + + client = self.semantic_layer.get_client() + + semantic_query = translate_query( + query_obj=query_obj, + view=self, + connection_type=self.semantic_layer.connection_type, + ) + + return client.get_query_string( + view_id=self.external_id, + query=semantic_query, + ) + + @property + def uid(self) -> str: + """Unique identifier for caching.""" + return f"semantic_view_{self.id}" + + @property + def type(self) -> str: + """Type discriminator.""" + return "semantic_view" + + @property + def column_names(self) -> list[str]: + """List of available dimensions and measures.""" + metadata = 
self._get_metadata()

        dimensions = [d["name"] for d in metadata.get("dimensions", [])]
        measures = [m["name"] for m in metadata.get("measures", [])]

        return dimensions + measures

    @property
    def columns(self) -> list[dict[str, Any]]:
        """Column metadata objects."""
        metadata = self._get_metadata()

        columns = []

        # Add dimensions
        for dim in metadata.get("dimensions", []):
            columns.append({
                "column_name": dim["name"],
                "verbose_name": dim.get("title", dim["name"]),
                "type": dim.get("type", "STRING"),
                "is_dttm": dim.get("type") == "time",
            })

        # Add measures
        for measure in metadata.get("measures", []):
            columns.append({
                "column_name": measure["name"],
                "verbose_name": measure.get("title", measure["name"]),
                "type": measure.get("type", "NUMBER"),
                "is_dttm": False,
            })

        return columns

    def get_column(self, column_name: str) -> dict[str, Any] | None:
        """Get metadata for a specific column."""
        for col in self.columns:
            if col["column_name"] == column_name:
                return col
        return None

    @property
    def data(self) -> dict[str, Any]:
        """Additional metadata."""
        metadata = self._get_metadata()

        # Create verbose map
        verbose_map = {}
        for col in self.columns:
            verbose_map[col["column_name"]] = col["verbose_name"]

        return {
            "verbose_map": verbose_map,
            "description": self.description,
            "metadata": metadata,
        }

    @property
    def cache_timeout(self) -> int | None:
        """Effective cache timeout in seconds."""
        # Reads the view-level override column (assumed to be named
        # `cache_timeout_override` to avoid shadowing this property),
        # falling back to the parent layer's default. Note that
        # `self.cache_timeout` here would recurse into this property.
        return self.cache_timeout_override or self.semantic_layer.cache_timeout

    def get_extra_cache_keys(self, query_obj: dict[str, Any]) -> list[Any]:
        """Additional cache key components."""
        # Include the metadata version in cache key
        # so cache invalidates when semantic layer schema changes
        metadata = self._get_metadata()
        metadata_version = metadata.get("version", "v1")

        return [metadata_version]

    @property
    def offset(self) -> int:
        """Timezone offset (default to UTC)."""
        return 0

    @property
    def 
perm(self) -> str: + """Permission string.""" + return f"[{self.semantic_layer.name}].[{self.name}]" + + @property + def schema_perm(self) -> str | None: + """Schema-level permission.""" + return f"[{self.semantic_layer.name}]" + + # ========================================================================= + # Helper Methods + # ========================================================================= + + def _get_metadata(self) -> dict[str, Any]: + """Get cached metadata, parsing from JSON.""" + try: + return json.loads(self.metadata_cache or "{}") + except json.JSONDecodeError: + logger.error(f"Invalid metadata cache for {self.name}") + return {} + + def _response_to_dataframe(self, response: Any) -> pd.DataFrame: + """ + Convert semantic layer response to pandas DataFrame. + + Format depends on the semantic layer type. + """ + # This will be implemented based on the specific + # semantic layer's response format + + if isinstance(response, pd.DataFrame): + return response + + if isinstance(response, dict) and "data" in response: + return pd.DataFrame(response["data"]) + + if isinstance(response, list): + return pd.DataFrame(response) + + logger.warning(f"Unexpected response format: {type(response)}") + return pd.DataFrame() + + def refresh_metadata(self) -> None: + """ + Refresh cached metadata from the semantic layer. + + Should be called periodically via background job. + """ + client = self.semantic_layer.get_client() + + try: + metadata = client.get_view_metadata(self.external_id) + self.metadata_cache = json.dumps(metadata) + + except Exception as ex: + logger.exception(f"Error refreshing metadata for {self.name}") + raise +``` + +#### B. Data Access Object (DAO) + +Create a new file: `superset/semantic_layers/dao.py` + +```python +""" +Data Access Object for semantic layers. 
+""" +from __future__ import annotations + +from typing import Any + +from superset.daos.base import BaseDAO +from superset.semantic_layers.models import SemanticLayer, SemanticView + + +class SemanticLayerDAO(BaseDAO): + model_cls = SemanticLayer + + +class SemanticViewDAO(BaseDAO): + model_cls = SemanticView + + @classmethod + def find_by_semantic_layer( + cls, semantic_layer_id: int + ) -> list[SemanticView]: + """Find all views for a given semantic layer.""" + return ( + cls.get_session() + .query(SemanticView) + .filter(SemanticView.semantic_layer_id == semantic_layer_id) + .all() + ) +``` + +#### C. Query Translator + +Create a new file: `superset/semantic_layers/translator.py` + +```python +""" +Translate Superset QueryObject to semantic layer query format. +""" +from __future__ import annotations + +from typing import Any, TYPE_CHECKING + +if TYPE_CHECKING: + from superset.semantic_layers.models import SemanticView + + +def translate_query( + query_obj: dict[str, Any], + view: SemanticView, + connection_type: str, +) -> dict[str, Any]: + """ + Translate Superset QueryObject to semantic layer query. + + :param query_obj: Superset query object dictionary + :param view: The semantic view being queried + :param connection_type: Type of semantic layer ('cube', 'dbt', etc.) + :return: Semantic layer query dictionary + """ + + # Route to appropriate translator based on connection type + if connection_type == "cube": + return _translate_to_cube(query_obj, view) + elif connection_type == "dbt": + return _translate_to_dbt(query_obj, view) + else: + raise ValueError(f"Unsupported connection type: {connection_type}") + + +def _translate_to_cube( + query_obj: dict[str, Any], + view: SemanticView, +) -> dict[str, Any]: + """ + Translate to Cube.js query format. 
+ + Cube.js query format: + { + "dimensions": ["User.city", "User.state"], + "measures": ["User.count"], + "timeDimensions": [{ + "dimension": "Order.createdAt", + "dateRange": ["2020-01-01", "2020-12-31"] + }], + "filters": [{ + "member": "User.country", + "operator": "equals", + "values": ["US"] + }], + "limit": 1000 + } + """ + + cube_query: dict[str, Any] = {} + + # Map columns to dimensions + columns = query_obj.get("columns", []) + if columns: + cube_query["dimensions"] = [ + f"{view.external_id}.{col}" for col in columns + ] + + # Map metrics to measures + metrics = query_obj.get("metrics", []) + if metrics: + cube_query["measures"] = [ + f"{view.external_id}.{metric}" for metric in metrics + ] + + # Map filters + filters = query_obj.get("filter", []) + if filters: + cube_query["filters"] = [ + _translate_filter_to_cube(f, view) for f in filters + ] + + # Map time range + from_dttm = query_obj.get("from_dttm") + to_dttm = query_obj.get("to_dttm") + granularity = query_obj.get("granularity") + + if from_dttm and to_dttm and granularity: + cube_query["timeDimensions"] = [{ + "dimension": f"{view.external_id}.{granularity}", + "dateRange": [ + from_dttm.isoformat(), + to_dttm.isoformat(), + ] + }] + + # Map limit + row_limit = query_obj.get("row_limit") + if row_limit: + cube_query["limit"] = row_limit + + return cube_query + + +def _translate_filter_to_cube( + filter_obj: dict[str, Any], + view: SemanticView, +) -> dict[str, Any]: + """Translate a single filter to Cube.js format.""" + + # Map Superset operators to Cube.js operators + operator_map = { + "==": "equals", + "!=": "notEquals", + ">": "gt", + "<": "lt", + ">=": "gte", + "<=": "lte", + "IN": "equals", # Cube uses 'equals' with array values + "NOT IN": "notEquals", + "LIKE": "contains", + } + + col = filter_obj.get("col") + op = filter_obj.get("op") + val = filter_obj.get("val") + + cube_op = operator_map.get(op, "equals") + + return { + "member": f"{view.external_id}.{col}", + "operator": cube_op, + 
"values": [val] if not isinstance(val, list) else val, + } + + +def _translate_to_dbt( + query_obj: dict[str, Any], + view: SemanticView, +) -> dict[str, Any]: + """ + Translate to dbt metrics query format. + + (Implementation depends on dbt Semantic Layer API) + """ + # TODO: Implement based on dbt Semantic Layer spec + raise NotImplementedError("dbt translation not yet implemented") +``` + +### 4. Security Integration + +Update `superset/security/manager.py` to handle semantic view permissions: + +```python +def raise_for_access( + self, + *, + database: Database | None = None, + datasource: BaseDatasource | None = None, + query: Query | None = None, + query_context: QueryContext | None = None, + table: Table | None = None, + viz: BaseViz | None = None, +) -> None: + """ + Raise an exception if the user cannot access the resource. + + Updated to handle semantic views. + """ + + # Handle QueryContext (may contain semantic view) + if query_context: + datasource = query_context.datasource + + # Check if this is a semantic view + if datasource.type == "semantic_view": + self._raise_for_semantic_view_access(datasource) + return + + # ... existing logic for other cases + + +def _raise_for_semantic_view_access(self, semantic_view: Any) -> None: + """Check access for semantic views.""" + + # Check if user has permission to access this semantic view + if not self.can_access("datasource_access", semantic_view.perm): + raise SupersetSecurityException( + SupersetError( + error_type=SupersetErrorType.DATASOURCE_SECURITY_ACCESS_ERROR, + message=_("You don't have access to this semantic view"), + level=ErrorLevel.ERROR, + ) + ) +``` + +### 5. API Endpoints + +Create REST API endpoints for CRUD operations on semantic layers: + +**File**: `superset/semantic_layers/api.py` + +```python +""" +REST API for semantic layers. 
+""" +from flask_appbuilder.api import expose, protect, safe + +from superset.semantic_layers.dao import SemanticLayerDAO, SemanticViewDAO +from superset.semantic_layers.schemas import ( + SemanticLayerPostSchema, + SemanticLayerPutSchema, + SemanticViewPostSchema, +) +from superset.views.base_api import BaseSupersetModelRestApi + + +class SemanticLayerRestApi(BaseSupersetModelRestApi): + datamodel = SemanticLayerDAO + + resource_name = "semantic_layer" + allow_browser_login = True + + class_permission_name = "SemanticLayer" + + # Schemas + add_model_schema = SemanticLayerPostSchema() + edit_model_schema = SemanticLayerPutSchema() + + # ... standard CRUD endpoints + + +class SemanticViewRestApi(BaseSupersetModelRestApi): + datamodel = SemanticViewDAO + + resource_name = "semantic_view" + allow_browser_login = True + + class_permission_name = "SemanticView" + + # Schemas + add_model_schema = SemanticViewPostSchema() + + # ... standard CRUD endpoints + + @expose("/<int:pk>/refresh_metadata", methods=["POST"]) + @protect() + @safe + def refresh_metadata(self, pk: int): + """Refresh metadata cache for a semantic view.""" + view = self.datamodel.get(pk, self._base_filters) + if not view: + return self.response_404() + + try: + view.refresh_metadata() + return self.response(200, message="Metadata refreshed") + except Exception as ex: + return self.response_500(message=str(ex)) +``` + +## Files That Don't Need Changes + +Because the bifurcation happens at the right level, most code doesn't need changes: + +✅ **ChartDataCommand** - Works with QueryContext +✅ **Post-processing logic** - Works with DataFrames +✅ **Caching logic** - Uses `Explorable.uid` and `get_extra_cache_keys()` +✅ **Response formatting** - Works with DataFrames +✅ **QueryObject** - Already connector-agnostic +✅ **QueryContextProcessor.get_df_payload()** - Works with any explorable +✅ **All visualization plugins** - Work with data payloads + +## Files That Need Updates + +🔧 **Type annotations**: Change 
`BaseDatasource` → `Explorable` throughout +🔧 **Schema deserialization**: `ChartDataQueryContextSchema` needs to route to correct loader +🔧 **Security**: Add permission checks for semantic views in `SecurityManager` +🔧 **API endpoints**: New REST APIs for semantic layer CRUD +🔧 **Frontend**: New UI for creating/managing semantic layer connections + +## Migration Path + +### Phase 1: Create Protocol (No Breaking Changes) + +1. Create `superset/explorables/base.py` with `Explorable` protocol +2. Create `superset/explorables/loaders.py` with loader functions +3. Verify `BaseDatasource` implicitly satisfies the protocol +4. Add type stubs but don't enforce yet (use `# type: ignore` where needed) + +**Deliverable**: Protocol defined, no runtime changes + +### Phase 2: Implement Semantic Views + +1. Create database migrations for `semantic_layers` and `semantic_views` tables +2. Create `superset/semantic_layers/models.py` with `SemanticLayer` and `SemanticView` +3. Implement `Explorable` protocol for `SemanticView` +4. Create DAOs in `superset/semantic_layers/dao.py` +5. Create query translator in `superset/semantic_layers/translator.py` +6. Add schema deserialization logic in `ChartDataQueryContextSchema` +7. Add security/permissions for semantic views +8. Create REST API endpoints + +**Deliverable**: Semantic views work end-to-end, existing datasets unchanged + +### Phase 3: Gradual Type Migration + +1. Change `QueryContext.datasource` type from `BaseDatasource` to `Explorable` +2. Change `QueryContextProcessor._qc_datasource` type to `Explorable` +3. Update type hints in related files +4. Run `pre-commit run mypy` to catch type issues +5. Fix any type errors + +**Deliverable**: Full type safety, both paths use `Explorable` + +### Phase 4: Formalize Dataset Explorable (Optional) + +1. Make `BaseDatasource` explicitly implement `Explorable` (add to class signature) +2. Verify all protocol methods are implemented +3. Add runtime checks if needed +4. 
Update documentation + +**Deliverable**: Clean, explicit protocol implementation for all explorables + +## Key Advantages of This Approach + +1. **Minimal disruption**: The query execution path already works polymorphically +2. **Type safety**: Protocol ensures both implementations have the same interface +3. **Clean separation**: Semantic layer lives in its own module but integrates seamlessly +4. **Reuses everything**: Caching, post-processing, security, API responses all work unchanged +5. **Future-proof**: Easy to add more explorable types (GraphQL, REST APIs, etc.) +6. **No breaking changes**: Existing datasets continue to work exactly as before +7. **Testable**: Each component can be unit tested independently + +## Example: Adding a New Semantic Layer Type + +To add support for a new semantic layer (e.g., Looker): + +1. Add `'looker'` to supported connection types +2. Implement `LookerClient` in `superset/semantic_layers/clients/` +3. Add `_translate_to_looker()` in `translator.py` +4. That's it! Everything else reuses existing infrastructure + +## Testing Strategy + +### Unit Tests + +- Test `Explorable` protocol compliance for both `BaseDatasource` and `SemanticView` +- Test query translation for each semantic layer type +- Test metadata caching and refresh +- Test security permissions + +### Integration Tests + +- Test end-to-end chart creation using semantic views +- Test caching behavior +- Test error handling when semantic layer is unavailable +- Test post-processing operations + +### E2E Tests (Playwright) + +- Test UI for creating semantic layer connections +- Test chart creation from semantic views +- Test exploration workflow + +## Future Enhancements + +Once the `Explorable` protocol is established, it opens up many possibilities: + +1. **GraphQL endpoints** as explorables +2. **REST APIs** as explorables +3. **Spreadsheet files** (Google Sheets, Excel) as explorables +4. **OLAP cubes** as explorables +5. 
**Machine learning model outputs** as explorables + +All of these would implement the same `Explorable` protocol and integrate seamlessly with the existing infrastructure.
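
To make that extension story concrete, here is a minimal, self-contained sketch of a trimmed-down `Explorable` protocol with a toy in-memory implementation. The real protocol in `superset/explorables/base.py` would carry more members (`query` returning a `QueryResult`, `perm`, `cache_timeout`, etc.); here `query` is simplified to return a DataFrame, and `InMemoryExplorable` is purely illustrative.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable

import pandas as pd


@runtime_checkable
class Explorable(Protocol):
    """Trimmed-down sketch of the protocol described in this document."""

    @property
    def uid(self) -> str: ...

    @property
    def type(self) -> str: ...

    @property
    def column_names(self) -> list[str]: ...

    def query(self, query_obj: dict[str, Any]) -> pd.DataFrame: ...


class InMemoryExplorable:
    """Toy explorable backed by an in-memory pandas DataFrame."""

    def __init__(self, name: str, df: pd.DataFrame) -> None:
        self.name = name
        self._df = df

    @property
    def uid(self) -> str:
        return f"in_memory_{self.name}"

    @property
    def type(self) -> str:
        return "in_memory"

    @property
    def column_names(self) -> list[str]:
        return list(self._df.columns)

    def query(self, query_obj: dict[str, Any]) -> pd.DataFrame:
        # Only `columns` and `row_limit` are honored in this sketch
        columns = query_obj.get("columns") or list(self._df.columns)
        limit = query_obj.get("row_limit")
        result = self._df[columns]
        return result.head(limit) if limit else result


explorable = InMemoryExplorable(
    "users", pd.DataFrame({"age": [25, 31], "city": ["NYC", "LA"]})
)
# Structural check: `runtime_checkable` lets isinstance() verify that
# all protocol members are present, without any inheritance
assert isinstance(explorable, Explorable)
```

Because `Protocol` matching is structural, `BaseDatasource` would satisfy the same check without subclassing anything, which is what makes Phase 1 non-breaking.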

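The "Schema deserialization" update flagged under *Files That Need Updates* — routing `ChartDataQueryContextSchema` to the correct loader — could take the shape of a small registry keyed on the datasource `type` discriminator. This is a hedged sketch: `register_loader`, `load_explorable`, and the string-returning toy loaders stand in for the real `DatasourceDAO`/`SemanticViewDAO` lookups.

```python
from typing import Any, Callable

# Registry of loader functions keyed by the `type` discriminator that
# appears in the serialized datasource reference ({"type": ..., "id": ...})
_LOADERS: dict[str, Callable[[int], Any]] = {}


def register_loader(kind: str) -> Callable[[Callable[[int], Any]], Callable[[int], Any]]:
    """Register a loader for one explorable type."""
    def decorator(fn: Callable[[int], Any]) -> Callable[[int], Any]:
        _LOADERS[kind] = fn
        return fn
    return decorator


@register_loader("table")
def load_dataset(pk: int) -> str:
    return f"dataset:{pk}"  # stand-in for a DatasourceDAO lookup


@register_loader("semantic_view")
def load_semantic_view(pk: int) -> str:
    return f"semantic_view:{pk}"  # stand-in for a SemanticViewDAO lookup


def load_explorable(ref: dict[str, Any]) -> Any:
    """Dispatch on the serialized datasource reference's `type` field."""
    try:
        loader = _LOADERS[ref["type"]]
    except KeyError as exc:
        raise ValueError(f"Unknown explorable type: {ref['type']}") from exc
    return loader(ref["id"])


assert load_explorable({"type": "semantic_view", "id": 7}) == "semantic_view:7"
```

A registry like this keeps the schema layer closed to modification: adding a new explorable type (Looker, GraphQL, etc.) means registering one more loader rather than editing the dispatch logic.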