shivamgoel commented on issue #37535:
URL: https://github.com/apache/superset/issues/37535#issuecomment-4115843954
## [SIP-199] Updated to incorporate SIP-182 (Semantic Layer support) changes
### Motivation
As Apache Superset moves toward a "Library-First" architecture (per SIP-187)
and seeks to centralize analytical logic (per SIP-185), the current API surface
remains heavily tied to the Visualization (Chart) layer.
Currently, external services or AI agents wishing to query a Dataset must
either:
1. Reference a slice_id (Chart), which forces a dependency on UI state.
2. Construct a complex query_context designed for frontend state management.
Additionally, the ongoing Semantic Layer work (SIP-182, PRs #37815-#38396)
introduces a new `SemanticView` datasource type with its own query model
(`SemanticQuery`) and mapper infrastructure. This creates an opportunity -- and
a risk. If SIP-199 is implemented in isolation for datasets only, Superset will
end up with two parallel query APIs for two datasource types. Instead, we
should build a single, unified headless query API that works across all
`Explorable` datasource types.
We need a Datasource-centric API that allows services to fetch data using
semantic definitions (metrics/dimensions) of any datasource -- whether a
traditional Dataset or a Semantic View -- treating Superset as a headless
semantic layer.
### Relationship to Semantic Layer Work
The semantic layer PRs (#37815-#38396) introduce:
* **`SemanticQuery`** -- a simplified query type with `metrics`,
`dimensions`, `filters`, `order`, `limit`, `offset`
* **`SemanticView`** -- a new datasource type implementing the `Explorable`
protocol
* **Mapper** (`superset/semantic_layers/mapper.py`) -- translates between
chart-oriented `QueryObject` and the simplified `SemanticQuery`
* **`Explorable` protocol** (`superset/explorables/base.py`) -- a shared
interface that both `BaseDatasource` (datasets) and `SemanticView` implement
SIP-199's simplified query payload is nearly identical in structure to
`SemanticQuery`. Rather than creating a parallel query model, this SIP adopts
`SemanticQuery` conventions as the unified query vocabulary for the headless
API.
### Proposed Change
I propose a new REST endpoint that works with any Explorable datasource:
**Primary endpoint:**
```
POST /api/v1/datasource/{type}/{id}/query
```
Where `type` is `table` (for datasets) or `semantic_view` (for semantic
views), matching the existing `DatasourceType` enum.
**Convenience aliases:**
```
POST /api/v1/dataset/{id}/query -> resolves to type=table
POST /api/v1/semantic_view/{id}/query -> resolves to type=semantic_view
```
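As a rough illustration of how the aliases could collapse onto the primary route, the sketch below resolves any of the three paths to a `(type, id)` pair. The regular expressions, the `ALIAS_TYPES` table, and `resolve_query_path` are hypothetical names for this example only; the actual routing would presumably live in Flask-AppBuilder API classes.

```python
import re

# Hypothetical alias table: URL prefix -> DatasourceType value.
ALIAS_TYPES = {"dataset": "table", "semantic_view": "semantic_view"}

_PRIMARY = re.compile(
    r"^/api/v1/datasource/(?P<type>table|semantic_view)/(?P<id>\d+)/query$"
)
_ALIAS = re.compile(r"^/api/v1/(?P<alias>dataset|semantic_view)/(?P<id>\d+)/query$")


def resolve_query_path(path: str) -> tuple:
    """Return (datasource_type, datasource_id) for a primary or alias path."""
    match = _PRIMARY.match(path)
    if match:
        return match["type"], int(match["id"])
    match = _ALIAS.match(path)
    if match:
        return ALIAS_TYPES[match["alias"]], int(match["id"])
    raise ValueError(f"Not a datasource query path: {path}")
```

Under this sketch, `/api/v1/dataset/42/query` and `/api/v1/datasource/table/42/query` resolve to the same pair, so both can share one handler.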
This endpoint will accept a simplified JSON payload and execute it via a new
`DatasourceQueryCommand`. This command will dispatch to the appropriate
execution path based on datasource type:
* **For datasets (`table`):** Converts the simplified payload to a
`QueryObject` via `QueryObjectFactory`, then executes through the existing SQL
pipeline.
* **For semantic views (`semantic_view`):** The payload maps directly to
`SemanticQuery` and passes through the semantic layer provider (Snowflake, dbt,
etc.).
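The dispatch step might look roughly like the following. This is a minimal sketch, not the proposed implementation: `SimplifiedQuery`, the two `run_*` functions, and `execute_datasource_query` are hypothetical stand-ins, and the executors return placeholder dictionaries instead of calling `QueryObjectFactory` or a semantic layer provider.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SimplifiedQuery:
    """Hypothetical in-memory form of the simplified payload."""
    dimensions: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    filters: List[Dict[str, Any]] = field(default_factory=list)
    limit: Optional[int] = None


def run_dataset_query(datasource_id: int, query: SimplifiedQuery) -> Dict[str, Any]:
    # Placeholder for: payload -> QueryObjectFactory -> QueryObject -> SQL pipeline.
    return {"path": "sql", "datasource_id": datasource_id, "metrics": query.metrics}


def run_semantic_view_query(datasource_id: int, query: SimplifiedQuery) -> Dict[str, Any]:
    # Placeholder for: payload -> SemanticQuery -> semantic layer provider.
    return {"path": "semantic_layer", "datasource_id": datasource_id, "metrics": query.metrics}


# One executor per datasource type; adding a type means adding one entry.
EXECUTORS: Dict[str, Callable[[int, SimplifiedQuery], Dict[str, Any]]] = {
    "table": run_dataset_query,
    "semantic_view": run_semantic_view_query,
}


def execute_datasource_query(
    datasource_type: str, datasource_id: int, query: SimplifiedQuery
) -> Dict[str, Any]:
    """Pick the execution path from the datasource type and run the query."""
    try:
        executor = EXECUTORS[datasource_type]
    except KeyError:
        raise ValueError(f"Unsupported datasource type: {datasource_type}") from None
    return executor(datasource_id, query)
```

Keeping the dispatch in a plain lookup table, with no web-framework imports, is one way to keep the command usable outside a request context.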
**Key architectural principles:**
* **Library-First:** The core `DatasourceQueryCommand` interface will reside
in `superset-core` with no dependency on the Flask request context, making it
usable by the SIP-187 MCP service and extensions.
* **Shared Logic:** For datasets, the command utilizes `QueryObjectFactory`
to satisfy SIP-185 calculation parity. For semantic views, it uses the existing
mapper and provider infrastructure.
* **Unified Query Model:** Adopts semantic layer terminology (`dimensions`
instead of `columns`) as the standard query vocabulary across all datasource
types.
* **Schema Validation:** Reuses the semantic layer mapper's validation logic
(metric/dimension name checking against the datasource's definitions) to
provide helpful error messages.
* **Capability Awareness:** The API reports what features a datasource
supports, since not all datasources support all query features (e.g.,
`GROUP_LIMIT`, `ADHOC_EXPRESSIONS_IN_ORDERBY`).
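To make the schema validation principle concrete, here is a small sketch of name checking with a suggestion in the error message. It does not reproduce the semantic layer mapper's actual logic; `validate_names` and its message wording are assumptions for illustration, built on the standard library's `difflib`.

```python
import difflib
from typing import Iterable, List, Set


def validate_names(requested: Iterable[str], available: Set[str], kind: str) -> List[str]:
    """Return one error message per requested name missing from the datasource."""
    errors = []
    for name in requested:
        if name in available:
            continue
        # Suggest the closest defined name, if any is reasonably similar.
        close = difflib.get_close_matches(name, sorted(available), n=1)
        hint = f" Did you mean '{close[0]}'?" if close else ""
        errors.append(f"Unknown {kind} '{name}'.{hint}")
    return errors
```

A caller would run this once for `dimensions` and once for `metrics` before dispatching, and return the collected messages in a 4xx response if the list is non-empty.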
### New or Changed Public Interfaces
#### Query Endpoint
**Endpoint:** `POST /api/v1/datasource/{type}/{id}/query`
**Request Payload:**
```json
{
  "dimensions": ["region", "product_category"],
  "metrics": ["sum__sales", "unique_users"],
  "filters": [
    {"col": "order_date", "op": "TEMPORAL_RANGE", "val": "Last 7 days"}
  ],
  "order": [{"column": "sum__sales", "descending": true}],
  "limit": 100,
  "offset": 0,
  "result_format": "json",
  "time_grain": "P1D"
}
```
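From a client's point of view, calling the endpoint could look something like the sketch below, which only builds the request and does not send it, since the endpoint is proposed and not yet available. The host name, the `build_query_request` helper, and the bearer-token header are assumptions for this example.

```python
import json
import urllib.request

# Hypothetical host; replace with a real Superset base URL.
BASE_URL = "https://superset.example.com"


def build_query_request(datasource_type, datasource_id, payload, token):
    """Build (but do not send) a POST request for the proposed query endpoint."""
    url = f"{BASE_URL}/api/v1/datasource/{datasource_type}/{datasource_id}/query"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


payload = {
    "dimensions": ["region", "product_category"],
    "metrics": ["sum__sales", "unique_users"],
    "filters": [{"col": "order_date", "op": "TEMPORAL_RANGE", "val": "Last 7 days"}],
    "order": [{"column": "sum__sales", "descending": True}],
    "limit": 100,
    "offset": 0,
    "result_format": "json",
    "time_grain": "P1D",
}
request = build_query_request("table", 42, payload, token="<access token>")
# urllib.request.urlopen(request) would send it once the endpoint exists.
```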
Key terminology changes from the original proposal:
* `columns` -> `dimensions` (aligns with semantic layer convention;
"columns" is ambiguous)
* `series_limit` -> `limit` / `offset` (simpler, standard pagination)
* `order_desc` -> `order` array (more flexible, supports multi-column
ordering)
* Added `time_grain` using ISO 8601 duration format (e.g., `P1D` for day,
`PT1H` for hour), consistent with the `Grain` type in `superset-core`
**Filter format:**
Filters follow the same structure used in both Superset's existing filter
system and the semantic layer, ensuring compatibility:
```json
{
  "col": "column_name",
  "op": "EQUALS | NOT_EQUALS | IN | NOT_IN | GREATER_THAN | LESS_THAN | TEMPORAL_RANGE | IS_NULL | IS_NOT_NULL",
  "val": "value or [values]"
}
```
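A filter in this shape can be checked before execution. The sketch below uses the operator names listed above; the rules that `IN`/`NOT_IN` take a list and that `IS_NULL`/`IS_NOT_NULL` need no `val` are assumptions not stated in the proposal, and `check_filter` is a hypothetical helper.

```python
from typing import Any, Dict, List

ALLOWED_OPS = {
    "EQUALS", "NOT_EQUALS", "IN", "NOT_IN", "GREATER_THAN",
    "LESS_THAN", "TEMPORAL_RANGE", "IS_NULL", "IS_NOT_NULL",
}
LIST_OPS = {"IN", "NOT_IN"}               # assumed to require a list value
NO_VALUE_OPS = {"IS_NULL", "IS_NOT_NULL"}  # assumed to ignore "val"


def check_filter(flt: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a single filter object."""
    problems = []
    col = flt.get("col")
    if not isinstance(col, str) or not col:
        problems.append("'col' must be a non-empty string")
    op = flt.get("op")
    if op not in ALLOWED_OPS:
        problems.append(f"unsupported op: {op!r}")
        return problems
    if op in NO_VALUE_OPS:
        return problems
    if "val" not in flt:
        problems.append(f"op {op} requires 'val'")
    elif op in LIST_OPS and not isinstance(flt["val"], list):
        problems.append(f"op {op} requires a list in 'val'")
    return problems
```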
**Response Payload:**
```json
{
  "result": {
    "data": [...],
    "columns": [
      {"name": "region", "type": "STRING", "is_dimension": true},
      {"name": "sum__sales", "type": "NUMERIC", "is_dimension": false}
    ],
    "row_count": 100,
    "query_id": "...",
    "cached": false,
    "cache_timeout": 300
  },
  "datasource": {
    "type": "table",
    "id": 42,
    "name": "sales_dataset"
  },
  "capabilities": {
    "supports_group_limit": true,
    "supports_adhoc_orderby": true,
    "supports_rls": true,
    "query_language": "sql"
  }
}
```
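The `is_dimension` flag lets a consumer separate grouping keys from measured values without knowing the datasource in advance. The sketch below assumes `data` is a list of row objects keyed by column name (the default `json` result format); `split_columns` and `index_rows` are hypothetical helpers, and the sample rows are made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def split_columns(response: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (dimension names, metric names) from the response column metadata."""
    columns = response["result"]["columns"]
    dims = [c["name"] for c in columns if c["is_dimension"]]
    metrics = [c["name"] for c in columns if not c["is_dimension"]]
    return dims, metrics


def index_rows(response: Dict[str, Any]) -> Dict[tuple, Dict[str, Any]]:
    """Key each row's metric values by its tuple of dimension values."""
    dims, metrics = split_columns(response)
    return {
        tuple(row[d] for d in dims): {m: row[m] for m in metrics}
        for row in response["result"]["data"]
    }


# Illustrative response fragment; the row values are invented.
sample_response = {
    "result": {
        "data": [
            {"region": "EMEA", "sum__sales": 1200.0},
            {"region": "APAC", "sum__sales": 800.0},
        ],
        "columns": [
            {"name": "region", "type": "STRING", "is_dimension": True},
            {"name": "sum__sales", "type": "NUMERIC", "is_dimension": False},
        ],
    }
}
by_region = index_rows(sample_response)
```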
**Result formats:**
* `json` -- Default, JSON array of row objects
* `csv` -- CSV text response
* `arrow` -- Apache Arrow IPC format (high-performance; aligns with semantic
layer's internal use of PyArrow)
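For the two text formats, serialization of a unified result could be as simple as the sketch below. `serialize_rows` is a hypothetical helper using only the standard library; `arrow` is left out because it would need PyArrow, and the sketch raises for any format it does not handle.

```python
import csv
import io
import json
from typing import Any, Dict, List


def serialize_rows(rows: List[Dict[str, Any]], columns: List[str], result_format: str = "json") -> str:
    """Serialize row objects as JSON or CSV text."""
    if result_format == "json":
        return json.dumps(rows)
    if result_format == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    # "arrow" is intentionally not covered by this stdlib-only sketch.
    raise ValueError(f"Unsupported result_format in this sketch: {result_format}")
```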
#### Capabilities Endpoint
**Endpoint:** `GET /api/v1/datasource/{type}/{id}/capabilities`
Returns what query features the datasource supports, based on the
`SemanticViewFeature` enum pattern:
```json
{
  "supports_group_limit": true,
  "supports_adhoc_orderby": true,
  "supports_rls": true,
  "supports_time_comparison": true,
  "query_language": "sql",
  "compatible_metric_filtering": false
}
```
This enables clients to adapt their queries based on datasource capabilities
before executing.
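A client might use the capabilities response to check a query before sending it, along these lines. This is a sketch under assumptions: `unsupported_features` is a hypothetical helper, "ad hoc order by" is interpreted here as ordering by anything that is not one of the query's own dimensions or metrics, and `time_offsets` is the Phase 3 parameter name from this proposal.

```python
from typing import Any, Dict, List


def unsupported_features(query: Dict[str, Any], capabilities: Dict[str, Any]) -> List[str]:
    """Return the payload keys that rely on features the datasource lacks."""
    problems = []
    if query.get("time_offsets") and not capabilities.get("supports_time_comparison", False):
        problems.append("time_offsets")
    # Assumption: an order entry is "ad hoc" if it is not a requested dimension or metric.
    named = set(query.get("dimensions", [])) | set(query.get("metrics", []))
    adhoc = [o["column"] for o in query.get("order", []) if o["column"] not in named]
    if adhoc and not capabilities.get("supports_adhoc_orderby", False):
        problems.append("order")
    return problems
```

A client could drop or rewrite the flagged keys, or surface them to the user, instead of discovering the limitation from a failed query.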
### Execution Architecture
```
Client Request (simplified payload)
    |
    v
DatasourceQueryCommand (in superset-core)
    |
    +-- validates dimensions/metrics against datasource definitions
    |   (reuses mapper validation logic)
    |
    +-- dispatches based on datasource type:
        |
        +--[table]-----------> QueryObjectFactory -> QueryObject -> SQL execution
        |                      (SIP-185 calculation parity)
        |
        +--[semantic_view]---> SemanticQuery (direct mapping) -> Provider execution
        |                      (Snowflake, dbt, Cube, etc.)
        |
        v
Unified QueryResult -> Response (json / csv / arrow)
```
### New Dependencies
None. This proposal reuses existing infrastructure:
* `SemanticQuery` types from `superset-core` (PR #37815)
* `Explorable` protocol from `superset/explorables/base.py` (PR #37816)
* `DatasourceDAO.sources_dict` for datasource type dispatch (PR #37817)
* `QueryObjectFactory` for dataset query execution (existing)
* Mapper validation logic from `superset/semantic_layers/mapper.py`
(PR #37815)
### Migration Plan and Compatibility
* This is a new endpoint; no breaking changes to existing APIs.
* No database migrations required.
* The existing `POST /api/v1/chart/data` endpoint continues to work
unchanged.
* The convenience alias `POST /api/v1/dataset/{id}/query` preserves the
original SIP-199 intent for callers that only need dataset support.
* Semantic layer PRs must land first (at minimum PR #37815 for types and PR
#37816 for models/Explorable protocol).
### Implementation Phases
**Phase 1 -- Dataset Query API (MVP)**
* Implement `POST /api/v1/dataset/{id}/query` using `QueryObjectFactory`
* Accept the unified payload schema (dimensions, metrics, filters)
* Return JSON results
* Validate metric/dimension names against dataset definitions
**Phase 2 -- Unified Datasource Endpoint**
* Add `POST /api/v1/datasource/{type}/{id}/query` with dispatch logic
* Add semantic view support (once semantic layer PRs are merged)
* Add capabilities endpoint
**Phase 3 -- Advanced Formats and Features**
* Arrow result format support
* Time comparison queries (`time_offsets` parameter)
* Group limit support
* Compatible metric/dimension filtering (for semantic layers that require it)
### Rejected Alternatives
1. **Keep using the existing `POST /api/v1/chart/data` API**
Rejected because it requires chart-specific metadata and a complex
QueryContext payload unsuitable for headless consumption.
2. **Build a dataset-only endpoint with its own query model**
Rejected because the semantic layer work already defines `SemanticQuery`
as a simplified query representation. Building a parallel model creates
fragmentation. The semantic layer naming conventions (dimensions, metrics) are
industry-standard and should be adopted.
3. **Use `columns` instead of `dimensions` in the payload**
Rejected because `dimensions` is the standard semantic layer term (used
by dbt, Snowflake Cortex, Cube, Minerva, and the Superset semantic layer
implementation). Using `columns` would create a vocabulary mismatch between the
dataset query API and the semantic view query API.
4. **Separate endpoints per datasource type with different schemas**
Rejected because a unified schema and dispatch mechanism is simpler for
consumers (especially AI agents) and avoids API proliferation.