shivamgoel commented on issue #37535:
URL: https://github.com/apache/superset/issues/37535#issuecomment-4115843954

   ## [SIP-199] Updated to incorporate SIP-182 (Semantic Layer support) changes
   
   ### Motivation
   
   As Apache Superset moves toward a "Library-First" architecture (per SIP-187) 
and seeks to centralize analytical logic (per SIP-185), the current API surface 
remains heavily tied to the Visualization (Chart) layer.
   
   Currently, external services or AI agents wishing to query a Dataset must 
either:
   
   1. Reference a `slice_id` (Chart), which forces a dependency on UI state.
   2. Construct a complex `query_context` designed for frontend state management.
   
   Additionally, the ongoing Semantic Layer work (SIP-182, PRs #37815-#38396) 
introduces a new `SemanticView` datasource type with its own query model 
(`SemanticQuery`) and mapper infrastructure. This creates an opportunity -- and 
a risk. If SIP-199 is implemented in isolation for datasets only, Superset will 
end up with two parallel query APIs for two datasource types. Instead, we 
should build a single, unified headless query API that works across all 
`Explorable` datasource types.
   
   We need a Datasource-centric API that allows services to fetch data using 
semantic definitions (metrics/dimensions) of any datasource -- whether a 
traditional Dataset or a Semantic View -- treating Superset as a headless 
semantic layer.
   
   
   ### Relationship to Semantic Layer Work
   
   The semantic layer PRs (#37815-#38396) introduce:
   
   * **`SemanticQuery`** -- a simplified query type with `metrics`, 
`dimensions`, `filters`, `order`, `limit`, `offset`
   * **`SemanticView`** -- a new datasource type implementing the `Explorable` 
protocol
   * **Mapper** (`superset/semantic_layers/mapper.py`) -- translates between 
chart-oriented `QueryObject` and the simplified `SemanticQuery`
   * **`Explorable` protocol** (`superset/explorables/base.py`) -- a shared 
interface that both `BaseDatasource` (datasets) and `SemanticView` implement
   
   SIP-199's simplified query payload is nearly identical in structure to 
`SemanticQuery`. Rather than creating a parallel query model, this SIP adopts 
`SemanticQuery` conventions as the unified query vocabulary for the headless 
API.
   
   
   ### Proposed Change
   
   I propose a new REST endpoint that works with any Explorable datasource:
   
   **Primary endpoint:**
   ```
   POST /api/v1/datasource/{type}/{id}/query
   ```
   Where `type` is `table` (for datasets) or `semantic_view` (for semantic 
views), matching the existing `DatasourceType` enum.
   
   **Convenience aliases:**
   ```
   POST /api/v1/dataset/{id}/query        -> resolves to type=table
   POST /api/v1/semantic_view/{id}/query  -> resolves to type=semantic_view
   ```
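
As a rough illustration of how the aliases could resolve to the primary route, here is a minimal sketch. The function `resolve_query_path` and the `ALIAS_TO_TYPE` table are hypothetical names used only for this example; actual routing would be handled by Superset's API layer, not by regex matching.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup: alias path segment -> DatasourceType value from the SIP.
ALIAS_TO_TYPE = {"dataset": "table", "semantic_view": "semantic_view"}

_CANONICAL = re.compile(r"^/api/v1/datasource/(table|semantic_view)/(\d+)/query$")
_ALIAS = re.compile(r"^/api/v1/(dataset|semantic_view)/(\d+)/query$")


def resolve_query_path(path: str) -> Optional[Tuple[str, int]]:
    """Return (datasource_type, datasource_id) for a query path, or None."""
    match = _CANONICAL.match(path)
    if match:
        return match.group(1), int(match.group(2))
    match = _ALIAS.match(path)
    if match:
        return ALIAS_TO_TYPE[match.group(1)], int(match.group(2))
    return None
```

Both `/api/v1/dataset/42/query` and `/api/v1/datasource/table/42/query` resolve to the same `("table", 42)` pair under this sketch.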
   
   This endpoint will accept a simplified JSON payload and execute it via a new 
`DatasourceQueryCommand`. This command will dispatch to the appropriate 
execution path based on datasource type:
   
   * **For datasets (`table`):** Converts the simplified payload to a 
`QueryObject` via `QueryObjectFactory`, then executes through the existing SQL 
pipeline.
   * **For semantic views (`semantic_view`):** The payload maps directly to 
`SemanticQuery` and passes through the semantic layer provider (Snowflake, dbt, 
etc.).
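
The dispatch described above might look roughly like the following sketch. The two executor functions are hypothetical stand-ins: they only echo which path was taken, where the real command would call `QueryObjectFactory` or the semantic layer provider. `dispatch_query` and `EXECUTORS` are likewise illustrative names, not part of Superset.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
Rows = List[Dict[str, Any]]


def run_dataset_query(datasource_id: int, payload: Payload) -> Rows:
    # Stand-in for: payload -> QueryObjectFactory -> QueryObject -> SQL execution.
    return [{"path": "table", "id": datasource_id}]


def run_semantic_view_query(datasource_id: int, payload: Payload) -> Rows:
    # Stand-in for: payload -> SemanticQuery -> provider execution.
    return [{"path": "semantic_view", "id": datasource_id}]


# Hypothetical registry keyed by the datasource type from the URL.
EXECUTORS: Dict[str, Callable[[int, Payload], Rows]] = {
    "table": run_dataset_query,
    "semantic_view": run_semantic_view_query,
}


def dispatch_query(datasource_type: str, datasource_id: int, payload: Payload) -> Rows:
    """Route a simplified payload to the execution path for its datasource type."""
    executor = EXECUTORS.get(datasource_type)
    if executor is None:
        supported = ", ".join(sorted(EXECUTORS))
        raise ValueError(
            f"Unsupported datasource type {datasource_type!r}; expected one of: {supported}"
        )
    return executor(datasource_id, payload)
```

Keeping the registry as plain data means a further `Explorable` type could be added without touching the dispatch function itself.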
   
   Key Architectural Principles:
   
   * **Library-First:** The core `DatasourceQueryCommand` interface will reside 
in `superset-core` and will not depend on the Flask request context, making it 
usable by the SIP-187 MCP service and extensions.
   * **Shared Logic:** For datasets, the command utilizes `QueryObjectFactory` 
to satisfy SIP-185 calculation parity. For semantic views, it uses the existing 
mapper and provider infrastructure.
   * **Unified Query Model:** Adopts semantic layer terminology (`dimensions` 
instead of `columns`) as the standard query vocabulary across all datasource 
types.
   * **Schema Validation:** Reuses the semantic layer mapper's validation logic 
(metric/dimension name checking against the datasource's definitions) to 
provide helpful error messages.
   * **Capability Awareness:** The API reports what features a datasource 
supports, since not all datasources support all query features (e.g., 
`GROUP_LIMIT`, `ADHOC_EXPRESSIONS_IN_ORDERBY`).
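
To make the "helpful error messages" point concrete, here is a minimal sketch of name checking with suggestions. It does not reproduce the mapper's actual validation logic; `validate_names` is a hypothetical helper, and the use of `difflib` for suggestions is an assumption made for this example.

```python
from difflib import get_close_matches
from typing import Iterable, List


def validate_names(kind: str, requested: Iterable[str], defined: Iterable[str]) -> List[str]:
    """Return one error message per requested name the datasource does not define."""
    known = sorted(set(defined))
    errors: List[str] = []
    for name in requested:
        if name in known:
            continue
        # Suggest the closest defined name, if any is similar enough.
        suggestions = get_close_matches(name, known, n=1)
        hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
        errors.append(f"Unknown {kind} {name!r}.{hint}")
    return errors
```

Returning a list rather than raising on the first problem lets a caller, such as an AI agent, correct every misspelled metric or dimension in one round trip.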
   
   
   ### New or Changed Public Interfaces
   
   #### Query Endpoint
   
   **Endpoint:** `POST /api/v1/datasource/{type}/{id}/query`
   
   **Request Payload:**
   
   ```json
   {
     "dimensions": ["region", "product_category"],
     "metrics": ["sum__sales", "unique_users"],
     "filters": [
       {"col": "order_date", "op": "TEMPORAL_RANGE", "val": "Last 7 days"}
     ],
     "order": [{"column": "sum__sales", "descending": true}],
     "limit": 100,
     "offset": 0,
     "result_format": "json",
     "time_grain": "P1D"
   }
   ```
   
   Key terminology changes from the original proposal:
   * `columns` -> `dimensions` (aligns with semantic layer convention; 
"columns" is ambiguous)
   * `series_limit` -> `limit` / `offset` (simpler, standard pagination)
   * `order_desc` -> `order` array (more flexible, supports multi-column 
ordering)
   * Added `time_grain` using ISO 8601 duration format (e.g., `P1D` for day, 
`PT1H` for hour), consistent with the `Grain` type in `superset-core`
   
   **Filter format:**
   
   Filters follow the same structure used in both Superset's existing filter 
system and the semantic layer, ensuring compatibility:
   
   ```json
   {
     "col": "column_name",
     "op": "EQUALS | NOT_EQUALS | IN | NOT_IN | GREATER_THAN | LESS_THAN | TEMPORAL_RANGE | IS_NULL | IS_NOT_NULL",
     "val": "value or [values]"
   }
   ```
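
A filter object in this shape could be checked with something like the sketch below. The operator set comes from the list above; the rules about `val` (a list for `IN`/`NOT_IN`, optional for the null checks) are assumptions made for illustration, and `check_filter` is a hypothetical helper.

```python
from typing import Any, Dict, List

FILTER_OPS = frozenset({
    "EQUALS", "NOT_EQUALS", "IN", "NOT_IN", "GREATER_THAN",
    "LESS_THAN", "TEMPORAL_RANGE", "IS_NULL", "IS_NOT_NULL",
})
LIST_OPS = frozenset({"IN", "NOT_IN"})               # assumed to need a list value
NO_VALUE_OPS = frozenset({"IS_NULL", "IS_NOT_NULL"})  # assumed to need no value


def check_filter(flt: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one filter object (empty if none)."""
    errors: List[str] = []
    col = flt.get("col")
    if not isinstance(col, str) or not col:
        errors.append("'col' must be a non-empty string")
    op = flt.get("op")
    if not isinstance(op, str) or op not in FILTER_OPS:
        errors.append(f"unknown operator {op!r}")
        return errors
    if op in LIST_OPS:
        if not isinstance(flt.get("val"), list):
            errors.append(f"{op} requires 'val' to be a list")
    elif op not in NO_VALUE_OPS and "val" not in flt:
        errors.append(f"{op} requires a 'val'")
    return errors
```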
   
   **Response Payload:**
   
   ```json
   {
     "result": {
       "data": [...],
       "columns": [
         {"name": "region", "type": "STRING", "is_dimension": true},
         {"name": "sum__sales", "type": "NUMERIC", "is_dimension": false}
       ],
       "row_count": 100,
       "query_id": "...",
       "cached": false,
       "cache_timeout": 300
     },
     "datasource": {
       "type": "table",
       "id": 42,
       "name": "sales_dataset"
     },
     "capabilities": {
       "supports_group_limit": true,
       "supports_adhoc_orderby": true,
       "supports_rls": true,
       "query_language": "sql"
     }
   }
   ```
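
On the consumer side, the `is_dimension` flag in the column metadata lets a client separate grouping columns from measures without knowing the datasource. A minimal sketch, assuming the response has been parsed into a Python dict with the shape shown above (`split_columns` is a hypothetical helper):

```python
from typing import Any, Dict, List, Tuple


def split_columns(response: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Split result column names into (dimensions, metrics) using is_dimension."""
    columns = response["result"]["columns"]
    dimensions = [c["name"] for c in columns if c["is_dimension"]]
    metrics = [c["name"] for c in columns if not c["is_dimension"]]
    return dimensions, metrics
```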
   
   **Result formats:**
   * `json` -- Default, JSON array of row objects
   * `csv` -- CSV text response
   * `arrow` -- Apache Arrow IPC format (high-performance; aligns with semantic 
layer's internal use of PyArrow)
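
The `json` and `csv` formats can be sketched with the standard library alone, as below. `serialize_rows` is a hypothetical helper, and the `arrow` branch is left unimplemented here because it would need PyArrow; this is not how Superset serializes results today.

```python
import csv
import io
import json
from typing import Any, Dict, List


def serialize_rows(rows: List[Dict[str, Any]], columns: List[str], result_format: str = "json") -> str:
    """Serialize row objects as a JSON array or as CSV text with a header row."""
    if result_format == "json":
        return json.dumps(rows)
    if result_format == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    if result_format == "arrow":
        # Arrow IPC needs PyArrow; omitted to keep this sketch dependency-free.
        raise NotImplementedError("arrow serialization is not part of this sketch")
    raise ValueError(f"unknown result_format {result_format!r}")
```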
   
   #### Capabilities Endpoint
   
   **Endpoint:** `GET /api/v1/datasource/{type}/{id}/capabilities`
   
   Returns the query features the datasource supports, following the 
`SemanticViewFeature` enum pattern:
   
   ```json
   {
     "supports_group_limit": true,
     "supports_adhoc_orderby": true,
     "supports_rls": true,
     "supports_time_comparison": true,
     "query_language": "sql",
     "compatible_metric_filtering": false
   }
   ```
   
   This enables clients to adapt their queries based on datasource capabilities 
before executing.
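
One way a client might adapt a query is to drop payload keys the datasource cannot honor, as in the sketch below. The mapping from payload keys to capability flags is an assumption: `time_offsets` appears in this SIP as a planned parameter, while `group_limit` is a hypothetical key name used only for illustration.

```python
from typing import Any, Dict, List

# Assumed mapping: optional payload key -> capability flag that must be true.
FEATURE_FLAGS = {
    "time_offsets": "supports_time_comparison",
    "group_limit": "supports_group_limit",  # hypothetical payload key
}


def unsupported_features(payload: Dict[str, Any], capabilities: Dict[str, Any]) -> List[str]:
    """List payload keys whose required capability is missing or false."""
    return [
        key
        for key, flag in FEATURE_FLAGS.items()
        if key in payload and not capabilities.get(flag, False)
    ]


def strip_unsupported(payload: Dict[str, Any], capabilities: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the payload without the unsupported keys."""
    blocked = set(unsupported_features(payload, capabilities))
    return {key: value for key, value in payload.items() if key not in blocked}
```

Whether a client should silently strip features or fail loudly is a design choice; `unsupported_features` supports either approach.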
   
   
   ### Execution Architecture
   
   ```
   Client Request (simplified payload)
          |
          v
   DatasourceQueryCommand (in superset-core)
          |
          +-- validates dimensions/metrics against datasource definitions
          |   (reuses mapper validation logic)
          |
          +-- dispatches based on datasource type:
          |
          +--[table]-----------> QueryObjectFactory -> QueryObject -> SQL execution
          |                      (SIP-185 calculation parity)
          |
          +--[semantic_view]---> SemanticQuery (direct mapping) -> Provider execution
          |                      (Snowflake, dbt, Cube, etc.)
          |
          v
   Unified QueryResult -> Response (json / csv / arrow)
   ```
   
   
   ### New Dependencies
   
   None. This proposal reuses existing infrastructure:
   
   * `SemanticQuery` types from `superset-core` (PR #37815)
   * `Explorable` protocol from `superset/explorables/base.py` (PR #37816)
   * `DatasourceDAO.sources_dict` for datasource type dispatch (PR #37817)
   * `QueryObjectFactory` for dataset query execution (existing)
   * Mapper validation logic from `superset/semantic_layers/mapper.py` (PR 
#37815)
   
   
   ### Migration Plan and Compatibility
   
   * This is a new endpoint; no breaking changes to existing APIs.
   * No database migrations required.
   * The existing `POST /api/v1/chart/data` endpoint continues to work 
unchanged.
   * The convenience alias `POST /api/v1/dataset/{id}/query` preserves the 
original SIP-199 intent for callers that only need dataset support.
   * Semantic layer PRs must land first (at minimum PR #37815 for types and PR 
#37816 for models/Explorable protocol).
   
   
   ### Implementation Phases
   
   **Phase 1 -- Dataset Query API (MVP)**
   * Implement `POST /api/v1/dataset/{id}/query` using `QueryObjectFactory`
   * Accept the unified payload schema (dimensions, metrics, filters)
   * Return JSON results
   * Validate metric/dimension names against dataset definitions
   
   **Phase 2 -- Unified Datasource Endpoint**
   * Add `POST /api/v1/datasource/{type}/{id}/query` with dispatch logic
   * Add semantic view support (once semantic layer PRs are merged)
   * Add capabilities endpoint
   
   **Phase 3 -- Advanced Formats and Features**
   * Arrow result format support
   * Time comparison queries (`time_offsets` parameter)
   * Group limit support
   * Compatible metric/dimension filtering (for semantic layers that require it)
   
   
   ### Rejected Alternatives
   
   1. **Keep using the existing `POST /api/v1/chart/data` API**
      Rejected because it requires chart-specific metadata and a complex 
QueryContext payload unsuitable for headless consumption.
   
   2. **Build a dataset-only endpoint with its own query model**
      Rejected because the semantic layer work already defines `SemanticQuery` 
as a simplified query representation. Building a parallel model creates 
fragmentation. The semantic layer naming conventions (dimensions, metrics) are 
industry-standard and should be adopted.
   
   3. **Use `columns` instead of `dimensions` in the payload**
      Rejected because `dimensions` is the standard semantic layer term (used 
by dbt, Snowflake Cortex, Cube, Minerva, and the Superset semantic layer 
implementation). Using `columns` would create a vocabulary mismatch between the 
dataset query API and the semantic view query API.
   
   4. **Separate endpoints per datasource type with different schemas**
      Rejected because a unified schema and dispatch mechanism is simpler for 
consumers (especially AI agents) and avoids API proliferation.
   


