RahulBisht001 opened a new issue, #37427:
URL: https://github.com/apache/superset/issues/37427
# SIP: Git-backed Version Control for Datasets in Apache Superset
## 1. Motivation
Apache Superset is widely used as a self-service BI and data exploration
tool. While charts and dashboards are visually inspectable and often reviewed
collaboratively, **datasets—especially SQL-based virtual datasets—form the
critical foundation** of all downstream analytics.
Currently, Superset stores dataset definitions (including SQL queries) in
its metadata database. This creates several problems:
* **No native versioning**: Changes to SQL queries overwrite previous
definitions, making it difficult or impossible to recover prior logic.
* **Poor traceability**: Teams cannot easily answer *who changed what, when,
and why*.
* **Risk of accidental data loss**: SQL logic can be lost due to mistakes,
overwrites, or instance-level failures.
* **Lack of review workflow**: Dataset changes cannot be reviewed using
standard engineering practices like pull requests.
In modern data teams, SQL is treated as code. However, Superset currently
treats it as mutable application state.
This SIP proposes introducing **optional Git-backed version control for
Superset datasets**, enabling teams to manage dataset SQL definitions as code.
---
## 2. Problem Statement
There is no built-in mechanism in Apache Superset to:
* Version datasets or dataset SQL definitions
* Track historical changes to datasets
* Enforce review or approval workflows for dataset changes
* Restore previous dataset definitions after accidental modification
This gap becomes more severe as Superset adoption grows in production and
regulated environments.
---
## 3. Proposed Solution (High Level)
Introduce an **optional Git integration layer** for dataset management in
Apache Superset.
Key idea:
> Treat datasets (especially SQL-based virtual datasets) as version-controlled artifacts stored in a Git repository.
Superset remains the UI and execution layer, while Git becomes the **source
of truth** for the definitions of datasets that opt in.
---
## 4. Scope of Versioning
Initial scope (intentionally limited):
* **SQL-based virtual datasets**
* Dataset metadata:
  * Dataset name
  * Database connection reference
  * SQL query
  * Schema (optional)
  * Description / owner
Out of scope (for initial version):
* Dashboards and charts
* Native table metadata
* Row-level security rules
---
## 5. Architecture Overview
### 5.1 Git Repository Structure (Example)
```text
superset-datasets/
├── sales/
│   ├── daily_revenue.sql
│   └── daily_revenue.yaml
└── marketing/
    ├── campaign_performance.sql
    └── campaign_performance.yaml
```
* `.sql` files contain dataset queries
* `.yaml` (or `.json`) contains Superset metadata
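
As a rough illustration of the layout above, the following sketch renders one dataset into its two files. It is not Superset code: `DatasetDefinition` and `render_files` are hypothetical names, the field list is taken from the scope in Section 4, and the metadata is written as JSON (the `.json` variant) only so that the example needs nothing beyond the standard library.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class DatasetDefinition:
    """Hypothetical in-memory form of a Git-backed dataset (not a Superset class)."""

    name: str
    database: str  # database connection reference
    sql: str
    schema: Optional[str] = None
    description: str = ""
    owner: str = ""


def render_files(folder: str, ds: DatasetDefinition) -> dict[str, str]:
    """Return {relative_path: file_content} for one dataset."""
    base = PurePosixPath(folder) / ds.name
    metadata = asdict(ds)
    metadata.pop("sql")  # the query is stored in the .sql file only
    return {
        f"{base}.sql": ds.sql.strip() + "\n",
        # sort_keys keeps the output stable so diffs only show real changes
        f"{base}.json": json.dumps(metadata, indent=2, sort_keys=True) + "\n",
    }
```

Keeping the query in its own `.sql` file means a pull request shows SQL changes as a plain text diff, separate from metadata changes.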
---
### 5.2 Superset Integration Flow
1. Organization configures a Git provider (GitHub / GitLab / Bitbucket)
2. Superset instance is linked to a repository (read-only or read-write)
3. Each dataset can optionally be marked as **Git-backed**
4. Dataset lifecycle:
   * Create / update dataset → change is written to Git
   * A commit message is required for each change
   * Superset syncs the dataset definition from the repository
---
## 6. Change Management Models
### Model A: Superset → Git (Push-based)
* User edits dataset in Superset UI
* Superset prompts for:
  * Commit message (required)
  * Branch (optional)
* Superset commits changes to Git
Best for teams starting with Superset-first workflows.
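
A minimal sketch of the push-based save, assuming Superset would shell out to the `git` CLI (a real implementation might use a Git library or a provider API instead). `plan_commit` is a hypothetical helper, and the fallback branch name `main` and remote name `origin` are assumptions. The function only builds the commands; it does not run them.

```python
from typing import Optional


def plan_commit(path: str, message: str, branch: Optional[str] = None) -> list[list[str]]:
    """Return the git commands a push-based save might run, without executing them."""
    if not message.strip():
        # Model A requires a commit message for every dataset change
        raise ValueError("a commit message is required")
    target = branch or "main"  # assumed default when no branch is given
    return [
        ["git", "checkout", target],
        ["git", "add", "--", path],
        ["git", "commit", "-m", message.strip()],
        ["git", "push", "origin", target],  # "origin" is an assumed remote name
    ]
```

For example, `plan_commit("sales/daily_revenue.sql", "Fix revenue filter")` yields four commands targeting `main`, while an empty message is rejected before anything touches the repository.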
---
### Model B: Git → Superset (Pull-based)
* Dataset SQL edited directly in Git
* Superset syncs periodically or when triggered by a webhook
* Changes are reflected in Superset UI
Best for engineering-heavy teams.
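
One way the pull-based sync could decide what to update is to compare the repository's current files against content digests recorded at the last sync. The sketch below assumes that approach; `digest` and `plan_sync` are hypothetical names, and how Superset would store the previous digests is left open.

```python
import hashlib


def digest(text: str) -> str:
    """SHA-256 hex digest of a file's text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(repo_files: dict[str, str], synced: dict[str, str]) -> dict[str, list[str]]:
    """Compare repository content with the digests recorded at the last sync.

    repo_files maps path -> current file content.
    synced maps path -> digest stored after the previous sync.
    """
    current = {path: digest(body) for path, body in repo_files.items()}
    return {
        "added": sorted(p for p in current if p not in synced),
        "changed": sorted(p for p in current if p in synced and current[p] != synced[p]),
        "removed": sorted(p for p in synced if p not in current),
    }
```

Only paths listed under `added` or `changed` would need to be re-imported, which keeps a periodic sync cheap when nothing has moved.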
---
## 7. Governance and Controls
Optional enforcement policies:
* Require Git-backed datasets for production workspaces
* Disable direct UI editing (read-only mode)
* Enforce pull-request-based approvals
* Map Superset users to Git identities
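
To make the first two policies concrete, here is a small sketch of how an edit request in the UI might be checked. `GovernancePolicy` and `can_edit_in_ui` are hypothetical names, not existing Superset settings, and the messages are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernancePolicy:
    """Hypothetical per-workspace switches for two of the policies listed above."""

    require_git_backed: bool = False  # datasets must be Git-backed
    ui_read_only: bool = False        # Git-backed datasets cannot be edited in the UI


def can_edit_in_ui(policy: GovernancePolicy, git_backed: bool) -> tuple[bool, str]:
    """Return (allowed, reason); reason is empty when the edit is allowed."""
    if policy.require_git_backed and not git_backed:
        return False, "dataset must be Git-backed in this workspace"
    if policy.ui_read_only and git_backed:
        return False, "edit this dataset through the Git repository"
    return True, ""
```

With both switches off, which matches the opt-in default, every edit is allowed and existing behavior is unchanged.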
---
## 8. Benefits
### 8.1 For Users
* Recover SQL logic after overwrites or mistakes
* Clear change history
* Ability to roll back
* Confidence in dataset correctness
### 8.2 For Organizations
* Improved data governance
* Auditability and compliance
* Standardized dataset review process
* Reduced production incidents
### 8.3 For Superset Ecosystem
* Aligns Superset with modern Analytics Engineering practices
* Encourages GitOps-style workflows
* Makes Superset more enterprise-ready
---
## 9. Backward Compatibility
* Feature is **opt-in**
* Existing datasets continue to work unchanged
* No breaking changes to metadata storage
---
## 10. Implementation Considerations
* Handling merge conflicts
* Secure storage of Git credentials
* Multi-branch support
* Performance impact of sync operations
Each of these can be addressed incrementally, starting from the limited scope in Section 4.
---
## 11. Future Extensions
* Chart and dashboard versioning
* Dataset lineage visualization
* CI checks for SQL validity
* Integration with dbt and semantic layers
---
## 12. Conclusion
This SIP proposes a focused, opt-in Git integration for dataset versioning
in Apache Superset. By treating datasets as code, Superset can significantly
improve reliability, collaboration, and trust in analytics workflows.
This feature fills a real gap for production-grade usage and aligns Superset
with industry best practices.