[I] Git-backed Version Control for Datasets in Apache Superset [superset]

via GitHub Sat, 24 Jan 2026 21:05:50 -0800


RahulBisht001 opened a new issue, #37427:
URL: https://github.com/apache/superset/issues/37427


   # SIP: Git-backed Version Control for Datasets in Apache Superset
   
   ## 1. Motivation
   
   Apache Superset is widely used as a self-service BI and data exploration 
tool. Charts and dashboards are visually inspectable and often reviewed 
collaboratively, but **datasets, especially SQL-based virtual datasets, form 
the critical foundation** of all downstream analytics.
   
   Currently, Superset stores dataset definitions (including SQL queries) in 
its metadata database. This creates several problems:
   
   * **No native versioning**: Changes to SQL queries overwrite previous 
definitions, making it difficult or impossible to recover prior logic.
   * **Poor traceability**: Teams cannot easily answer *who changed what, when, 
and why*.
   * **Risk of accidental data loss**: SQL logic can be lost due to mistakes, 
overwrites, or instance-level failures.
   * **Lack of review workflow**: Dataset changes cannot be reviewed using 
standard engineering practices like pull requests.
   
   In modern data teams, SQL is treated as code. However, Superset currently 
treats it as mutable application state.
   
   This SIP proposes introducing **optional Git-backed version control for 
Superset datasets**, enabling teams to manage dataset SQL definitions as code.
   
   ---
   
   ## 2. Problem Statement
   
   There is no built-in mechanism in Apache Superset to:
   
   * Version datasets or dataset SQL definitions
   * Track historical changes to datasets
   * Enforce review or approval workflows for dataset changes
   * Restore previous dataset definitions after accidental modification
   
   This gap becomes more severe as Superset adoption grows in production and 
regulated environments.
   
   ---
   
   ## 3. Proposed Solution (High Level)
   
   Introduce an **optional Git integration layer** for dataset management in 
Apache Superset.
   
   Key idea:
   
   > Treat datasets (especially SQL-based virtual datasets) as 
version-controlled artifacts stored in a Git repository.
   
   Superset remains the UI and execution layer, while Git becomes the **source 
of truth** for dataset definitions.
   
   ---
   
   ## 4. Scope of Versioning
   
   Initial scope (intentionally limited):
   
   * **SQL-based virtual datasets**
   * Dataset metadata:
   
     * Dataset name
     * Database connection reference
     * SQL query
     * Schema (optional)
     * Description / owner
   
   Out of scope (for initial version):
   
   * Dashboards and charts
   * Native table metadata
   * Row-level security rules
   
   ---
   
   ## 5. Architecture Overview
   
   ### 5.1 Git Repository Structure (Example)
   
   ```text
   superset-datasets/
     ├── sales/
     │   ├── daily_revenue.sql
     │   └── daily_revenue.yaml
     └── marketing/
         ├── campaign_performance.sql
         └── campaign_performance.yaml
   ```
   
   * `.sql` files contain dataset queries
   * `.yaml` (or `.json`) files contain the Superset metadata for each dataset
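
   To make the layout concrete, here is a minimal sketch of writing one 
dataset into that structure. It is an illustration under assumptions, not a 
proposed implementation: `VirtualDataset` and `write_dataset` are hypothetical 
names, the fields are taken from the scope list in section 4, and the metadata 
is written as `.json` only so the sketch needs nothing beyond the standard 
library.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass
class VirtualDataset:
    """In-scope fields from section 4 (the class itself is hypothetical)."""
    name: str
    database: str  # database connection reference
    sql: str
    schema: Optional[str] = None
    description: str = ""
    owner: str = ""


def write_dataset(repo_root: Path, folder: str, ds: VirtualDataset) -> Tuple[Path, Path]:
    """Write <folder>/<name>.sql and <folder>/<name>.json under repo_root."""
    target = repo_root / folder
    target.mkdir(parents=True, exist_ok=True)
    sql_path = target / f"{ds.name}.sql"
    meta_path = target / f"{ds.name}.json"

    # Keep the query in its own file so Git diffs show SQL changes line by line.
    sql_path.write_text(ds.sql.rstrip() + "\n", encoding="utf-8")

    # Everything except the SQL goes into the metadata file; sorted keys keep
    # the output stable so unchanged metadata produces no diff.
    meta = {key: value for key, value in asdict(ds).items() if key != "sql"}
    meta_path.write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return sql_path, meta_path


# Usage: write a sample dataset into a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dataset = VirtualDataset(
        name="daily_revenue",
        database="warehouse",  # placeholder connection name
        sql="SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
        owner="sales-analytics",
    )
    sql_path, meta_path = write_dataset(Path(tmp), "sales", dataset)
    written_sql = sql_path.read_text(encoding="utf-8")
    written_meta = json.loads(meta_path.read_text(encoding="utf-8"))
    relative_sql_path = sql_path.relative_to(tmp).as_posix()
```

   Splitting the query from the metadata means a pull request that only 
changes SQL touches only the `.sql` file, which keeps reviews focused.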
   
   ---
   
   ### 5.2 Superset Integration Flow
   
   1. Organization configures a Git provider (GitHub / GitLab / Bitbucket)
   2. Superset instance is linked to a repository (read-only or read-write)
   3. Each dataset can optionally be marked as **Git-backed**
   4. Dataset lifecycle for a Git-backed dataset:
   
      * Creating or updating the dataset writes the change to Git
      * Every change requires a commit message
      * Superset syncs the dataset definition from the repository
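
   Steps 1 and 2 could be expressed as instance-level settings. The sketch 
below is purely illustrative: `GIT_DATASET_SYNC`, every key inside it, the 
`SUPERSET_GIT_TOKEN` environment variable, and `validate_git_settings` are 
assumptions made for this example and do not exist in Superset today. The 
repository name is a placeholder.

```python
import os

# Hypothetical settings: none of these keys exist in Superset today.
GIT_DATASET_SYNC = {
    "provider": "github",                            # step 1: github / gitlab / bitbucket
    "repository": "example-org/superset-datasets",   # step 2: placeholder repository
    "access": "read-write",                          # or "read-only"
    "default_branch": "main",
    "require_commit_message": True,
    # Read the credential from the environment instead of hard-coding it.
    "token": os.environ.get("SUPERSET_GIT_TOKEN", ""),
}

ALLOWED_PROVIDERS = {"github", "gitlab", "bitbucket"}
ALLOWED_ACCESS = {"read-only", "read-write"}


def validate_git_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means usable."""
    problems = []
    if settings.get("provider") not in ALLOWED_PROVIDERS:
        problems.append("provider must be one of: " + ", ".join(sorted(ALLOWED_PROVIDERS)))
    if not settings.get("repository"):
        problems.append("repository is required")
    if settings.get("access") not in ALLOWED_ACCESS:
        problems.append("access must be read-only or read-write")
    return problems


problems = validate_git_settings(GIT_DATASET_SYNC)
```

   Validating the settings at startup would surface a misconfigured provider 
or access mode before any dataset is marked as Git-backed.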
   
   ---
   
   ## 6. Change Management Models
   
   ### Model A: Superset → Git (Push-based)
   
   * User edits dataset in Superset UI
   * Superset requires:
   
     * Commit message
     * Branch (optional)
   * Superset commits changes to Git
   
   Best for teams starting with Superset-first workflows.
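
   As a rough sketch of the push-based model, the function below turns a UI 
edit into a commit request. It enforces the required commit message and falls 
back to a default branch when none is given. `CommitRequest` and 
`build_commit_request` are hypothetical names, and the step that actually 
talks to the Git provider is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CommitRequest:
    """What Superset would hand to a Git client (hypothetical shape)."""
    branch: str
    message: str
    author: str
    files: Dict[str, str]  # repository path -> new file content


def build_commit_request(
    dataset_path: str,
    sql: str,
    metadata_text: str,
    message: str,
    author: str,
    branch: Optional[str] = None,
    default_branch: str = "main",
) -> CommitRequest:
    """Validate a dataset edit and describe the commit it should produce."""
    message = message.strip()
    if not message:
        # Model A requires a commit message for every change.
        raise ValueError("a commit message is required for Git-backed datasets")
    return CommitRequest(
        branch=branch or default_branch,  # the branch is optional
        message=message,
        author=author,
        files={
            f"{dataset_path}.sql": sql,
            f"{dataset_path}.yaml": metadata_text,
        },
    )


request = build_commit_request(
    dataset_path="sales/daily_revenue",
    sql="SELECT 1\n",
    metadata_text="name: daily_revenue\n",
    message="Exclude refunded orders from revenue",
    author="analyst@example.com",
)
```

   Keeping validation separate from the Git call would let the same check 
run in the UI before the user submits the change.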
   
   ---
   
   ### Model B: Git → Superset (Pull-based)
   
   * Dataset SQL is edited directly in Git
   * Superset syncs periodically or when triggered by a webhook
   * Changes are reflected in the Superset UI
   
   Best for engineering-heavy teams.
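
   One possible way to implement the sync step is to compare content hashes 
of the repository files against the hashes recorded at the last sync. The 
sketch below is an assumption about how that could work, not part of the 
proposal itself; `plan_sync`, `SyncPlan`, and the sample paths are 
hypothetical.

```python
import hashlib
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    to_create: List[str]        # in the repository, never synced before
    to_update: List[str]        # content changed since the last sync
    unchanged: List[str]
    missing_in_repo: List[str]  # synced earlier, no longer in the repository


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(repo_files: Dict[str, str], synced_hashes: Dict[str, str]) -> SyncPlan:
    """Compare repository content with the hashes stored at the last sync."""
    to_create, to_update, unchanged = [], [], []
    for path in sorted(repo_files):
        digest = content_hash(repo_files[path])
        if path not in synced_hashes:
            to_create.append(path)
        elif synced_hashes[path] != digest:
            to_update.append(path)
        else:
            unchanged.append(path)
    # Report removed files instead of deleting datasets automatically.
    missing = sorted(set(synced_hashes) - set(repo_files))
    return SyncPlan(to_create, to_update, unchanged, missing)


repo_files = {
    "sales/daily_revenue.sql": "SELECT 2\n",
    "sales/weekly_revenue.sql": "SELECT 3\n",
    "marketing/campaign_performance.sql": "SELECT 1\n",
}
synced_hashes = {
    "sales/daily_revenue.sql": content_hash("SELECT 1\n"),
    "marketing/campaign_performance.sql": content_hash("SELECT 1\n"),
    "finance/old_report.sql": content_hash("SELECT 0\n"),
}
plan = plan_sync(repo_files, synced_hashes)
```

   The same plan could be computed whether the sync is started by a timer or 
by a webhook, so the trigger and the comparison logic stay independent.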
   
   ---
   
   ## 7. Governance and Controls
   
   Optional enforcement policies:
   
   * Require Git-backed datasets for production workspaces
   * Disable direct UI editing (read-only mode)
   * Enforce pull-request-based approvals
   * Map Superset users to Git identities
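
   These policies could be modeled as a small set of flags checked before a 
UI edit is accepted. The following is a sketch under assumptions: 
`GovernancePolicy`, `check_ui_edit`, `resolve_git_identity`, and the idea of a 
single protected branch are all hypothetical and only illustrate how the four 
bullets might combine.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class GovernancePolicy:
    require_git_backed: bool = False    # e.g. for production workspaces
    ui_read_only: bool = False          # disable direct UI editing
    require_pull_request: bool = False  # no direct commits to the protected branch


def check_ui_edit(
    policy: GovernancePolicy,
    git_backed: bool,
    target_branch: str,
    protected_branch: str = "main",
) -> Tuple[bool, str]:
    """Return (allowed, reason) for an attempted edit from the Superset UI."""
    if policy.ui_read_only:
        return False, "direct UI editing is disabled"
    if policy.require_git_backed and not git_backed:
        return False, "dataset must be Git-backed in this workspace"
    if policy.require_pull_request and target_branch == protected_branch:
        return False, "changes to the protected branch need a pull request"
    return True, "ok"


def resolve_git_identity(username: str, identity_map: Dict[str, str]) -> Optional[str]:
    """Map a Superset username to a Git identity, or None if unmapped."""
    return identity_map.get(username)


production = GovernancePolicy(require_git_backed=True, require_pull_request=True)
direct_edit = check_ui_edit(production, git_backed=True, target_branch="main")
branch_edit = check_ui_edit(production, git_backed=True, target_branch="fix/refunds")
identity = resolve_git_identity("rbisht", {"rbisht": "Rahul <rahul@example.com>"})
```

   Returning a reason alongside the decision would let the UI explain why an 
edit was blocked instead of failing silently.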
   
   ---
   
   ## 8. Benefits
   
   ### 8.1 For Users
   
   * Never lose SQL logic
   * Clear change history
   * Ability to roll back
   * Confidence in dataset correctness
   
   ### 8.2 For Organizations
   
   * Improved data governance
   * Auditability and compliance
   * Standardized dataset review process
   * Reduced production incidents
   
   ### 8.3 For Superset Ecosystem
   
   * Aligns Superset with modern Analytics Engineering practices
   * Encourages GitOps-style workflows
   * Makes Superset more enterprise-ready
   
   ---
   
   ## 9. Backward Compatibility
   
   * Feature is **opt-in**
   * Existing datasets continue to work unchanged
   * No breaking changes to metadata storage
   
   ---
   
   ## 10. Implementation Considerations
   
   * Handling merge conflicts
   * Secure storage of Git credentials
   * Multi-branch support
   * Performance impact of sync operations
   
   These concerns can be addressed incrementally.
   
   ---
   
   ## 11. Future Extensions
   
   * Chart and dashboard versioning
   * Dataset lineage visualization
   * CI checks for SQL validity
   * Integration with dbt and semantic layers
   
   ---
   
   ## 12. Conclusion
   
   This SIP proposes a focused, opt-in Git integration for dataset versioning 
in Apache Superset. By treating datasets as code, Superset can significantly 
improve reliability, collaboration, and trust in analytics workflows.
   
   This feature fills a real gap for production-grade usage and aligns Superset 
with industry best practices.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


