# Automatic DagBundle Configuration Loading

GitHub user raiffeisenbankinternational-bot edited a discussion:
## Summary
This proposal requests a new feature in Apache Airflow 3.x to automatically
load DagBundle configurations from local JSON files using the `file://`
protocol, with automatic reloading support.
**Current Limitation**: Airflow 3.x requires DagBundle configurations to be
manually embedded as JSON strings in `airflow.cfg` or set as environment
variables, making configurations difficult to manage and update.
**Proposed Solution**: Extend the existing `dag_bundle_config_list` parameter
to support `file://` URLs pointing to local JSON files, and add a
`dag_bundle_config_list_watch` parameter to enable automatic reloading when the
file changes.
---
## Problem Statement
### Current Airflow 3.x DagBundle Configuration
Airflow 3.x introduced DagBundles as a powerful way to load DAGs from multiple
Git repositories, enabling decentralized DAG management. However, the
configuration mechanism has limitations:
**Current Approach:**
```ini
[dag_processor]
dag_bundle_config_list = [{"name": "team1", "classpath": "airflow.providers.git.bundles.git.GitDagBundle", "kwargs": {...}}, ...]
```
**Limitations:**
1. **JSON in Config Files**: Large JSON arrays embedded in INI files are
difficult to read and maintain
2. **No File Separation**: Cannot split configuration into separate, manageable
files
3. **Manual Updates Required**: Changing configurations requires editing
`airflow.cfg` or environment variables
4. **No Hot Reloading**: Configuration changes require restarting Airflow
services
5. **Version Control Challenges**: Hard to track configuration changes in
version control when embedded in config files
### Current Workaround
Organizations currently implement manual workarounds:
1. Generate `dag_bundle_config.json` externally (CI/CD, automation tools)
2. Parse JSON and convert to string
3. **Manually append to `airflow.cfg` or set as environment variable** (manual
step)
4. Restart Airflow services
**This manual conversion and restart process breaks automation and introduces
operational complexity.**
---
## Use Case: Multi-Team Airflow Environment with GitOps Onboarding
### Environment Overview
Our organization runs Apache Airflow 3.x serving multiple data engineering
teams in a shared platform model. We support over 30 teams with independent DAG
repositories, provide automated onboarding via GitHub workflows, maintain
decentralized DAG ownership where each team controls their own repository, and
follow GitOps-driven configuration management principles.
### Automated Onboarding Workflow
When a new team requests access to our Airflow platform, our automated workflow
generates their configuration and writes it to a local JSON file. However, to
apply these changes, we must either restart Airflow services or manually
convert the JSON to a string and update the configuration file.
```mermaid
flowchart TD
A[Team Creates Onboarding Issue] --> B[GitHub Workflow Triggered]
B --> C[Create Team Repository]
C --> D[Generate teams/team-name.yaml]
D --> E[Create Pull Request]
E --> F[Platform Team Reviews PR]
F --> G[PR Merged to Main]
G --> H[GitHub Actions: Generate Config]
H --> I[Write dag_bundle_config.json]
I --> J[Manual: Restart Airflow]
J --> K[Team Onboarded]
style J fill:#ffcccc
```
**Red box indicates manual intervention that breaks automation.**
### Current Pain Point
Each time the configuration changes, operators must either:
1. Restart all Airflow services (schedulers, webservers), or
2. Manually convert the JSON file to a string and update `airflow.cfg`
This introduces operational overhead, service interruptions, inability to add
teams without downtime, and configuration management complexity.
### What We Want
**Ideal Configuration:**
```ini
[dag_processor]
# Point to a local JSON file
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
# Enable automatic reloading when file changes
dag_bundle_config_list_watch = True
```
**Expected Behavior:**
1. Airflow scheduler starts and loads configuration from the JSON file
2. Watches the file for changes when `dag_bundle_config_list_watch = True`
3. Automatically reloads DagBundles when the file is updated
4. No service restart required for configuration changes
5. Simple, file-based approach suitable for most deployments
---
## Proposed Solution
### Enhanced `dag_bundle_config_list` Parameter
Extend the existing `dag_bundle_config_list` configuration parameter to support
both JSON strings and file:// URLs:
```ini
[dag_processor]
# Option 1: Traditional JSON string (current behavior, unchanged)
dag_bundle_config_list = [{"name": "team1", "classpath": "...", "kwargs": {...}}]
# Option 2: Local file reference (new)
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
# Option 3: Relative path (new)
dag_bundle_config_list = file://config/dag_bundle_config.json
```
### New Configuration Parameter: `dag_bundle_config_list_watch`
Add a new parameter to enable automatic file watching and reloading:
```ini
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
# Default: False
dag_bundle_config_list_watch = True
```
When `dag_bundle_config_list_watch = True`, Airflow monitors the file for
changes and automatically reloads the DagBundle configuration without requiring
a service restart.
### Future Possibilities
While this proposal focuses on local file support, the same
`dag_bundle_config_list` parameter could be extended in the future to support:
- **HTTP/HTTPS URLs**: `dag_bundle_config_list =
https://config-server.example.com/config.json`
- **S3 URLs**: `dag_bundle_config_list = s3://bucket/path/config.json`
- **Other protocols**: Git, GCS, Azure Blob Storage, etc.
The `file://` protocol provides immediate value with minimal complexity, while
establishing a pattern for future enhancements.
### Implementation Requirements
#### 1. File Protocol Support
| Format | Example | Description |
|--------|---------|-------------|
| Absolute path | `file:///etc/airflow/config/dag_bundle_config.json` | Absolute filesystem path |
| Relative path | `file://config/dag_bundle_config.json` | Relative to `$AIRFLOW_HOME` |
| JSON string | `[{"name": "...", ...}]` | Current behavior (unchanged) |
#### 2. Loading Behavior
**On Scheduler Start:**
1. Check if `dag_bundle_config_list` starts with `file://`
2. If yes, read and parse the JSON file
3. If no, treat as JSON string (current behavior)
4. Validate configuration against DagBundle schema
5. Load all bundles defined in configuration
6. If file read fails, log error and start with no bundles
7. If JSON parsing fails, log error and start with no bundles
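The startup sequence above could be sketched as follows. The `load_bundle_config` helper, its signature, and the `airflow_home` default are illustrative assumptions for this proposal, not existing Airflow code:

```python
import json
import os


def load_bundle_config(value: str, airflow_home: str = "/opt/airflow") -> list:
    """Load DagBundle configs from a JSON string or a file:// URL.

    Hypothetical helper illustrating the proposed detection and
    fail-safe fallback logic; not Airflow's actual implementation.
    """
    if value.startswith("file://"):
        path = value[len("file://"):]
        if not os.path.isabs(path):
            # Proposed behavior: relative paths resolve under $AIRFLOW_HOME
            path = os.path.join(airflow_home, path)
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError) as exc:
            # Fail-safe: log the error and start with no bundles
            print(f"[ERROR] Failed to load bundle config from {path}: {exc}")
            return []
    # No file:// prefix: treat the value as an inline JSON string (current behavior)
    return json.loads(value)
```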
**File Watching (when `dag_bundle_config_list_watch = True`):**
1. Use filesystem watching (e.g., `inotify` on Linux, `FSEvents` on macOS)
2. Monitor the file for modifications
3. When file changes detected, reload configuration
4. Compare new configuration with current state
5. Add new bundles, remove deleted bundles
6. Log all configuration changes
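The watch loop above can be sketched with a portable mtime-polling watcher. This is a stand-in used for illustration; a production implementation would more likely use the `watchdog` library (inotify/FSEvents) with debounced events:

```python
import os
import threading


class ConfigFileWatcher:
    """Poll a file's mtime and invoke a callback when it changes.

    Minimal, portable sketch of the proposed watcher; class and method
    names are assumptions, not existing Airflow APIs.
    """

    def __init__(self, path, on_change, interval=1.0):
        self.path = path
        self.on_change = on_change
        self.interval = interval
        self._last_mtime = self._mtime()
        self._stop = threading.Event()

    def _mtime(self):
        try:
            return os.path.getmtime(self.path)
        except OSError:
            return None  # file missing or unreadable: treat as "no change"

    def poll_once(self) -> bool:
        """Check once; returns True if a change was detected and handled."""
        mtime = self._mtime()
        if mtime is not None and mtime != self._last_mtime:
            self._last_mtime = mtime
            self.on_change(self.path)
            return True
        return False

    def run(self):
        # Background-thread entry point: poll until stop() is called
        while not self._stop.wait(self.interval):
            self.poll_once()

    def stop(self):
        self._stop.set()
```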
#### 3. Backward Compatibility
- **Existing behavior preserved**: JSON string format continues to work exactly
as before
- **Automatic detection**: If value starts with `file://`, treat as file path;
otherwise treat as JSON string
- **No breaking changes**: Existing deployments work without modification
- **Opt-in file watching**: `dag_bundle_config_list_watch` defaults to `False`
#### 4. Configuration Validation
- Validate JSON file against DagBundle schema
- Log validation errors with clear messages and file location
- Reject invalid configurations (fail-safe behavior)
- Support schema version compatibility
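A hand-rolled sketch of such a check is shown below. The required keys and rules are assumptions inferred from the examples in this proposal; a real implementation might validate against a formal JSON Schema instead:

```python
REQUIRED_KEYS = {"name", "classpath"}


def validate_bundle_config(config) -> list:
    """Return a list of human-readable validation errors (empty if valid).

    Illustrative sketch only; the rule set here is an assumption, not
    Airflow's actual DagBundle schema.
    """
    errors = []
    if not isinstance(config, list):
        return ["top-level value must be a JSON array of bundle objects"]
    seen = set()
    for i, bundle in enumerate(config):
        if not isinstance(bundle, dict):
            errors.append(f"entry {i}: must be an object")
            continue
        missing = REQUIRED_KEYS - bundle.keys()
        if missing:
            errors.append(f"entry {i}: missing keys {sorted(missing)}")
        name = bundle.get("name")
        if name in seen:
            errors.append(f"entry {i}: duplicate bundle name {name!r}")
        seen.add(name)
        if "kwargs" in bundle and not isinstance(bundle["kwargs"], dict):
            errors.append(f"entry {i}: 'kwargs' must be an object")
    return errors
```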
#### 5. Error Handling
| Scenario | Behavior |
|----------|----------|
| File not found | Log error, start with no bundles |
| File not readable | Log permission error, start with no bundles |
| Invalid JSON | Log parse error with line/column, start with no bundles |
| Schema validation fails | Log validation errors, reject configuration |
| File watch fails | Log warning, disable watching, continue with current config |
#### 6. Observability
**Logging:**
```
[INFO] Loading DagBundle configuration from file:///etc/airflow/config/dag_bundle_config.json
[INFO] Successfully loaded 15 DagBundle configurations
[INFO] File watch enabled for /etc/airflow/config/dag_bundle_config.json
[INFO] Configuration file changed, reloading...
[INFO] Detected configuration change: 1 bundle added, 0 removed
[WARN] File watch not supported on this platform, disabling automatic reload
[ERROR] Failed to read configuration file: Permission denied
[ERROR] Invalid JSON in configuration file at line 23, column 5: unexpected token
```
---
## Benefits
### Simplified Configuration Management
Clean separation of concerns by keeping configuration in separate JSON files
rather than embedded in INI files. This provides easier readability and
maintenance of large configurations, better version control with clear diffs
showing exactly what changed, and the ability to use standard JSON tools for
validation and formatting.
### Zero-Downtime Updates
With file watching enabled, teams can be added or removed without restarting
Airflow services. Configuration changes take effect automatically within
seconds, eliminating service interruptions and enabling continuous operations.
### GitOps-Friendly
This approach supports configuration as code stored in Git repositories with
all changes tracked through pull requests. It enables automated deployment
pipelines and provides self-loading configuration without manual intervention.
### Developer Experience
Developers benefit from standard JSON format that's easier to edit than
embedded strings in INI files. IDE support with syntax highlighting and
validation is readily available, and configurations can be tested and validated
before deployment.
### Scalability
This solution supports any number of teams without configuration file bloat,
makes it easy to add or remove teams dynamically, and is suitable for both
small and large deployments.
### Cloud-Native Ready
The approach is container-friendly with file mounts, ready for Kubernetes
ConfigMaps or volume mounts, and compatible with immutable infrastructure
patterns.
### Simple Implementation
The solution focuses on file:// protocol first, minimizing complexity with no
network dependencies or authentication requirements. It uses standard
filesystem operations and can be extended in the future to support remote
protocols.
---
## Example Configurations
### Example 1: Basic File Reference
```ini
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
```
**Configuration File** (`/etc/airflow/config/dag_bundle_config.json`):
```json
[
{
"name": "analytics",
"classpath": "airflow.providers.git.bundles.git.GitDagBundle",
"kwargs": {
"repo_url": "https://github.com/org/airflow-teams-analytics-prod",
"tracking_ref": "main",
"refresh_interval": 60,
"subdir": "dags"
}
},
{
"name": "finance",
"classpath": "airflow.providers.git.bundles.git.GitDagBundle",
"kwargs": {
"repo_url": "https://github.com/org/airflow-teams-finance-prod",
"tracking_ref": "main",
"refresh_interval": 60,
"subdir": "dags"
}
}
]
```
### Example 2: With File Watching Enabled
```ini
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
dag_bundle_config_list_watch = True
```
Now when you update `/etc/airflow/config/dag_bundle_config.json`, changes are
automatically detected and applied without restarting Airflow.
### Example 3: Relative Path
```ini
[dag_processor]
# Relative to $AIRFLOW_HOME
dag_bundle_config_list = file://config/dag_bundle_config.json
dag_bundle_config_list_watch = True
```
### Example 4: Traditional JSON String (Unchanged)
```ini
[dag_processor]
# Current behavior still works
dag_bundle_config_list = [{"name": "team1", "classpath": "airflow.providers.git.bundles.git.GitDagBundle", "kwargs": {"repo_url": "https://github.com/org/team1-dags"}}]
```
### Example 5: Environment Variable
```bash
export AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST="file:///etc/airflow/config/dag_bundle_config.json"
export AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST_WATCH="True"
```
### Example 6: Kubernetes ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: airflow-dag-bundles
data:
dag_bundle_config.json: |
[
{
"name": "analytics",
"classpath": "airflow.providers.git.bundles.git.GitDagBundle",
"kwargs": {
"repo_url": "https://github.com/org/analytics-dags",
"tracking_ref": "main",
"refresh_interval": 60,
"subdir": "dags"
}
}
]
---
# Mount ConfigMap as volume
spec:
volumes:
- name: dag-bundle-config
configMap:
name: airflow-dag-bundles
containers:
- name: scheduler
volumeMounts:
- name: dag-bundle-config
mountPath: /etc/airflow/config
readOnly: true
```
**airflow.cfg:**
```ini
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
dag_bundle_config_list_watch = True
```
---
## Alternative Workarounds (Current State)
While waiting for this feature, organizations implement various workarounds:
### Workaround 1: Startup Script with JSON Conversion
```bash
#!/bin/bash
# /usr/local/bin/airflow-with-config.sh
# Read JSON file and convert to string
export AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST=$(jq -c . /etc/airflow/config/dag_bundle_config.json)
# Start Airflow
exec airflow scheduler
```
**Issues**: Requires wrapper scripts, not portable, configuration not
reloadable without restart
### Workaround 2: Config File Template with JSON Embedding
```ini
# airflow.cfg.template
[dag_processor]
dag_bundle_config_list = {{DAG_BUNDLE_CONFIG}}
```
**Deployment script:**
```bash
# Read JSON and inject into template
CONFIG=$(cat /etc/airflow/config/dag_bundle_config.json | jq -c .)
sed "s|{{DAG_BUNDLE_CONFIG}}|${CONFIG}|g" airflow.cfg.template > airflow.cfg
```
**Issues**: Templating complexity, escaping problems, manual deployment steps
### Workaround 3: Manual airflow.cfg Editing
Manually copy-paste JSON content into `airflow.cfg`:
```ini
[dag_processor]
dag_bundle_config_list = [{"name":"team1","classpath":"..."},{"name":"team2","classpath":"..."}]
```
**Issues**: Error-prone, hard to maintain, difficult to track changes, requires
restart
**All workarounds share common problems**: They are fragile and error-prone,
require custom automation or manual intervention, lack official support, are
difficult to maintain, and always require service restarts for changes.
---
## Implementation Proposal
### Phase 1: File Protocol Support (Initial Implementation)
The initial implementation focuses on the `file://` protocol for simplicity and
immediate value:
1. **Detect `file://` prefix** in `dag_bundle_config_list` parameter
2. **Parse file path** (supporting both absolute and relative paths)
3. **Read and parse JSON** from the specified file
4. **Validate configuration** against DagBundle schema
5. **Load bundles** as defined in the configuration
6. **Fall back gracefully** if file is missing or invalid
This requires extending the configuration parser in the `dag_processor` module
to detect and handle `file://` URLs differently from JSON strings.
### Phase 2: File Watching (Automatic Reload)
Add support for the `dag_bundle_config_list_watch` parameter:
1. **Use platform-specific file watching** (e.g., `watchdog` library for Python)
2. **Monitor file for modifications** when watch is enabled
3. **Reload configuration** when changes are detected
4. **Apply changes incrementally** (add new bundles, remove deleted ones)
5. **Log all configuration changes** for auditability
This enables zero-downtime updates and eliminates the need for service restarts
when adding or removing teams.
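Step 4 above (incremental apply) essentially reduces to a set diff keyed on bundle name. A minimal sketch, assuming `name` is each bundle's identity (the helper name is hypothetical):

```python
def diff_bundle_configs(current: list, new: list):
    """Return (added, removed, changed) bundle names between two configs.

    Sketch only; assumes each bundle dict has a unique 'name' key that
    serves as its identity.
    """
    cur = {b["name"]: b for b in current}
    nxt = {b["name"]: b for b in new}
    added = sorted(nxt.keys() - cur.keys())
    removed = sorted(cur.keys() - nxt.keys())
    # Same name but different settings: the bundle should be reloaded in place
    changed = sorted(n for n in cur.keys() & nxt.keys() if cur[n] != nxt[n])
    return added, removed, changed
```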
### Configuration Parameters
Add to the `[dag_processor]` section:
```ini
[dag_processor]
# Existing parameter (enhanced to support file:// URLs)
dag_bundle_config_list = file:///path/to/config.json
# New parameter (opt-in file watching; default: False)
dag_bundle_config_list_watch = False
```
### Future Enhancements (Out of Scope for Initial Implementation)
Once the `file://` protocol and watching mechanism are proven, the same pattern
can be extended to support remote protocols:
- **HTTP/HTTPS URLs**: Fetch configuration from web servers
- **S3 URLs**: Fetch from AWS S3 buckets
- **Other cloud storage**: GCS, Azure Blob Storage, etc.
These would use the same `dag_bundle_config_list` parameter with different URL
schemes and could reuse the file watching pattern with periodic polling.
---
## Technical Considerations
### 1. File Path Resolution
```
1. If starts with file://, extract path
2. If path is absolute (starts with /), use as-is
3. If path is relative, resolve from $AIRFLOW_HOME
4. Validate file exists and is readable
```
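In Python these steps might look like the following. Note that `urllib.parse` would treat the first segment of a relative URL such as `file://config/x.json` as a host, so a plain prefix strip is assumed here; the helper name is illustrative:

```python
from pathlib import Path
from typing import Optional


def resolve_config_path(value: str, airflow_home: str) -> Optional[Path]:
    """Resolve a dag_bundle_config_list value to a filesystem path.

    Returns None when the value has no file:// prefix, signalling that
    it should be parsed as an inline JSON string instead. Sketch only.
    """
    if not value.startswith("file://"):
        return None
    raw = value[len("file://"):]
    path = Path(raw)
    if not path.is_absolute():
        # Proposed behavior: relative paths resolve from $AIRFLOW_HOME
        path = Path(airflow_home) / path
    return path
```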
### 2. File Watching Implementation
Use the `watchdog` Python library or platform-specific mechanisms:
- **Linux**: inotify
- **macOS**: FSEvents
- **Windows**: ReadDirectoryChangesW
Debounce file change events to avoid multiple reloads during atomic writes.
### 3. Thread Safety
- Ensure thread-safe configuration updates
- Use locks during configuration reload
- Avoid race conditions during bundle registration/deregistration
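A sketch of the lock-guarded swap this implies (the class is hypothetical; a real implementation would live wherever the bundle registry is managed):

```python
import threading


class BundleConfigHolder:
    """Thread-safe holder for the active bundle configuration.

    The watcher thread swaps in a new config with replace() while
    scheduler threads read a consistent snapshot with get().
    Illustrative sketch only.
    """

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._config = list(initial or [])

    def get(self) -> list:
        with self._lock:
            # Return a copy so callers cannot mutate shared state
            return list(self._config)

    def replace(self, new_config: list) -> None:
        with self._lock:
            self._config = list(new_config)
```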
### 4. Performance
- Read file synchronously on startup (blocking)
- Watch file asynchronously in background thread (non-blocking)
- Cache parsed JSON to avoid repeated parsing
- Debounce file change events (e.g., 1-second delay)
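The debounce step can be isolated from the watcher itself. A minimal sketch with an injectable clock so the behavior is testable without sleeping (the class and its `flush()` contract are assumptions for illustration):

```python
import time


class Debouncer:
    """Coalesce bursts of file-change events into a single reload.

    Each event pushes the deadline `delay` seconds into the future, so
    the callback fires only once the burst has gone quiet. flush()
    would be driven periodically by the watcher's loop.
    """

    def __init__(self, delay: float, callback, clock=time.monotonic):
        self.delay = delay
        self.callback = callback
        self.clock = clock
        self._deadline = None

    def event(self) -> None:
        # Every new change event resets the quiet-period deadline
        self._deadline = self.clock() + self.delay

    def flush(self) -> bool:
        """Fire the callback if the quiet period has elapsed."""
        if self._deadline is not None and self.clock() >= self._deadline:
            self._deadline = None
            self.callback()
            return True
        return False
```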
### 5. Testing
**Unit Tests:**
- Test file path parsing and resolution
- Test JSON parsing and validation
- Test file watching mechanism
- Test error handling for missing/invalid files
**Integration Tests:**
- Test scheduler startup with file:// config
- Test configuration reload when file changes
- Test fallback behavior when file is invalid
- Test with relative and absolute paths
**End-to-End Tests:**
- Test with real scheduler and multiple bundles
- Test adding/removing bundles dynamically
- Test with Kubernetes ConfigMap mounts
---
## Security Implications
### 1. File Permissions
**Best Practices:**
- Configuration files should be readable by the Airflow user
- Recommended permissions: `0644` or `0640` (read-only for Airflow user)
- Files should be owned by a trusted user (e.g., `root` or `airflow`)
- Avoid world-writable configuration files
**Example:**
```bash
chown airflow:airflow /etc/airflow/config/dag_bundle_config.json
chmod 0640 /etc/airflow/config/dag_bundle_config.json
```
### 2. Configuration Validation
- Validate JSON schema before applying
- Reject configurations with suspicious repository URLs
- Implement allowlist/blocklist for repository domains (if needed)
- Sanitize configuration values
### 3. Audit Logging
- Log when configuration file is loaded
- Log configuration changes detected
- Log file read errors and validation failures
- Include timestamps and file paths in logs
### 4. Access Control
- Restrict write access to configuration directory
- Use filesystem ACLs to control access
- In Kubernetes, use ReadOnlyRootFilesystem where appropriate
- Mount configuration as read-only volume when possible
---
## Migration Path for Existing Users
### Step 1: Create JSON Configuration File
```bash
# If you have existing configuration, extract it
# (Assume current config is a JSON string in airflow.cfg)
# Create configuration directory
mkdir -p /etc/airflow/config
# Create JSON file
cat > /etc/airflow/config/dag_bundle_config.json <<'EOF'
[
{
"name": "team1",
"classpath": "airflow.providers.git.bundles.git.GitDagBundle",
"kwargs": {
"repo_url": "https://github.com/org/team1-dags",
"tracking_ref": "main",
"refresh_interval": 60,
"subdir": "dags"
}
}
]
EOF
# Set appropriate permissions
chown airflow:airflow /etc/airflow/config/dag_bundle_config.json
chmod 0640 /etc/airflow/config/dag_bundle_config.json
```
### Step 2: Update Airflow Configuration
```ini
# Old configuration (comment out or remove)
# [dag_processor]
# dag_bundle_config_list = [{"name": "team1", ...}]
# New configuration
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
```
### Step 3: Test and Validate
```bash
# Restart scheduler
systemctl restart airflow-scheduler
# Verify bundles loaded
airflow dags list
# Check logs for successful loading
tail -f $AIRFLOW_HOME/logs/scheduler/latest/*.log | grep "dag_bundle"
```
### Step 4: Enable File Watching (Optional)
```ini
[dag_processor]
dag_bundle_config_list = file:///etc/airflow/config/dag_bundle_config.json
dag_bundle_config_list_watch = True
```
Now you can add teams by simply editing the JSON file - no restart required!
---
## Real-World Impact
### Our Organization
**Before this feature:**
Team onboarding takes 2-4 hours due to manual configuration conversion and
restart. Platform team involvement is high, requiring manual intervention. The
error rate is approximately 10% due to JSON escaping and formatting issues.
Service interruptions occur with every configuration change.
**With this feature:**
Team onboarding would be reduced to 10-15 minutes with simple file updates.
Platform team involvement would be minimal, limited to PR review only. Error
rates would drop to less than 1% with standard JSON editing. Zero downtime for
configuration changes with file watching enabled.
**Scale:**
Our deployment currently serves 30 or more onboarded teams managing 150 or more
DAG repositories containing over 500 active DAGs across 5 Airflow environments
including development, QA, staging, production, and disaster recovery.
### Community Benefits
This feature would benefit:
- **Large organizations** with multiple teams using Airflow
- **Platform teams** managing Airflow as a service
- **Cloud-native deployments** on Kubernetes with ConfigMaps
- **GitOps practitioners** seeking infrastructure as code
- **Development teams** wanting faster iteration cycles
- **Any deployment** with multiple teams or frequent configuration changes
---
## Related Airflow Features
This proposal complements existing Airflow 3.x features:
- **DagBundles**: Already support dynamic DAG loading from Git repositories
- **Configuration System**: Already supports `file://` URLs in some parameters
- **Dynamic DAG Loading**: Already loads DAGs dynamically from bundles
The missing piece this proposal addresses is:
- **File-based bundle configuration** with the `file://` protocol
- **Automatic reloading** of configuration changes without restarts
This is a natural extension of Airflow's existing configuration system and
DagBundle architecture.
---
## Questions for Discussion
1. **File Watching**: Should file watching be enabled by default, or opt-in via
`dag_bundle_config_list_watch`?
2. **Backward Compatibility**: Are there any concerns with auto-detecting
`file://` prefix?
3. **Configuration Format**: Should we support formats other than JSON (YAML,
TOML) in the future?
4. **Relative Paths**: Should relative paths be resolved from `$AIRFLOW_HOME`
or current working directory?
5. **Future Protocols**: After `file://` is stable, which remote protocol
should be prioritized next (HTTP, S3, Git)?
6. **Error Handling**: Should invalid configuration files prevent scheduler
startup or just log warnings?
7. **Watch Debouncing**: What's an appropriate debounce delay for file change
events (1 second, 5 seconds)?
---
## Conclusion
Apache Airflow 3.x's DagBundle feature is powerful but limited by requiring
JSON strings embedded in configuration files. Adding support for file-based
configuration loading via the `file://` protocol would:
- **Simplify configuration management** by separating concerns
- **Enable zero-downtime updates** with file watching
- **Improve developer experience** with standard JSON files
- **Support GitOps workflows** with version-controlled configurations
- **Provide a foundation** for future remote protocol support
This feature represents a simple, practical improvement that addresses real
operational pain points while establishing patterns for future enhancements
(HTTP, S3, etc.).
We believe this feature would be valuable to the entire Airflow community and
are willing to contribute to its implementation.
---
**Date**: 2025-12-25
**Target**: Apache Airflow 3.x
GitHub link: https://github.com/apache/airflow/discussions/59799