(airflow) branch fix-registry-incremental-provider-flag updated: Fix incremental builds overwriting provider connection/parameter data on S3

kaxilnaik Tue, 17 Mar 2026 10:42:50 -0700

This is an automated email from the ASF dual-hosted git repository.

kaxilnaik pushed a commit to branch fix-registry-incremental-provider-flag
in repository https://gitbox.apache.org/repos/asf/airflow.git



The following commit(s) were added to refs/heads/fix-registry-incremental-provider-flag by this push:
     new 35b9324b885 Fix incremental builds overwriting provider connection/parameter data on S3
35b9324b885 is described below

commit 35b9324b885bd3082521dcc9a2b80479bd2efb6f
Author: Kaxil Naik <[email protected]>
AuthorDate: Tue Mar 17 17:42:06 2026 +0000

    Fix incremental builds overwriting provider connection/parameter data on S3
    
    Eleventy pagination templates emit empty fallback JSON for every provider,
    even when only one provider's data was extracted.  A plain `aws s3 sync`
    uploads those stubs and overwrites real connection/parameter data.
    
    Changes:
    - Exclude per-provider connections.json and parameters.json from the main
      S3 sync during incremental builds, then selectively upload only the
      target provider's API files
    - Filter connections early in extract_connections.py (before the loop)
      and support space-separated multi-provider IDs
    - Disable telemetry in CI by setting SCARF_ANALYTICS=false and DO_NOT_TRACK=1
    - Document the Eleventy pagination limitation in README and AGENTS.md
---
 .github/workflows/registry-build.yml | 32 +++++++++++++++++++++++++++++++-
 dev/registry/extract_connections.py  | 18 ++++++++++--------
 registry/AGENTS.md                   |  5 ++++-
 registry/README.md                   | 20 +++++++++++++++++---
 4 files changed, 62 insertions(+), 13 deletions(-)
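
Note: the provider filtering described in the commit message can be sketched in isolation as below. This is a minimal illustration, not the code from `extract_connections.py`: the function names `parse_provider_filter` and `filter_hooks`, and the flat `connection type -> provider ID` mapping, are assumptions made for the example. It shows a space-separated `--provider` value being turned into a set once, before the loop, with an empty or missing value meaning "no filter".

```python
from __future__ import annotations


def parse_provider_filter(raw: str | None) -> set[str] | None:
    """Turn a space-separated string of provider IDs into a set, or None for 'no filter'."""
    if not raw:
        return None
    # str.split() with no arguments splits on any whitespace run and drops empty tokens.
    ids = set(raw.split())
    return ids or None


def filter_hooks(hooks: dict[str, str], provider_filter: set[str] | None) -> dict[str, str]:
    """Keep only connection types whose provider ID passes the filter.

    `hooks` maps a connection type to a provider ID; this flat shape is a
    simplification for illustration, not the structure used in the script.
    """
    if provider_filter is None:
        return dict(hooks)
    return {conn: pid for conn, pid in hooks.items() if pid in provider_filter}


if __name__ == "__main__":
    # Hypothetical sample data.
    hooks = {"aws": "amazon", "s3": "amazon", "postgres": "postgres", "file": "common-io"}
    selected = parse_provider_filter("amazon  common-io")
    print(sorted(filter_hooks(hooks, selected)))
```

Filtering before the per-connection work, rather than after it, means the summary counts reflect only the selected providers and no field data is built for providers that will be discarded.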

diff --git a/.github/workflows/registry-build.yml b/.github/workflows/registry-build.yml
index 94b2a4bb9d4..9fc61a9d70c 100644
--- a/.github/workflows/registry-build.yml
+++ b/.github/workflows/registry-build.yml
@@ -98,6 +98,8 @@ jobs:
     needs: [build-ci-image]
     runs-on: ubuntu-latest
     env:
+      SCARF_ANALYTICS: "false"
+      DO_NOT_TRACK: "1"
       EXISTING_REGISTRY_DIR: /tmp/existing-registry
       REGISTRY_DATA_DIR: dev/registry
       REGISTRY_PROVIDERS_JSON: providers.json
@@ -247,10 +249,38 @@ jobs:
       - name: "Sync registry to S3"
         env:
           S3_BUCKET: ${{ steps.destination.outputs.bucket }}
+          PROVIDER: ${{ inputs.provider }}
         run: |
+          # Incremental builds only extract connections/parameters for the
+          # target provider(s).  The Eleventy site build still emits empty
+          # stub JSON for every other provider.  Uploading those stubs would
+          # overwrite real data on S3, so we exclude per-provider API JSON
+          # from the main sync and selectively upload only the target
+          # provider's files afterward.
+          EXCLUDE_PROVIDER_API=()
+          if [[ -n "${PROVIDER}" ]]; then
+            EXCLUDE_PROVIDER_API=(
+              --exclude "api/providers/*/connections.json"
+              --exclude "api/providers/*/parameters.json"
+              --exclude "api/providers/*/*/connections.json"
+              --exclude "api/providers/*/*/parameters.json"
+            )
+          fi
+
           aws s3 sync registry/_site/ "${S3_BUCKET}" \
             --cache-control "${REGISTRY_CACHE_CONTROL}" \
-            --exclude "pagefind/*"
+            --exclude "pagefind/*" \
+            "${EXCLUDE_PROVIDER_API[@]}"
+
+          # For incremental builds, sync only the updated provider's API files.
+          if [[ -n "${PROVIDER}" ]]; then
+            for pid in ${PROVIDER}; do
+              aws s3 sync "registry/_site/api/providers/${pid}/" \
+                "${S3_BUCKET}api/providers/${pid}/" \
+                --cache-control "${REGISTRY_CACHE_CONTROL}"
+            done
+          fi
+
           # Pagefind generates content-hashed filenames (e.g. en_181da6f.pf_index).
           # Each rebuild produces new hashes, so --delete is needed to remove stale
           # index files. This is separate from the main sync which intentionally
diff --git a/dev/registry/extract_connections.py b/dev/registry/extract_connections.py
index ce81cc43a27..ebb269193c3 100644
--- a/dev/registry/extract_connections.py
+++ b/dev/registry/extract_connections.py
@@ -162,7 +162,7 @@ def main():
     parser.add_argument(
         "--provider",
         default=None,
-        help="Only output connections for this provider ID (e.g. 'amazon').",
+        help="Only output connections for these provider ID(s) (space-separated, e.g. 'amazon common-io').",
     )
     parser.add_argument(
         "--providers-json",
@@ -212,12 +212,21 @@ def main():
     total_with_custom = 0
     total_with_ui = 0
 
+    # Parse space-separated provider filter (matches extract_metadata.py behaviour)
+    provider_filter: set[str] | None = None
+    if args.provider:
+        provider_filter = {pid.strip() for pid in args.provider.split() if pid.strip()}
+        print(f"Filtering to provider(s): {', '.join(sorted(provider_filter))}")
+
     for conn_type, hook_info in sorted(hooks.items()):
         if hook_info is None or not hook_info.package_name:
             continue
 
         provider_id = package_name_to_provider_id(hook_info.package_name)
 
+        if provider_filter and provider_id not in provider_filter:
+            continue
+
         standard_fields = build_standard_fields(field_behaviours.get(conn_type))
         custom_fields = build_custom_fields(form_widgets, conn_type)
 
@@ -244,13 +253,6 @@ def main():
     print(f"  {total_with_custom} have custom fields")
     print(f"  {total_with_ui} have UI field customisation")
 
-    # Filter to single provider if requested
-    if args.provider:
-        provider_connections = {
-            pid: conns for pid, conns in provider_connections.items() if pid == args.provider
-        }
-        print(f"Filtering output to provider: {args.provider}")
-
     # Write per-provider files to versions/{pid}/{version}/connections.json
     for output_dir in OUTPUT_DIRS:
         if not output_dir.parent.exists():
diff --git a/registry/AGENTS.md b/registry/AGENTS.md
index fc7ce31fb0f..b6416ec0244 100644
--- a/registry/AGENTS.md
+++ b/registry/AGENTS.md
@@ -404,7 +404,10 @@ The registry is built in the `apache/airflow` repo and served at `airflow.apache
    Supports two modes:
    - **Full build** (no `provider` input): extracts all ~99 providers (~12 min)
    - **Incremental build** (`provider=amazon`): extracts one provider (~30s), merges
-     with existing data from S3 via `merge_registry_data.py`, then builds the full site
+     with existing data from S3 via `merge_registry_data.py`, then builds the full site.
+     The S3 sync step excludes per-provider `connections.json` and `parameters.json`
+     for non-target providers to avoid overwriting real data with Eleventy's empty
+     fallback stubs (Eleventy 3.x `permalink: false` does not work with pagination).
 2. **S3 buckets**: `{live|staging}-docs-airflow-apache-org/registry/` (same bucket as docs, different prefix)
 3. **Serving**: Apache HTTPD at `airflow.apache.org` rewrites `/registry/*` to CloudFront, which serves from S3
 4. **Auto-trigger**: When `publish-docs-to-s3.yml` publishes provider docs, its
diff --git a/registry/README.md b/registry/README.md
index 2d0d354c59f..6b4df5b78eb 100644
--- a/registry/README.md
+++ b/registry/README.md
@@ -327,10 +327,16 @@ it triggers `registry-build.yml` with the provider ID. The incremental flow:
    metadata and PyPI stats; `extract_parameters.py` discovers modules for only the
    specified provider.
 3. **Merge** — `merge_registry_data.py` replaces the updated provider's entries in
-   the downloaded JSON while keeping all other providers intact.
+   the downloaded JSON while keeping all other providers intact. Only global files
+   (`providers.json`, `modules.json`) are merged — per-version files like
+   `connections.json` and `parameters.json` are not downloaded from S3.
 4. **Build site** — Eleventy builds all pages from the merged data; Pagefind indexes
-   all records.
-5. **S3 sync** — only changed pages are uploaded (S3 sync diffs).
+   all records. Because per-version data only exists for the target provider, Eleventy
+   emits empty fallback JSON for other providers' `connections.json` and
+   `parameters.json` API endpoints (see **Known limitation** below).
+5. **S3 sync (selective)** — the main sync excludes per-provider `connections.json`
+   and `parameters.json` to avoid overwriting real data with empty stubs. A second
+   sync uploads only the target provider's API files.
 6. **Publish versions** — `publish_versions.py` updates `api/providers/{id}/versions.json`.
 
 The merge script (`dev/registry/merge_registry_data.py`) handles edge cases:
@@ -338,6 +344,14 @@ The merge script (`dev/registry/merge_registry_data.py`) handles edge cases:
 - First deploy (no existing data on S3): uses the single-provider output as-is.
 - Missing modules file: treated as empty.
 
+**Known limitation**: Eleventy's pagination templates generate API files for every
+provider in `providers.json`, even when per-version data (connections, parameters) only
+exists for the target provider. The templates emit empty fallback JSON
+(`{"connection_types":[]}`) for providers without data. The S3 sync step works around
+this with `--exclude` patterns during incremental builds. A proper template-level fix
+(skipping file generation) is tracked as a follow-up — `permalink: false` does not work
+with Eleventy 3.x pagination templates.
+
 To run an incremental build locally:
 
 ```bash
