kaxil opened a new pull request, #62261: URL: https://github.com/apache/airflow/pull/62261
Devlist Discussion: https://lists.apache.org/thread/7n4pklzcc4lxtxsy9g69ssffg9qbdyvb

A static-site provider registry for discovering and browsing Airflow providers and their modules. Deployed at `airflow.apache.org/registry/` alongside the existing docs infrastructure (S3 + CloudFront).

Staging preview: https://airflow.staged.apache.org/registry/

## What it does

The registry indexes all 99 official providers and 840 modules (operators, hooks, sensors, triggers, transfers, bundles, notifiers, secrets backends, log handlers, executors) from the existing `providers/*/provider.yaml` files and source code in this repo. No external data sources beyond PyPI download stats.

**Pages:**

- **Homepage**: search bar (Cmd+K), stats counters, featured and new providers
- **Providers listing**: filterable by lifecycle stage (stable/incubation/deprecated), category, and sort order (downloads, name, recently updated)
- **Provider detail**: module counts by type, install command with extras/version selection, dependency info, connection builder, and a tabbed module browser with category sidebar and per-module search
- **Explore by Category**: providers grouped into Cloud, Databases, Data Warehouses, Messaging, AI/ML, Data Processing, etc.
- **Statistics**: module type distribution, lifecycle breakdown, top providers by downloads and module count
- **JSON API**: `/api/providers.json`, `/api/modules.json`, per-provider endpoints for modules, parameters, and connections

**Connection Builder:** pick a connection type (e.g. `aws`, `redshift`), fill in the form fields with placeholders and sensitivity markers, and export as URI, JSON, or environment variable format. Fields are extracted from provider.yaml connection metadata.
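To make the three export formats concrete, here is a minimal Python sketch of what such an export could look like. It is not the registry's client-side code; the helper names `build_uri`, `build_json`, and `env_var_name` are hypothetical, and the sketch assumes Airflow's documented conventions of a `conn_type://login:password@host:port/schema` URI and an `AIRFLOW_CONN_<CONN_ID>` environment variable.

```python
import json
from urllib.parse import quote, urlencode


def build_uri(conn_type, host="", login="", password="", port=None, schema="", extra=None):
    """Assemble a connection URI, percent-encoding every user-supplied part."""
    uri = f"{conn_type}://"
    if login or password:
        uri += quote(login, safe="")
        if password:
            uri += ":" + quote(password, safe="")
        uri += "@"
    uri += quote(host, safe="")
    if port is not None:
        uri += f":{port}"
    if schema:
        uri += "/" + quote(schema, safe="")
    if extra:
        # Extras are passed as query parameters in the URI form.
        uri += "?" + urlencode(extra)
    return uri


def build_json(conn_type, **fields):
    """Serialize the same fields as JSON, dropping anything left empty."""
    data = {"conn_type": conn_type}
    data.update({k: v for k, v in fields.items() if v not in (None, "", {})})
    return json.dumps(data)


def env_var_name(conn_id):
    """Name of the environment variable Airflow reads for a connection id."""
    return "AIRFLOW_CONN_" + conn_id.upper()


uri = build_uri(
    "postgres",
    host="db.example.com",
    login="user",
    password="p@ss/word",
    port=5432,
    schema="analytics",
)
print(f"{env_var_name('my_db')}='{uri}'")
```

Percent-encoding matters mostly for passwords: characters such as `@` or `/` would otherwise be read as URI delimiters.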
<!-- Upload screenshots here: homepage.png, providers-list.png, provider-detail-amazon.png, connection-builder.png, explore-categories.png, stats.png, module-browser.png, provider-informatica-incubation.png -->
<!-- Dark mode variants: homepage-dark.png, providers-list-dark.png, etc. -->

## Architecture

```
provider.yaml + source code (providers/*/)
        │
        ▼
extract_metadata.py          ← AST-parses Python files, fetches PyPI stats
        │
        ▼
registry/src/_data/
  ├── providers.json         ← 99 providers with metadata, quality scores
  ├── modules.json           ← 840 modules with import paths, docstrings
  └── search-index.json      ← Pagefind custom records
        │
        ▼
Eleventy build               ← Generates 2,740 static HTML pages
        │
        ▼
Pagefind postbuild           ← Builds search index from custom records
        │
        ▼
S3 sync + CloudFront         ← registry-build.yml workflow
```

Four Python extraction scripts run at build time:

| Script | What it does | Runs in |
|--------|-------------|---------|
| `extract_metadata.py` | Parses provider.yaml, AST-parses source for class names/docstrings, fetches PyPI stats and release dates | CI (host Python) |
| `extract_versions.py` | Reads older provider versions from git tags | CI (host Python) |
| `extract_parameters.py` | Inspects constructor signatures via runtime import | Breeze (needs provider packages installed) |
| `extract_connections.py` | Extracts connection form fields from provider.yaml + hook classes | Breeze (needs provider packages installed) |

The site itself is vanilla HTML/CSS/JS built with [Eleventy](https://www.11ty.dev/), with no React and no bundler. Search uses Pagefind (client-side, loads lazily on first search interaction). Fonts are self-hosted (Plus Jakarta Sans, JetBrains Mono).

## Design decisions worth calling out

**Why AST parsing instead of runtime import?** `extract_metadata.py` runs on the CI host without installing 100+ provider packages. It reads `.py` files and extracts class names, base classes, and docstrings from the AST. This means it works with just `pyyaml` as a dependency.
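As an illustration of the approach, the following sketch extracts class names, base classes, and docstrings with Python's standard `ast` module, without importing the code it inspects. It is not the actual `extract_metadata.py`; the `SOURCE` snippet, the `SUFFIXES` mapping, and the function names are assumptions made for the example, and the suffix check stands in for only one of the filtering criteria this PR describes.

```python
import ast

# Hypothetical provider module; the real script reads .py files from providers/*/.
SOURCE = '''
class ExampleHook(BaseHook):
    """Talk to the example service."""

class ExampleOperator(models.BaseOperator):
    """Run an example job."""

class _Helper:
    pass

class ExampleError(Exception):
    """Raised on failure."""
'''

# Assumed mapping from module type to the class-name suffix it must carry.
SUFFIXES = {"operator": "Operator", "hook": "Hook", "sensor": "Sensor"}


def base_name(node):
    """Return the rightmost name of a base-class expression, or '' if unknown."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        return node.attr
    return ""


def extract_classes(source, module_type):
    """Collect public top-level classes whose name ends with the type's suffix."""
    suffix = SUFFIXES[module_type]
    found = []
    for node in ast.parse(source).body:
        if not isinstance(node, ast.ClassDef):
            continue
        if node.name.startswith("_") or not node.name.endswith(suffix):
            continue
        found.append(
            {
                "name": node.name,
                "bases": [base_name(b) for b in node.bases],
                "docstring": ast.get_docstring(node) or "",
            }
        )
    return found


for entry in extract_classes(SOURCE, "operator"):
    print(entry["name"], entry["bases"], entry["docstring"])
```

Because nothing is imported, the sketch needs no provider dependencies installed, which is the property the paragraph above relies on.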
The trade-off: it can't resolve dynamic class definitions or runtime-computed attributes. For the 99 providers currently in the repo, AST parsing captures everything.

**Why four separate scripts?** `extract_parameters.py` and `extract_connections.py` need runtime access to provider classes (to inspect `__init__` signatures and call `get_connection_form_widgets()`). They run inside Breeze where all providers are installed. `extract_metadata.py` and `extract_versions.py` only need filesystem access and run on the host. Keeping them separate means the CI workflow can run the fast scripts (metadata) without spinning up Breeze, while parameter/connection extraction is a separate optional step.

**Why Eleventy?** Static site generators produce zero-JS pages by default. The registry works without JavaScript; filtering and search are layered on top progressively. Eleventy also has no opinion on frontend frameworks, which keeps the dependency surface small (the lockfile has ~30 packages total).

**Path prefix handling:** The site deploys at `/registry/` on airflow.apache.org but runs at `/` during local dev. Eleventy's `pathPrefix` config handles this via the `REGISTRY_PATH_PREFIX` env var. Templates use the `| url` filter, and client-side JS reads `window.__REGISTRY_BASE__` (injected in `base.njk`).

**Module filtering:** The extraction script filters classes based on type-specific suffix patterns (e.g. `Operator`, `Hook`, `Sensor` suffixes for their respective types) and base class inheritance. This avoids indexing helper classes, dataclasses, and exceptions that happen to live in operator/hook modules.

## What's NOT included (future work)

## How to test locally

```bash
# 1. Extract metadata
uv run python dev/registry/extract_metadata.py

# 2. Install Node dependencies
cd registry && pnpm install

# 3. Start dev server at http://localhost:8080
pnpm dev
```

<!-- SPDX-License-Identifier: Apache-2.0 https://www.apache.org/licenses/LICENSE-2.0 -->

<img width="1280" height="800" alt="connection-builder-dark" src="https://github.com/user-attachments/assets/7ac3eec0-ce73-483e-b92f-c4c058b48568" />
<img width="1280" height="800" alt="connection-builder" src="https://github.com/user-attachments/assets/39d15d12-624c-4cce-86a7-f7d3028a1230" />
<img width="1280" height="800" alt="explore-categories-dark" src="https://github.com/user-attachments/assets/04500e2d-dc65-4b5c-8869-fb351e5d1a91" />
<img width="1280" height="800" alt="explore-categories" src="https://github.com/user-attachments/assets/3c8c10da-6741-41b5-9da3-9eb437ae27c9" />
<img width="1280" height="800" alt="homepage-dark" src="https://github.com/user-attachments/assets/5043097f-4a15-4df1-9924-96c55ed24266" />
<img width="1280" height="800" alt="homepage" src="https://github.com/user-attachments/assets/33cea9e3-b906-4e4d-a26b-9acf2de38272" />
<img width="1280" height="800" alt="module-browser-dark" src="https://github.com/user-attachments/assets/3cbd41b0-dbf4-4456-b823-95ef32fc8a78" />
<img width="1280" height="800" alt="module-browser" src="https://github.com/user-attachments/assets/60d78c57-3a86-4658-a697-06d81b880b5b" />
<img width="1280" height="800" alt="provider-detail-amazon-dark" src="https://github.com/user-attachments/assets/c9beb13c-72de-4520-bcd3-1d30832edfcb" />
<img width="1280" height="800" alt="provider-detail-amazon" src="https://github.com/user-attachments/assets/0b9d9a0f-fbc2-4173-b96b-259b7cc8d2b4" />
<img width="1280" height="800" alt="providers-list-dark" src="https://github.com/user-attachments/assets/0e8dd3b7-aee1-4604-a97f-8d21429623d3" />
<img width="1280" height="800" alt="providers-list" src="https://github.com/user-attachments/assets/46395130-9ce9-4730-a949-97959165da14" />
<img width="1280" height="800" alt="stats-dark"
src="https://github.com/user-attachments/assets/a409f154-cac0-4520-9371-07be1deafe3c" />
<img width="1280" height="800" alt="stats" src="https://github.com/user-attachments/assets/068e5667-a121-4fb9-83e7-950c97d814a9" />

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
