kaxil opened a new pull request, #67670:
URL: https://github.com/apache/airflow/pull/67670
## The problem
The registry records **0 monthly downloads** for 37 of ~101 providers,
including ones that obviously aren't zero: amazon, http, docker, mysql, ssh.
The amazon page shows 0 where the real figure is ~9M/month; http is ~22M/month.
## Root cause
The counts come from `registry-build`, which calls
`pypistats.org/api/recent` once per package and fans those calls out ~86-wide
(`fetch_pypi_data_parallel`). pypistats rate-limits per IP, so a chunk of every
build comes back `429`. `fetch_pypi_downloads` catches the 429, logs a warning,
and returns `{"weekly": 0, "monthly": 0}`, so whichever packages lose the race
get written as zero. One wave's build log shows 20 such 429s in a single run.
Because builds merge incrementally against the previous catalog, a zero written
once sticks until that same provider is re-fetched successfully, which is why
roughly a third of the catalog is currently stuck at 0.
Tuning concurrency or adding retries doesn't address this: pypistats only
has a per-package endpoint, so anything built on it makes N calls, and N
parallel calls against one rate-limited host is the problem.
## The fix
Source the counts from ClickHouse's public PyPI dataset
(`sql-clickhouse.clickhouse.com`, the data behind clickpy.clickhouse.com,
ultimately the same PyPI download logs pypistats and BigQuery draw from). It
answers every provider in one query, so there is no burst and nothing to
rate-limit.
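
A minimal sketch of what a single bulk query could look like, not the code in this PR. The table `pypi.pypi_downloads_per_day` and its `date`, `project`, and `count` columns are assumptions modeled on the ClickPy schema, and `build_downloads_query` is a hypothetical helper name. It only builds the SQL text; sending it to the endpoint is left out.

```python
import re

# Assumed table name, modeled on the ClickPy schema; not confirmed by this PR.
TABLE = "pypi.pypi_downloads_per_day"

# Conservative whitelist for package names so they can be inlined safely.
_NAME_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def build_downloads_query(packages):
    """Return one SQL statement covering every package in `packages`."""
    if not packages:
        raise ValueError("at least one package is required")
    for name in packages:
        if not _NAME_RE.match(name):
            raise ValueError(f"unexpected package name: {name!r}")
    names = ", ".join(f"'{name}'" for name in packages)
    return (
        # Anchor both windows on the newest loaded day, not on today().
        f"WITH (SELECT max(date) FROM {TABLE}) AS anchor\n"
        "SELECT project,\n"
        "       sumIf(count, date > anchor - INTERVAL 7 DAY) AS weekly,\n"
        "       sumIf(count, date > anchor - INTERVAL 30 DAY) AS monthly\n"
        f"FROM {TABLE}\n"
        f"WHERE project IN ({names})\n"
        "  AND date > anchor - INTERVAL 30 DAY\n"
        "GROUP BY project"
    )


query = build_downloads_query(
    ["apache-airflow-providers-amazon", "apache-airflow-providers-http"]
)
```

The request count stays at one regardless of how many providers are in the `IN` list, which is the property the per-package pypistats endpoint cannot offer.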
pypistats stays as a per-package fallback for what ClickHouse can't cover: a
provider published so recently it isn't in the dataset yet, or one that
genuinely returns zero. That path now runs for a handful of packages at most
rather than all 86. A guard handles the doubly-failed case: if a build still
ends up with 0 for a provider that had a real number in the previous catalog,
it keeps the previous number instead of overwriting a known-good value with a
spurious zero.
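
The guard can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `apply_zero_guard` is a hypothetical name, the catalogs are modeled as plain dicts of `{"weekly": ..., "monthly": ...}`, and the guard is applied per field, which the description above does not specify.

```python
def apply_zero_guard(new_stats, previous_stats):
    """Keep a previous non-zero count when the new build produced 0.

    Both arguments map a provider name to {"weekly": int, "monthly": int}.
    """
    guarded = {}
    for provider, counts in new_stats.items():
        prior = previous_stats.get(provider, {})
        guarded[provider] = {
            # A fresh 0 only wins when there was no real number before.
            field: prior.get(field, 0) if value == 0 and prior.get(field, 0) > 0 else value
            for field, value in counts.items()
        }
    return guarded


previous = {
    "amazon": {"weekly": 2_100_000, "monthly": 9_000_000},
    "brand-new": {"weekly": 0, "monthly": 0},
}
new = {
    "amazon": {"weekly": 0, "monthly": 0},      # both sources failed this build
    "brand-new": {"weekly": 0, "monthly": 0},   # no prior number, so 0 stays
}
result = apply_zero_guard(new, previous)
```

A provider that has never had a real number still records 0, so the guard does not hide a genuinely new or unused package.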
## Notes for review
- **Window anchoring.** The 7- and 30-day windows anchor on the dataset's
`max(date)`, not `today()`. The public dataset trails by a few days, so a
`today()`-relative window silently truncates to the loaded days and undercounts
(common-ai reads ~30k/week that way versus the correct ~49k/week). `max(date)`
gives a true rolling window that matches pypistats' `last_week`/`last_month`.
- **`total` stays 0.** The recent endpoint never provided it and the
registry has never shown it, so populating it is out of scope here.
- **Why keep pypistats at all.** The ClickHouse `demo` endpoint is
community-hosted with no uptime guarantee, so it is the bulk source but not the
only one. If it is unreachable the build falls back to pypistats and the guard
keeps prior numbers, so an outage degrades rather than zeroing the catalog.
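
To illustrate the window-anchoring note, here is a small self-contained sketch with made-up data (a flat 100 downloads per day, with the dataset three days behind). The dates, counts, and the `window_sum` helper are all hypothetical; only the anchoring idea comes from the notes above.

```python
from datetime import date, timedelta


def window_sum(daily, anchor, days):
    """Sum counts for the `days`-day window ending on `anchor` (inclusive)."""
    start = anchor - timedelta(days=days - 1)
    return sum(count for day, count in daily.items() if start <= day <= anchor)


# Made-up dataset: 30 loaded days, the newest one three days before "today".
latest_loaded = date(2025, 1, 10)
daily = {latest_loaded - timedelta(days=i): 100 for i in range(30)}
today = date(2025, 1, 13)

# Anchored on today: three of the seven days are not loaded yet.
truncated = window_sum(daily, today, 7)

# Anchored on the newest loaded day: a full seven-day window.
rolling = window_sum(daily, max(daily), 7)
```

In this toy data `truncated` covers four loaded days and `rolling` covers seven, which is the same kind of undercount described for common-ai.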
## Verification
With the new path, amazon resolves to ~9.0M/month, http to ~21.7M, and
common-ai to ~183k; all three previously showed 0.