kaxil opened a new pull request, #67670:
URL: https://github.com/apache/airflow/pull/67670

   ## The problem
   
   The registry records **0 monthly downloads** for 37 of ~101 providers, 
including ones that obviously aren't zero: amazon, http, docker, mysql, ssh. 
The amazon page shows 0 where the real figure is ~9M/month; http shows 0 against a real ~22M/month.
   
   ## Root cause
   
   The counts come from `registry-build`, which calls 
`pypistats.org/api/recent` once per package and fans those calls out ~86-wide 
(`fetch_pypi_data_parallel`). pypistats rate-limits per IP, so a chunk of every 
build comes back `429`. `fetch_pypi_downloads` catches the 429, logs a warning, 
and returns `{"weekly": 0, "monthly": 0}`, so whichever packages lose the race 
get written as zero. One wave's build log shows 20 such 429s in a single run. 
Because builds merge incrementally against the previous catalog, a zero written 
once sticks until that same provider is re-fetched successfully, which is why 
roughly a third of the catalog is currently stuck at 0.
   
   Tuning concurrency or adding retries doesn't address this: pypistats only 
has a per-package endpoint, so anything built on it makes N calls, and N 
parallel calls against one rate-limited host is the problem.
   
   ## The fix
   
   Source the counts from ClickHouse's public PyPI dataset 
(`sql-clickhouse.clickhouse.com`, the data behind clickpy.clickhouse.com, 
ultimately the same PyPI download logs pypistats and BigQuery draw from). It 
answers every provider in one query, so there is no burst and nothing to 
rate-limit.
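
A minimal sketch of what one bulk query could look like. The table `pypi.pypi_downloads_per_day` and its `project`, `date`, and `count` columns are assumptions about the public dataset's schema, and `build_bulk_query` is a hypothetical helper, not the function this PR adds. It only builds the SQL text; sending it to the endpoint is left out.

```python
import re

# Assumed table and column names; verify them against the live dataset.
TABLE = "pypi.pypi_downloads_per_day"
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def build_bulk_query(packages: list[str]) -> str:
    """Return one SQL statement covering every package (hypothetical helper)."""
    if not packages:
        raise ValueError("need at least one package")
    for name in packages:
        # Names are interpolated into SQL, so reject anything unusual.
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unexpected package name: {name!r}")
    names = ", ".join(f"'{name}'" for name in sorted(set(packages)))
    return (
        # Anchor both windows on the newest loaded day, not on today().
        f"WITH (SELECT max(date) FROM {TABLE}) AS anchor "
        "SELECT project, "
        "sumIf(count, date > anchor - 7) AS weekly, "
        "sum(count) AS monthly "
        f"FROM {TABLE} "
        f"WHERE project IN ({names}) AND date > anchor - 30 "
        "GROUP BY project"
    )
```

However many providers are passed in, the result is one statement and one round trip, which is the property the per-package pypistats endpoint cannot offer.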
   
   pypistats stays as a per-package fallback for what ClickHouse can't cover: a 
provider published so recently it isn't in the dataset yet, or one that 
genuinely returns zero. That path now runs for a handful of packages at most 
rather than all 86. A guard handles the doubly-failed case: if a build still 
ends up with 0 for a provider that had a real number in the previous catalog, 
it keeps the previous number instead of overwriting a known-good value with a 
spurious zero.
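
The resolution order described above (bulk result, then per-package fallback, then the previous catalog) can be sketched roughly as follows. `resolve_downloads` and its parameters are hypothetical names for illustration; `fallback` stands in for the pypistats call, and the real merge code in this PR may be structured differently.

```python
from typing import Callable

Counts = dict[str, int]
ZERO: Counts = {"weekly": 0, "monthly": 0}


def resolve_downloads(
    packages: list[str],
    bulk: dict[str, Counts],
    fallback: Callable[[str], Counts],
    previous: dict[str, Counts],
) -> dict[str, Counts]:
    """Pick a download count per package (hypothetical helper)."""
    resolved: dict[str, Counts] = {}
    for pkg in packages:
        counts = bulk.get(pkg, ZERO)
        if counts["monthly"] == 0:
            # Missing from the bulk result or reported as zero: ask per package.
            counts = fallback(pkg)
        if counts["monthly"] == 0 and previous.get(pkg, ZERO)["monthly"] > 0:
            # Both sources came back empty: keep the known-good number.
            counts = previous[pkg]
        resolved[pkg] = dict(counts)
    return resolved
```

A provider that is new and genuinely has no downloads still resolves to zero, since there is no previous value to preserve.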
   
   ## Notes for review
   
   - **Window anchoring.** The 7- and 30-day windows anchor on the dataset's 
`max(date)`, not `today()`. The public dataset trails by a few days, so a 
`today()`-relative window silently truncates to the loaded days and undercounts 
(common-ai reads ~30k/week that way versus the correct ~49k/week). `max(date)` 
gives a true rolling window that matches pypistats' `last_week`/`last_month`.
   - **`total` stays 0.** The recent endpoint never provided it and the 
registry has never shown it, so populating it is out of scope here.
   - **Why keep pypistats at all.** The ClickHouse `demo` endpoint is 
community-hosted with no uptime guarantee, so it is the bulk source but not the 
only one. If it is unreachable the build falls back to pypistats and the guard 
keeps prior numbers, so an outage degrades rather than zeroing the catalog.
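
To illustrate the window-anchoring note, here is a small self-contained sketch using synthetic data: a flat 7,000 downloads/day and an assumed three-day lag. These are made-up numbers chosen for easy arithmetic, not the real common-ai series, and `window_total` is a hypothetical helper.

```python
from datetime import date, timedelta


def window_total(daily: dict[date, int], anchor: date, days: int) -> int:
    """Sum the `days` calendar days ending at `anchor`, inclusive."""
    start = anchor - timedelta(days=days - 1)
    return sum(n for day, n in daily.items() if start <= day <= anchor)


# Synthetic data: 7,000 downloads/day, loaded only up to three days ago.
today = date(2025, 1, 31)
last_loaded = today - timedelta(days=3)
daily = {last_loaded - timedelta(days=i): 7_000 for i in range(30)}

truncated = window_total(daily, today, 7)      # only 4 of the 7 days are loaded
anchored = window_total(daily, max(daily), 7)  # all 7 days are loaded
print(truncated, anchored)  # 28000 49000
```

The `today`-anchored window silently drops the days that have not been loaded yet, while the `max(date)`-anchored window always spans seven loaded days.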
   
   ## Verification
   
   Against the new path: amazon resolves to ~9.0M/month, http ~21.7M, common-ai 
~183k, all previously 0.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
