FTR, I don't consider the top projects on PyPI to be representative of our user base, and *especially* not representative of people compiling native modules.

This is not a good way to evaluate the impact of breaking changes.

It would be far safer to assume that every change is going to break someone and evaluate:
* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for *them* than staying on the older version

This last one is particularly important, as many large organisations (anecdotally) seem to have settled on Python 3.7 for a while now. Inevitably, this means they're all going to be faced with a painful time when it comes to an upgrade, and every little change we add on is going to hurt more. Every extra thing that needs fixing is motivation to just rewrite in a new language with more hype (and the promise of better compatibility... which I won't comment specifically on, but I suspect they won't manage it any better than us ;) ).

This is not the case for the top PyPI projects. They incrementally update and crowdsource fixes, often from us. The pain is distributed to the level of permanent background noise, which sucks in its own way, but is ultimately not representative of much of our user base.

So by all means, use this tool for checking stuff. But it's not a substitute for justifying every incompatible change in its own right.

/rant

Cheers,
Steve

On 12/2/2021 11:44 PM, Victor Stinner wrote:
Hi,

I wrote two scripts based on INADA-san's work to (1) download the
source code of the PyPI top 5000 projects and (2) search for a regex
in these projects (compressed source archives).

You can use these tools if you work on an incompatible Python or C API
change to estimate how many projects are impacted.

The HPy project created a Git repository for a similar need (latest
update in June 2021):
https://github.com/hpyproject/top4000-pypi-packages

There are also online services for code search:

* GitHub: https://github.com/search
* https://grep.app/ (I didn't try it yet)
* Debian: https://codesearch.debian.net/


(1) Download

Script:
https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py

Usage: download_pypi_top.py PATH

It uses this JSON file:
https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json

From this service:
https://hugovk.github.io/top-pypi-packages/

As of December 1, out of 5000 projects, it downloads only 4760 tarball
and ZIP archives: I guess that 240 projects don't provide a source
archive. It takes around 5.2 GB of disk space.
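
The download step can be sketched roughly as follows. This is not the
code of download_pypi_top.py, only a minimal illustration. It assumes
that the JSON file has a "rows" list of {"project": name, ...} entries,
and that the source archive is looked up through the PyPI JSON API
(https://pypi.org/pypi/<name>/json). The function names are
hypothetical.

```python
import json
import urllib.request

# URL quoted in the message above.
TOP_PACKAGES_URL = (
    "https://hugovk.github.io/top-pypi-packages/"
    "top-pypi-packages-30-days.min.json"
)


def fetch_json(url):
    """Download a URL and parse its body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def top_project_names(payload):
    """Return the project names listed in the parsed top-packages JSON.

    Assumption: the payload has a "rows" list of {"project": name, ...}.
    """
    return [row["project"] for row in payload["rows"]]


def pick_sdist_url(project_metadata):
    """Return the URL of the source archive, or None if there is none.

    Assumption: project_metadata is the PyPI JSON API response for one
    project, whose "urls" list describes the files of the latest release.
    """
    for file_info in project_metadata.get("urls", []):
        if file_info.get("packagetype") == "sdist":
            return file_info["url"]
    return None


# Typical use (needs network access, so it is left as a comment):
#   names = top_project_names(fetch_json(TOP_PACKAGES_URL))
#   meta = fetch_json(f"https://pypi.org/pypi/{names[0]}/json")
#   url = pick_sdist_url(meta)  # None for wheel-only projects
```

A project for which pick_sdist_url() returns None would be one of the
projects that cannot be downloaded as a tarball or ZIP archive.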


(2) Code search

First, I used the fast and nice "ripgrep" tool with the command "rg
-zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches in ZIP and tarball
archives). But it doesn't show the path inside the archive and it
searches in files generated by Cython whereas I wanted to ignore these
files.

So I wrote a short Python script which decompresses tarball and ZIP
archives in memory and looks for a regex:
https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py

Usage: search_pypi_top.py "REGEX" output_filename

The command line parsing is hardcoded, and so is pypi_dir =
"PYPI-2021-12-01-TOP-5000" :-D

It ignores files generated by Cython and .so binary files (Linux
dynamic libraries).
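
The idea can be sketched as follows. This is not the code of
search_pypi_top.py, only a minimal illustration of searching inside
archives without extracting them to disk. It assumes that a
Cython-generated file can be recognized by the "Generated by Cython"
comment near its top, and that .so files can be skipped by suffix. The
function names are hypothetical.

```python
import re
import tarfile
import zipfile

# Assumption: compiled Linux extension modules are skipped by suffix.
IGNORED_SUFFIXES = (".so",)
# Assumption: Cython output starts with a "Generated by Cython" comment.
CYTHON_MARKER = b"Generated by Cython"


def iter_archive_members(archive_path):
    """Yield (member_name, content_bytes) for each regular file."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        # The default mode detects gzip/bz2/xz compression automatically.
        with tarfile.open(archive_path) as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()


def search_archive(archive_path, pattern):
    """Yield (member_name, line_number, line) for each matching line.

    pattern must be a bytes regex, since files are searched as bytes.
    """
    regex = re.compile(pattern)
    for name, data in iter_archive_members(archive_path):
        if name.endswith(IGNORED_SUFFIXES):
            continue
        # Only look at the start of the file for the Cython marker.
        if CYTHON_MARKER in data[:1024]:
            continue
        for lineno, line in enumerate(data.splitlines(), start=1):
            if regex.search(line):
                yield name, lineno, line.decode("utf-8", "replace")
```

For example, list(search_archive("project-1.0.tar.gz",
rb"PyUnicode_READY")) reports the path inside the archive for each
match, which is what "rg -zl" does not show.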

While "rg" is very fast, my script is very slow. But I don't care:
once the regex is written, I only need to run the search once, so I
can wait 10-15 min ;-) I prefer to wait longer and have a more
accurate result. Also, there is room for enhancement, like running
multiple jobs in different processes or threads.
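
The enhancement mentioned above could look roughly like this. It is
only a sketch under assumptions: search_one is a hypothetical callable
that takes one archive path and returns its matches, and a thread pool
is used for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def search_archives_parallel(archive_paths, search_one, max_workers=4):
    """Run search_one(path) on every archive, several at a time.

    Returns a dict mapping each archive path to the result of search_one.
    """
    paths = list(archive_paths)  # allow any iterable, keep a stable order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as the inputs.
        results = list(pool.map(search_one, paths))
    return dict(zip(paths, results))
```

Since regex matching is CPU-bound, replacing ThreadPoolExecutor with
ProcessPoolExecutor from the same module may scale better, provided
that search_one and its results can be pickled.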

Victor

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HO7PS57UCJPJLON2BJPPEBM7I3Q6AM2U/
Code of Conduct: http://python.org/psf/codeofconduct/
