On 5 May 2017 at 14:10, Gregory P. Smith <g...@krypto.org> wrote:
> This is not a solvable problem. IMNSHO, we should never attempt to implement
> pre-screening of packages.
> It is a good post-package-upload task for someone to try and do as a
> research project.
> Automated code scanning can only find already known things and similar
> signatures (at which point it can have false positives) and we aren't just
> talking about obfuscated source code. PyPI hosts binary wheels made using
> unreproducible build processes on untrusted machines created from
> unverifiable inputs. Scanning services such as Google's
> https://www.virustotal.com/en/about/ exist but I'm not sure that'd be of
> much value to PyPI.
Red Hat's approach to this (https://github.com/fabric8-analytics/)
relies heavily on "popularity within your cohort" as a proxy for
safety. It's far from a perfect approach (there's still a risk of the
"bystander effect" coming into play, where everyone assumes everyone
else is handling the security audits), but it at least gives people a
heads-up when they're doing something relatively unusual, suggesting
they may want to take more care and treat their potential dependencies
with a bit more suspicion.
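To make the idea concrete (this is just a toy sketch of the general
"popularity as a safety proxy" heuristic, not how fabric8-analytics
actually computes its recommendations - the function name, threshold,
and data shapes are all made up for illustration):

```python
from collections import Counter

def flag_unusual(project_deps, cohort_dep_lists, threshold=0.05):
    """Return deps used by less than `threshold` of cohort projects.

    `cohort_dep_lists` is a list of dependency lists, one per project
    in the cohort of "similar" projects.
    """
    counts = Counter()
    for deps in cohort_dep_lists:
        counts.update(set(deps))  # count each dep once per project
    n = len(cohort_dep_lists)
    return sorted(d for d in project_deps if counts[d] / n < threshold)

# A dependency that none of your cohort uses gets flagged for a
# closer look, rather than being blocked outright:
cohort = [
    ["requests", "numpy"],
    ["requests", "flask"],
    ["requests", "numpy", "flask"],
]
flag_unusual(["requests", "obscure-pkg"], cohort)  # -> ["obscure-pkg"]
```

The point being that the output is a nudge ("this choice is unusual
for projects like yours"), not a verdict on whether the package is
malicious.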
P.S. Full disclosure: until I switched teams a few months ago, working
on fabric8-analytics (and its precursor projects) was my day job at
Red Hat. As far as I'm aware, the current version still doesn't take
the raw PyPI BigQuery download data into account, but it does track
component usage across public GitHub repositories - the benefit of
focusing on the latter is that it gives you co-occurrence information
(i.e. "component X is often used in combination with component Y"),
rather than the raw popularity metrics offered by the download numbers
(which can also be heavily skewed by artifact caches, and the lack
thereof, in automated build and test pipelines).
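For anyone wondering what co-occurrence data looks like in practice,
here's a minimal sketch (again, purely illustrative - it just counts
pairs of components appearing together in per-repo dependency lists,
which is the basic signal, not the actual fabric8-analytics pipeline):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(dep_lists):
    """Count how often each pair of components appears together.

    Each element of `dep_lists` is the dependency list of one
    scanned repository; pairs are normalised by sorting so that
    (X, Y) and (Y, X) count as the same pair.
    """
    pairs = Counter()
    for deps in dep_lists:
        pairs.update(combinations(sorted(set(deps)), 2))
    return pairs

repos = [
    ["numpy", "scipy"],
    ["numpy", "scipy", "pandas"],
    ["numpy", "pandas"],
]
pairs = cooccurrence(repos)
pairs[("numpy", "scipy")]  # -> 2
```

Unlike raw download counts, a table like this lets you say "projects
that use X usually also use Y", which survives the skew introduced by
CI caches since it's derived from what repositories declare rather
than from how often artifacts get fetched.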
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
PSF-Community mailing list