Just an update: asyncmongo has now been released to PyPI, so I’ve removed
it from the gists as well. Still no word back from PIL.

On May 18, 2014, at 11:21 AM, Donald Stufft <[email protected]> wrote:

> 
> On May 18, 2014, at 2:20 AM, holger krekel <[email protected]> wrote:
> 
>> On Sat, May 17, 2014 at 20:20 -0400, Donald Stufft wrote:
>>> On May 17, 2014, at 1:51 PM, holger krekel <[email protected]> wrote:
>>> 
>>>> On Sat, May 17, 2014 at 11:32 -0400, Donald Stufft wrote:
>>>>> More conclusions!
>>>>> 
>>>>> In that same time period PyPI received a total of ~16463209 hits to a
>>>>> page on the simple installer API. This means that in total these
>>>>> projects represent a combined 0.56% of the simple installer traffic on
>>>>> PyPI. However, looking at the numbers you can see that PIL is an obvious
>>>>> outlier, with the hits dropping drastically after that. PIL on its own
>>>>> represents 0.44% of the hits on PyPI during that time period, leaving
>>>>> only 0.12% for anything not PIL.
>>>> 
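As a rough cross-check of the shares quoted above, here is a minimal sketch that converts them back into hit counts. The constant and function names are made up for illustration, and because the quoted percentages are already rounded the results are only approximate. The figure for the combined share lands close to the ~92193 mentioned in the follow-up question.

```python
# Figures quoted in the thread; the names are illustrative only.
TOTAL_SIMPLE_HITS = 16_463_209   # "~16463209 hits" to simple API pages
COMBINED_SHARE = 0.0056          # 0.56% for all of the listed projects
PIL_SHARE = 0.0044               # 0.44% for PIL on its own


def hits_for_share(total_hits: int, share: float) -> int:
    """Approximate number of hits that a fractional share represents."""
    return round(total_hits * share)


def remaining_share(combined: float, outlier: float) -> float:
    """Share left once the outlier is removed, rounded to 4 decimal places."""
    return round(combined - outlier, 4)


if __name__ == "__main__":
    print("all listed projects:", hits_for_share(TOTAL_SIMPLE_HITS, COMBINED_SHARE))
    print("PIL alone:", hits_for_share(TOTAL_SIMPLE_HITS, PIL_SHARE))
    print("everything but PIL:", remaining_share(COMBINED_SHARE, PIL_SHARE))
```
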
>>>> So the current numbers roughly mean that around 92193 end-user sites per
>>>> day depend on crawling currently, right?  Do you know if these are also
>>>> unique IPs (they might indicate duplicates although companies also have
>>>> NATting firewalls)?
>>>> 
>>>> holger
>>> 
>>> Here’s the number of IP addresses that accessed each /simple/ page per day.
>>> 
>>> https://gist.github.com/dstufft/347112c3bcc91220e4b2
>>> 
>>> Unique IPs: 95541
>>> Unique IPs for Only Hosted off PyPI: 8248 (8.63%)
>>> Unique IPs for Only Hosted off PyPI w/o PIL: 2478 (2.59%)
>>> 
>>> It's important to remember when looking at these numbers that almost all of
>>> them represent something downloading a package unsafely, which will
>>> generally contain Python code that will then be executed. Breaking the
>>> unsafe thing is, in my opinion, non-optional, and the only thing that needs
>>> to be discussed about it is how exactly to go about doing it. The safe
>>> thing I think *should* be removed for the various other reasons that have
>>> been outlined, and it only represents a tiny fraction of uses.
>>> 
>>> To be specific, the numbers are: 8248 of the above 8248 IPs downloaded
>>> something unsafely, while 214 of them also downloaded something safely.
>>> That means that 100% of the 8248 addresses could have been attacked through
>>> their use of PyPI and only 2.59% downloaded anything that was safely hosted
>>> off of PyPI.
>>> 
>>> Looking at the same numbers for projects which have *any* files hosted off
>>> of PyPI (the numbers thus far have been for projects which have *only* files
>>> hosted off of PyPI), I see that 35046 IP addresses accessed a project that
>>> had any files unsafely hosted off of PyPI, while only 2852 IP addresses
>>> accessed a project that had any files safely hosted off of PyPI.
>>> 
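To make the safe/unsafe split above a bit more concrete, here is a minimal sketch of how a single download link might be bucketed. It assumes that "safely hosted off of PyPI" means an external link that carries a checksum fragment, and that anything else external is "unsafe"; that reading, the host set, the bucket names and the example URLs are all assumptions for illustration, not something spelled out in this thread. It also ignores links that are only found by crawling other pages.

```python
from urllib.parse import urlparse

# Assumptions for illustration only: which hosts count as PyPI and which
# fragment names count as a checksum are guesses, not taken from the thread.
PYPI_HOSTS = {"pypi.python.org"}
HASH_NAMES = {"md5", "sha1", "sha224", "sha256", "sha384", "sha512"}


def classify_link(url: str) -> str:
    """Return 'pypi', 'external-checksummed' or 'external-unverified'."""
    parts = urlparse(url)
    if parts.hostname in PYPI_HOSTS:
        return "pypi"
    # A checksum is expected as a '#<name>=<hexdigest>' URL fragment.
    name, sep, digest = parts.fragment.partition("=")
    if sep and name in HASH_NAMES and digest:
        return "external-checksummed"
    return "external-unverified"


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the three buckets.
    for link in (
        "https://pypi.python.org/packages/source/e/example/example-1.0.tar.gz",
        "http://downloads.example.com/example-1.0.tar.gz#md5=d41d8cd98f00b204e9800998ecf8427e",
        "http://downloads.example.com/example-1.0.tar.gz",
    ):
        print(classify_link(link), link)
```
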
>>> That means that, at a minimum, ~36% of the users of PyPI were vulnerable
>>> to a MITM attack on 2014-05-14 unless they were using pip 1.5 without any
>>> --allow-unverified flags or they were using pip 1.4 with
>>> --no-allow-insecure, and even in that case they could still be vulnerable
>>> if there is any use of setup_requires. I say that's a minimum because that
>>> only counts the projects where I happened to find a file hosted unsafely
>>> externally. It does not count at all any projects for which I did not find
>>> a file like that but which still have locations like that on their simple
>>> page. This is especially troublesome for projects that have old domain
>>> names in those links that point to domains that are no longer registered.
>>> 
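The unique-IP percentages in this message can be reproduced with a few divisions. This is only a sketch: the constant names are invented here, and using the 95541 unique IPs as the denominator for the "~36%" floor is an assumption, since the message does not say which denominator was used.

```python
# Counts quoted in the message; the constant names are illustrative only.
UNIQUE_IPS = 95_541            # all unique IPs
ONLY_EXTERNAL = 8_248          # IPs hitting projects hosted only off PyPI
ONLY_EXTERNAL_NO_PIL = 2_478   # the same, leaving out PIL
ALSO_SAFE = 214                # of the 8248, also downloaded something safely
ANY_UNSAFE = 35_046            # IPs hitting projects with any unsafe external file


def percent(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


if __name__ == "__main__":
    print("only hosted off PyPI:", percent(ONLY_EXTERNAL, UNIQUE_IPS))
    print("only hosted off PyPI w/o PIL:", percent(ONLY_EXTERNAL_NO_PIL, UNIQUE_IPS))
    print("safe share of the 8248:", percent(ALSO_SAFE, ONLY_EXTERNAL))
    # Assumed denominator: all unique IPs.
    print("floor of vulnerable users:", percent(ANY_UNSAFE, UNIQUE_IPS))
```

Under that assumption the last figure comes out slightly above 36%, which fits with the "~36%" being described as a floor.
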
>>> Also, just FYI, I've removed pyPDF from both lists as I've contacted the
>>> author and there are now packages hosted on PyPI for it. I've also
>>> contacted PIL and a few other authors (of which I've just heard back from
>>> cx_Oracle, and they appear to be willing to upload as well).
>> 
>> Thanks Donald for both the numbers and contacting some key authors, which
>> I think is a very good move!  I suggest we now wait a week or so to see
>> where we stand then, update the numbers and then try to settle on
>> crawl-deprecation paths.
>> 
>> Also, let's please just talk about "checksummed" packages or integrity.  
>> Even all pypi hosted packages are unsafe in the sense that they 
>> might contain bad code from malicious uploaders or http-interceptors 
>> that executes on end-user machines during installation.  Thus the term
>> "safe" is misleading and should not be used when communicating to
>> end-users.  Currently, we can only say or improve anything related to
>> integrity: what people download is what was uploaded by whoever happened
>> to have the credentials (*) or MITM access on http upload.  Speaking of the
>> latter, maybe we should also think about moving to https uploads and
>> certificate-pinning, and that also for installers.  And also, as Marius
>> pointed out, pypi is currently using the relatively weak MD5 hash.
> 
> The problem with upload is that when people use setup.py upload they are
> often using the upload from distutils. Since that is in the standard library
> we can't really go backwards in time and make it safe. People who use my
> twine utility to upload instead of setup.py upload are not vulnerable to
> MITM on upload.
> 
> While I don't particularly like the MD5 hash, it's not true that the MD5 hash
> currently presents a problem against the threat model that we're worried
> about. It's relatively easy to mount a collision attack, which would mean
> that a malicious author could generate two packages, an unsafe and a safe
> one, that hash to the same thing. However, MD5 is still resistant to 2nd
> preimage attacks, so an attacker could not create a package that hashes to a
> given hash.
> 
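For readers who want to see what the integrity check amounts to on the installer side, here is a minimal sketch using only the standard library. The function name and the example payloads are made up for illustration. Passing the check only shows that the downloaded bytes match the advertised digest, which is the property second-preimage resistance protects; it says nothing about whether the uploader was honest.

```python
import hashlib
import hmac


def verify(data: bytes, algorithm: str, expected_hex: str) -> bool:
    """Return True if `data` hashes to `expected_hex` under `algorithm`."""
    actual_hex = hashlib.new(algorithm, data).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


if __name__ == "__main__":
    payload = b"pretend this is a source distribution"  # placeholder bytes
    advertised = hashlib.md5(payload).hexdigest()

    print(verify(payload, "md5", advertised))         # untouched download
    print(verify(payload + b"!", "md5", advertised))  # tampered download
```

The same helper works unchanged with a stronger algorithm name such as "sha256", so moving away from MD5 would not change the shape of the check.
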
>> 
>> Without resolving these issues we can not even truthfully declare
>> integrity as something that the pypi-hosted packages themselves are
>> providing.
> 
> We cannot fix every problem at once. Right now the tools exist for authors to
> make it possible to do everything safely. The externally hosted files
> represent an easier to exploit attack than a MITM on author upload. The MITM
> requires a privileged network position on specific individuals who are also
> not using twine or the browser to upload their distributions.
> 
> Attacking people who are installing these packages is far easier. It would
> either require a privileged network position on one of ~90k IP addresses on
> any particular day (a much easier feat than for authors periodically) or,
> even easier, locating an expired domain registration and simply registering
> the domain, which wouldn't require a privileged network position at all.
> 
>> 
>> best,
>> holger
>> 
>> (*) did you happen to have run some password crackers against
>> the pypi database?  Might be a larger attack vector than hijacking
>> DNS entries.
> 
> No, I have not. The database currently uses bcrypt with a work factor of 12,
> which makes it computationally hard for me to brute force passwords for all
> ~30k users who have a password set. If there was a specific user I was
> interested in, a smart brute force attack might be able to locate something.
> Rate-limiting log in attempts is also on the list of things to add in
> Warehouse.
> 
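A back-of-the-envelope sketch of why the work factor matters: bcrypt's cost parameter is the base-2 logarithm of its iteration count, and each stored hash has its own salt, so guesses cannot be shared between users. The guess count and hash rate below are placeholders picked purely for illustration, not measurements of any real hardware or of the PyPI database; only the work factor and the ~30k user count come from the message.

```python
def bcrypt_rounds(work_factor: int) -> int:
    """bcrypt's cost parameter is a base-2 logarithm of the iteration count."""
    return 2 ** work_factor


def worst_case_seconds(users: int, guesses_per_user: int,
                       hashes_per_second: float) -> float:
    """Time to try every guess against every user's individually salted hash."""
    return users * guesses_per_user / hashes_per_second


if __name__ == "__main__":
    USERS = 30_000             # "~30k users" from the message
    GUESSES = 10_000           # hypothetical dictionary size
    HASHES_PER_SECOND = 4.0    # hypothetical rate at work factor 12

    print("iterations at work factor 12:", bcrypt_rounds(12))
    print("all users, seconds:",
          worst_case_seconds(USERS, GUESSES, HASHES_PER_SECOND))
    print("one targeted user, seconds:",
          worst_case_seconds(1, GUESSES, HASHES_PER_SECOND))
```

Whatever rate is plugged in, the cost scales linearly with the number of users, which is why attacking one specific account is so much cheaper than attacking all of them.
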
> -----------------
> Donald Stufft
> PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
> 
> _______________________________________________
> Distutils-SIG maillist  -  [email protected]
> https://mail.python.org/mailman/listinfo/distutils-sig


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

