How to avoid confusables. These scripts are recommended for use in identifiers: http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
This report details a confusables detection algorithm:
http://www.unicode.org/reports/tr39/#Confusable_Detection
ICU implements it: http://www.icu-project.org/apiref/icu4c/uspoof_8h.html (see also PyICU).

The package index would enforce uniqueness of the "skeleton" of each registered package name. The skeleton is just an internal normalization based on confusability: if skeleton(identifier1) == skeleton(identifier2), then identifier1 and identifier2 are confusable.

The tooling could get away with a simpler rule like

    re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)

As a bonus on top of including the world, the skeleton check should also prevent people from exchanging zeroes for capital O.

On Wed, May 15, 2013 at 7:17 AM, Eric V. Smith <[email protected]> wrote:
> On 05/15/2013 07:10 AM, Donald Stufft wrote:
>>>>> Anyone want to run a scan over the PyPI package set to see
>>>>> how many packages would cause problems for a "[a-zA-Z0-9_.-]"
>>>>> only filter?
>>>>
>>>> See my previous email where I did queries against my local DB.
>>>> It's 225 total projects that wouldn't be allowed.
>>>
>>> Can you send the list of those projects?
>>>
>>> Eric.
>>>
>>
>> Here you go https://gist.github.com/dstufft/5583225 used a Python
>> oneliner and the PyPI API so others can reproduce easily if they
>> wish.
>
> Perfect. Thanks.
>
> It looks like space causes most of the issues. I'm not sure how
> "Twisted Flow >= 1.0" would be expected to parse.
>
> Eric.
>
>
> _______________________________________________
> Distutils-SIG maillist - [email protected]
> http://mail.python.org/mailman/listinfo/distutils-sig
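
For anyone who wants to experiment with the two rules from the top of this message, here is a rough, stdlib-only sketch. It is not the ICU implementation: TOY_CONFUSABLES is a tiny hand-picked table standing in for the real TR39 confusables data, and the names toy_skeleton, confusable and safe_name are made up for illustration. The shape (NFD, map each character to a prototype, NFD again) follows the skeleton definition in TR39; the simpler rule is the re.sub expression, with the flag passed by keyword.

```python
import re
import unicodedata

# Hypothetical, hand-picked pairs for illustration only; the real
# TR39 data file (confusables.txt) is far larger.
TOY_CONFUSABLES = {
    "0": "O",        # DIGIT ZERO -> LATIN CAPITAL LETTER O
    "\u0430": "a",   # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
    "\u043e": "o",   # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
}


def toy_skeleton(identifier):
    """Normalize to NFD, map each character to its prototype, normalize again."""
    decomposed = unicodedata.normalize("NFD", identifier)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(id1, id2):
    """Two identifiers are treated as confusable when their skeletons match."""
    return toy_skeleton(id1) == toy_skeleton(id2)


def safe_name(distribution):
    """The simpler tooling rule: collapse runs of other characters to '_'."""
    return re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)


if __name__ == "__main__":
    print(confusable("F00", "FOO"))
    print(confusable("p\u0430ck\u0430ge", "package"))
    print(safe_name("Twisted Flow >= 1.0"))
```

An index using something like this would store the skeleton next to each registered name and reject a new registration whose skeleton is already taken. Note that the simpler rule also turns "-" into "_", since "-" is not matched by \w.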
