Thank you. We can ignore all the "-withoutworldwriteables" ones because PAUSE generated those.
Last time I did this sort of thing (for figuring out how to map CT reports before Metabase), the mapping wasn't too bad and there were only a handful of cases where I had to email authors and find out what distribution was what. (Here is my "override" file that took precedence over other heuristics: https://gist.github.com/dagolden/5161146 )

Usually, they were actual duplicates of the same code, possibly originally unauthorized. I don't think I found any that had the same name and truly different code.

Which is why (a) do nothing could still be an option. No one has done anything malicious to date (that we know of).

David

On Wed, Mar 13, 2013 at 10:45 PM, kenichi ishigaki <kishig...@gmail.com> wrote:
> The following are lists of the duplicated distributions found in
> uploads.db (as of 3/13 JST)
>
> 1) duplicated cpan distributions uploaded by different authors (80 dists)
> https://gist.github.com/charsbar/a5c2452128b5fd6e5b69
>
> 2) duplicated cpan/backpan distributions uploaded by different authors
> (317 dists)
> https://gist.github.com/charsbar/e10df3c150f4db9bd2a6
>
> 3) duplicated cpan/backpan distributions uploaded by different
> authors, or uploaded with different file extensions (668 dists)
> https://gist.github.com/charsbar/8db90370168d8c28a504
>
> Some of them were uploaded by an unauthorized author (probably by
> mistake), but not a few were uploaded by different but authorized
> authors who probably forgot to update the version. It may be useful to
> check their release date, but not sure if it always works.
>
>
> 2013/3/14 David Golden <x...@xdg.me>:
>> tl;dr: non-unique distribution names are annoying and create a
>> security hole on rt.cpan.org. Fixing it may not be trivial.
>>
>> ## Terminology and context ##
>>
>> By "distribution", I generally mean the unique path of a CPAN
>> distribution in the authors/id/X/XY directory. I may occasionally
>> refer to this as a "distfile" for specific clarity.
>> The "distribution name" is the portion of the basename without
>> version or suffix.
>>
>>     distribution: DAGOLDEN/Foo-Bar-1.23.tar.gz
>>     author: DAGOLDEN
>>     distribution name: Foo-Bar
>>     version: 1.23
>>     suffix: .tar.gz
>>
>> Distributions contain modules (.pm files), which contain packages
>> (namespaces declared by "package NAME"). PAUSE indexes packages and
>> associates them with a source distribution. PAUSE has a system of
>> permissions for packages and ensures that distributions are unique.
>>
>> ## Background ##
>>
>> Yesterday, I conducted a test of the CPAN ecosystem by uploading two
>> distributions:
>>
>> * DAGOLDEN/Acme-CPAN-Testers-UNKNOWN-0.03.tar.gz
>> * DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz
>>
>> These were intentionally constructed to have the same "distribution
>> name" as these existing distributions:
>>
>> * BINGOS/Acme-CPAN-Testers-UNKNOWN-0.02.tar.gz
>> * BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz
>>
>> Note that one has the same version number and one does not. And thank
>> you to BinGOs for volunteering his Acme modules.
>>
>> While my distributions had the same "name", they contained completely
>> different, unindexed packages. BinGOs and I do not share co-maint on
>> any of the packages involved.
>>
>> I observed the following after PAUSE accepted the distributions and
>> indexed the packages:
>>
>> (1) metacpan.org and search.cpan.org incorrectly linked my
>> distributions to BinGOs'. E.g. they both believe the latest
>> "Acme-CPAN-Testers-UNKNOWN" is mine, though their contents and primary
>> maintainers are completely different.
>>
>> (2) rt.cpan.org treated both distributions as having the same RT
>> queue. I gained administrative access to BinGOs' existing queues.
>>
>> (3) cpantesters.org treated both distributions as one for the purpose
>> of aggregating test reports[1]
>>
>>
>> ## Implications ##
>>
>> The first observation is probably annoying but not outright dangerous.
>> I don't think any installers use latest release data to guess which
>> tarball to install. However, one could, for instance, upload a
>> distribution with a duplicate name to a popular package but with a
>> much higher version number and it could effectively mask the original
>> for some types of queries or web requests.
>>
>> The second observation is a security hole. Anyone can gain
>> administrative rights to an RT queue simply by uploading a
>> distribution of the same name. (I don't know if RT rejects
>> distributions containing unauthorized packages, but it really doesn't
>> matter as far as threat vectors go.)
>>
>> The third observation is also annoying but not dangerous. Someone
>> could intentionally or unintentionally upload a duplicate distribution
>> name and pollute existing test results.
>>
>> Like many things on CPAN, if we sort of trust everyone to act
>> decently, we can probably ignore this, just like we ignore all the
>> *.PL files that run arbitrary code on installation.
>>
>> However, all of these point to the same underlying flaw: using a
>> non-unique data element as a unique key. This creates a common point
>> of failure.
>>
>> ## Solutions ##
>>
>> Here's where I start brainstorming. If we can get some good
>> discussion on this list, then maybe we could finalize a plan at the QA
>> hackathon, which will have a number of the relevant
>> maintainers/administrators attending.
>>
>> (a) We could do nothing; we've lived with it and can continue to live
>> with it and will police any incidents on a one-off basis
>>
>> (b) We could extend PAUSE's permission system to distribution names as
>> well, so that distribution names would have primary/co-maint rights
>> just as packages do. This would not fix any existing duplicates, but
>> would prevent future infractions. It means changing a lot of PAUSE
>> code, but would allow RT, search sites and CPAN Testers (CT) to pretty
>> much remain as is.
>>
>> (c) We could restrict PAUSE to allow only "well formed" distribution
>> names[2] -- ones matching the name of a module inside that contains a
>> package of a corresponding name. E.g. "Foo-Bar", containing
>> "Foo/Bar.pm" with package "Foo::Bar". The existing package permissions
>> system becomes the chokepoint to restrict abuse. Existing distributions
>> with non-conforming names (e.g. libwww-perl) either change for their
>> next release or get grandfathered somehow.
>>
>> (d) RT, search sites and CT stop using distribution name as a key and
>> revert either to package names or to distfile in some fashion. This
>> is not a trivial amount of code change and -- in the case of RT --
>> might make RT much more complicated and less useful.
>>
>> (e) We could develop a new, unique way to identify collections of
>> related packages. This could be based on some combination of
>> distribution name and the name of an authorized package it contains,
>> or perhaps just on the name of a "primary" package. RT and the search
>> sites would need to migrate to a new data model and probably change
>> their HTTP routes to match.
>>
>> (f) Something else
>>
>> I welcome some thoughts and discussion. Even if the ultimate
>> conclusion is (a), I think it would be best to select that
>> intentionally, not default to it through apathy.
>>
>> -- David
>>
>> Notes:
>>
>> [1] The Metabase backend for CT correctly distinguishes reports by
>> distfile, but this is not yet reflected downstream in reporting.
>>
>> [2] See
>> http://www.dagolden.com/index.php/308/packages-modules-and-distributions/
>>
>> --
>> David Golden <x...@xdg.me>
>> Take back your inbox! → http://www.bunchmail.com/
>> Twitter/IRC: @xdg

--
David Golden <x...@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg
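
For illustration only: the terminology in the quoted message (author, distribution name, version, suffix) and the kind of duplicate check behind the gists can be sketched roughly as below. This is a minimal Python sketch, not the code that produced those lists and not how PAUSE, RT or CPAN Testers parse distfiles. The suffix list, the regular expression and the function names (parse_distfile, authors_sharing_names) are assumptions made up for the example, and real distfile names are messier than this pattern allows.

```python
import re
from collections import defaultdict

# Assumed suffix list for this sketch; real uploads use other archive types too.
SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".zip")

# Hypothetical, simplified pattern for "AUTHOR/Name-Version" (suffix removed).
# The version is assumed to start with a digit (optionally "v") and to
# contain no hyphens, so the name ends at the last hyphen.
STEM_RE = re.compile(
    r"^(?P<author>[A-Z][A-Z0-9-]*)/(?P<name>[^/]+)-(?P<version>v?\d[\w.]*)$"
)


def parse_distfile(distfile):
    """Split a distfile such as 'DAGOLDEN/Foo-Bar-1.23.tar.gz' into its parts."""
    suffix = next((s for s in SUFFIXES if distfile.endswith(s)), None)
    if suffix is None:
        raise ValueError(f"unrecognized suffix: {distfile}")
    match = STEM_RE.match(distfile[: -len(suffix)])
    if match is None:
        raise ValueError(f"cannot parse: {distfile}")
    return {**match.groupdict(), "suffix": suffix}


def authors_sharing_names(distfiles):
    """Map each distribution name to its authors, keeping only shared names."""
    authors_by_name = defaultdict(set)
    for distfile in distfiles:
        info = parse_distfile(distfile)
        authors_by_name[info["name"]].add(info["author"])
    return {
        name: sorted(authors)
        for name, authors in authors_by_name.items()
        if len(authors) > 1
    }


if __name__ == "__main__":
    # The four distfiles named in the quoted message, plus its Foo-Bar example.
    uploads = [
        "DAGOLDEN/Acme-CPAN-Testers-UNKNOWN-0.03.tar.gz",
        "DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
        "BINGOS/Acme-CPAN-Testers-UNKNOWN-0.02.tar.gz",
        "BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
        "DAGOLDEN/Foo-Bar-1.23.tar.gz",
    ]
    print(parse_distfile(uploads[-1]))
    print(authors_sharing_names(uploads))
```

The sketch keys on the distribution name alone, which is exactly the non-unique key the quoted message complains about: the distfile (author plus basename) is unique, the name is not.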