tl;dr: non-unique distribution names are annoying and create a security hole on rt.cpan.org. Fixing it may not be trivial.
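
To make "non-unique distribution names" concrete, here is a minimal Python sketch, not PAUSE's or RT's actual code. It assumes a distfile looks like AUTHOR/Name-Version.suffix; the regex and the helper names `split_distfile` and `group_by_name` are made up for illustration. It splits a distfile into its parts, then shows how two distfiles from different authors land on the same key when a service indexes by distribution name alone.

```python
import re
from collections import defaultdict

# Hypothetical pattern for AUTHOR/Name-Version.suffix. Real CPAN distfiles
# are messier than this; it only covers the simple cases used here.
DISTFILE_RE = re.compile(
    r"^(?P<author>[A-Z][A-Z0-9-]*)/"
    r"(?P<name>.+)-"
    r"(?P<version>v?\d[\d._]*)"
    r"(?P<suffix>\.tar\.gz|\.tgz|\.zip)$"
)


def split_distfile(distfile):
    """Split a distfile into author, distribution name, version and suffix."""
    match = DISTFILE_RE.match(distfile)
    if match is None:
        raise ValueError(f"unrecognized distfile: {distfile}")
    return match.groupdict()


def group_by_name(distfiles):
    """Group uploading authors under the distribution name alone.

    This mimics a service that uses the distribution name as its key:
    every author who uploads the same name ends up in the same bucket.
    """
    buckets = defaultdict(list)
    for distfile in distfiles:
        parts = split_distfile(distfile)
        buckets[parts["name"]].append(parts["author"])
    return dict(buckets)


if __name__ == "__main__":
    uploads = [
        "BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
        "DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
    ]
    for name, authors in group_by_name(uploads).items():
        print(name, authors)
```

The distfiles stay distinct because the author directory is part of the path, but the name-only key merges them.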

## Terminology and context ##

By "distribution", I generally mean the unique path of a CPAN distribution in the authors/id/X/XY directory. I may occasionally refer to this as a "distfile" for specific clarity. The "distribution name" is the portion of the basename without version or suffix.

    distribution:      DAGOLDEN/Foo-Bar-1.23.tar.gz
    author:            DAGOLDEN
    distribution name: Foo-Bar
    version:           1.23
    suffix:            .tar.gz

Distributions contain modules (.pm files), which contain packages (namespaces declared by "package NAME"). PAUSE indexes packages and associates them with a source distribution. PAUSE has a system of permissions for packages and ensures that distributions are unique.

## Background ##

Yesterday, I conducted a test of the CPAN ecosystem by uploading two distributions:

  * DAGOLDEN/Acme-CPAN-Testers-UNKNOWN-0.03.tar.gz
  * DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz

These were intentionally constructed to have the same "distribution name" as these existing distributions:

  * BINGOS/Acme-CPAN-Testers-UNKNOWN-0.02.tar.gz
  * BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz

Note that one has the same version number and one does not. And thank you to BinGOs for volunteering his Acme modules.

While my distributions had the same "name", they contained completely different, unindexed packages. BinGOs and I do not share co-maint on any of the packages involved.

I observed the following after PAUSE accepted the distributions and indexed the packages:

(1) metacpan.org and search.cpan.org incorrectly linked my distributions to BinGOs'. E.g. they both believe the latest "Acme-CPAN-Testers-UNKNOWN" is mine, though their contents and primary maintainers are completely different.

(2) rt.cpan.org treated both distributions as having the same RT queue. I gained administrative access to BinGOs' existing queues.

(3) cpantesters.org treated both distributions as one for the purpose of aggregating test reports.[1]

## Implications ##

The first observation is probably annoying but not outright dangerous.
I don't think any installers use latest release data to guess which tarball to install. However, one could, for instance, upload a distribution with the same name as a popular one but with a much higher version number, and it could effectively mask the original for some types of queries or web requests.

The second observation is a security hole. Anyone can gain administrative rights to an RT queue simply by uploading a distribution of the same name. (I don't know if RT rejects distributions containing unauthorized packages, but it really doesn't matter as far as threat vectors go.)

The third observation is also annoying but not dangerous. Someone could intentionally or unintentionally upload a duplicate distribution name and pollute existing test results.

Like many things on CPAN, if we sort of trust everyone to act decently, we can probably ignore this, just like we ignore all the *.PL files that run arbitrary code on installation.

However, all of these point to the same underlying flaw: using a non-unique data element as a unique key. This creates a common point of failure.

## Solutions ##

Here's where I start brainstorming. If we can get some good discussion on this list, then maybe we could finalize a plan at the QA hackathon, which will have a number of the relevant maintainers/administrators attending.

(a) We could do nothing; we've lived with it, can continue to live with it, and will police any incidents on a one-off basis.

(b) We could extend PAUSE's permission system to distribution names as well, so that distribution names would have primary/co-maint rights just as packages do. This would not fix any existing duplicates, but would prevent future infractions. It means changing a lot of PAUSE code, but would allow RT, search sites and CPAN Testers (CT) to pretty much remain as is.

(c) We could restrict PAUSE to allow only "well formed" distribution names[2] -- ones matching the name of a module inside that contains a package of the corresponding name. E.g. "Foo-Bar", containing "Foo/Bar.pm" with package "Foo::Bar". The existing package permissions system becomes the chokepoint to restrict abuse. Existing distributions with non-conforming names (e.g. libwww-perl) either change for their next release or get grandfathered somehow.

(d) RT, search sites and CT stop using distribution name as a key and revert either to package names or to distfile in some fashion. This is not a trivial amount of code change and -- in the case of RT -- might make RT much more complicated and less useful.

(e) We could develop a new, unique way to identify collections of related packages. This could be based on some combination of distribution name and the name of an authorized package it contains, or perhaps just on the name of a "primary" package. RT and the search sites would need to migrate to a new data model and probably change their HTTP routes to match.

(f) Something else.

I welcome some thoughts and discussion. Even if the ultimate conclusion is (a), I think it would be best to select that intentionally, not default to it through apathy.

-- David

Notes:

[1] The Metabase backend for CT correctly distinguishes reports by distfile, but this is not yet reflected downstream in reporting.

[2] See http://www.dagolden.com/index.php/308/packages-modules-and-distributions/

--
David Golden <x...@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg