tl;dr: non-unique distribution names are annoying and create a
security hole on rt.cpan.org.  Fixing it may not be trivial.

## Terminology and context ##

By "distribution", I generally mean the unique path of a CPAN
distribution in the authors/id/X/XY directory.  I may occasionally
refer to this as a "distfile" for specific clarity.  The "distribution
name" is the portion of the basename without version or suffix.

    distribution: DAGOLDEN/Foo-Bar-1.23.tar.gz
    author: DAGOLDEN
    distribution name: Foo-Bar
    version: 1.23
    suffix: .tar.gz
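
To make the breakdown above concrete, here is a minimal Python sketch that splits a distfile into those parts.  The function name `parse_distfile`, the suffix list and the version pattern are assumptions made for illustration; they are not how PAUSE or any CPAN client actually parses filenames.

```python
import re

# Suffixes assumed for this sketch; real CPAN tooling recognizes more.
SUFFIXES = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


def parse_distfile(distfile):
    """Split 'AUTHOR/Name-1.23.tar.gz' into author, name, version, suffix."""
    author, _, basename = distfile.rpartition("/")
    suffix = next((s for s in SUFFIXES if basename.endswith(s)), "")
    stem = basename[: len(basename) - len(suffix)] if suffix else basename
    # Assumption: the version is the final hyphen-separated chunk and
    # starts with a digit (optionally preceded by "v").
    match = re.match(r"^(?P<name>.+)-(?P<version>v?\d[^-]*)$", stem)
    if match:
        name, version = match.group("name"), match.group("version")
    else:
        name, version = stem, None
    return {"author": author, "name": name, "version": version, "suffix": suffix}


print(parse_distfile("DAGOLDEN/Foo-Bar-1.23.tar.gz"))
```

Only the author and the full basename together identify an upload; the "name" field on its own is what the rest of this message is about.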

Distributions contain modules (.pm files), which contain packages
(namespaces declared by "package NAME").  PAUSE indexes packages and
associates them with a source distribution.  PAUSE has a system of
permissions for packages and ensures that distributions are unique.

## Background ##

Yesterday, I conducted a test of the CPAN ecosystem by uploading two
distributions:

    * DAGOLDEN/Acme-CPAN-Testers-UNKNOWN-0.03.tar.gz
    * DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz

These were intentionally constructed to have the same "distribution
name" as these existing distributions:

    * BINGOS/Acme-CPAN-Testers-UNKNOWN-0.02.tar.gz
    * BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz

Note that one has the same version number and one does not.  And thank
you to BinGOs for volunteering his Acme modules.

While my distributions had the same "name", they contained completely
different, unindexed packages.  BinGOs and I do not share co-maint on
any of the packages involved.

I observed the following after PAUSE accepted the distributions and
indexed the packages:

(1) metacpan.org and search.cpan.org incorrectly linked my
distributions to BinGOs'.  E.g. they both believe the latest
"Acme-CPAN-Testers-UNKNOWN" is mine, though their contents and primary
maintainers are completely different.

(2) rt.cpan.org treated both distributions as having the same RT
queue. I gained administrative access to BinGOs' existing queues.

(3) cpantesters.org treated both distributions as one for the purpose
of aggregating test reports.[1]


## Implications ##

The first observation is probably annoying but not outright dangerous.
I don't think any installers use latest-release data to guess which
tarball to install.  However, one could, for instance, upload a
distribution that duplicates the name of a popular one but carries a
much higher version number, and it could effectively mask the original
for some types of queries or web requests.

The second observation is a security hole.  Anyone can gain
administrative rights to an RT queue simply by uploading a
distribution of the same name.  (I don't know if RT rejects
distributions containing unauthorized packages, but it really doesn't
matter as far as threat vectors go.)

The third observation is also annoying but not dangerous.  Someone
could intentionally or unintentionally upload a duplicate distribution
name and pollute existing test results.

Like many things on CPAN, if we sort of trust everyone to act
decently, we can probably ignore this, just like we ignore all the
*.PL files that run arbitrary code on installation.

However, all of these point to the same underlying flaw: using a
non-unique data element as a unique key.  This creates a common point
of failure.
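
As an illustration of that flaw, the sketch below indexes the four distfiles from the test above in two ways: by distribution name and by distfile.  The helper `dist_name` is hypothetical and assumes every file ends in `.tar.gz` with the version after the last hyphen; it stands in for whatever each downstream service really does.

```python
from collections import defaultdict

# The four distfiles involved in the test described above.
DISTFILES = [
    "BINGOS/Acme-CPAN-Testers-UNKNOWN-0.02.tar.gz",
    "BINGOS/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
    "DAGOLDEN/Acme-CPAN-Testers-UNKNOWN-0.03.tar.gz",
    "DAGOLDEN/Acme-CPAN-Testers-FAIL-0.02.tar.gz",
]


def dist_name(distfile):
    """Hypothetical helper: drop the author, '.tar.gz' suffix and version."""
    basename = distfile.rsplit("/", 1)[-1]
    stem = basename[: -len(".tar.gz")]
    return stem.rsplit("-", 1)[0]


# Keyed by distribution name: uploads from different authors share a key.
by_name = defaultdict(list)
for distfile in DISTFILES:
    by_name[dist_name(distfile)].append(distfile)

# Keyed by distfile: every upload keeps its own entry.
by_distfile = {distfile: distfile for distfile in DISTFILES}

collisions = {name: files for name, files in by_name.items() if len(files) > 1}
for name, files in sorted(collisions.items()):
    print(name, "->", files)
```

Any service that hangs a queue, a release page or a report bucket off the first kind of key inherits the collision.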

## Solutions ##

Here's where I start brainstorming.  If we can get some good
discussion on this list, then maybe we could finalize a plan at the QA
hackathon, which will have a number of the relevant
maintainers/administrators attending.

(a) We could do nothing; we've lived with it and can continue to live
with it, policing any incidents on a one-off basis.

(b) We could extend PAUSE's permission system to distribution names as
well, so that distribution names would have primary/co-maint rights
just as packages do.  This would not fix any existing duplicates, but
would prevent future infractions.  It means changing a lot of PAUSE
code, but would allow RT, search sites and CPAN Testers (CT) to pretty
much remain as is.

(c) We could restrict PAUSE to allow only "well formed" distribution
names[2] -- ones that match a module inside the distribution, which in
turn declares a package of the corresponding name.  E.g. "Foo-Bar",
containing "Foo/Bar.pm" with package "Foo::Bar".  The existing package
permissions system becomes the chokepoint to restrict abuse.  Existing
distributions with non-conforming names (e.g. libwww-perl) either
change for their next release or get grandfathered somehow.

(d) RT, search sites and CT stop using distribution name as a key and
revert either to package names or to distfile in some fashion.  This
is not a trivial amount of code change and -- in the case of RT --
might make RT much more complicated and less useful.

(e) We could develop a new, unique way to identify collections of
related packages.  This could be based on some combination of
distribution name and the name of an authorized package it contains,
or perhaps just on the name of a "primary" package.  RT and the search
sites would need to migrate to a new data model and probably change
their HTTP routes to match.

(f) Something else
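
To give option (c) some shape, here is a rough Python sketch of the check it implies.  Everything in it is an assumption for discussion: the function names, the idea that module paths are given relative to lib/, the made-up PAUSE IDs "ALICE" and "BOB", and the rule that a package with no permissions entry is open to the first uploader.  It is not PAUSE's code.

```python
def expected_module_path(dist_name):
    """'Foo-Bar' -> 'Foo/Bar.pm' (path assumed relative to lib/)."""
    return dist_name.replace("-", "/") + ".pm"


def expected_package(dist_name):
    """'Foo-Bar' -> 'Foo::Bar'."""
    return dist_name.replace("-", "::")


def is_well_formed(dist_name, packages_by_file):
    """True if the matching module exists and declares the matching package.

    packages_by_file maps a module path to the set of packages it declares.
    """
    declared = packages_by_file.get(expected_module_path(dist_name), set())
    return expected_package(dist_name) in declared


def may_upload(uploader, dist_name, packages_by_file, permissions):
    """Reuse package permissions as the chokepoint for distribution names.

    permissions maps a package name to the set of authorized PAUSE IDs.
    Assumption: a package with no entry is first-come, so anyone may claim it.
    """
    if not is_well_formed(dist_name, packages_by_file):
        return False
    owners = permissions.get(expected_package(dist_name))
    return owners is None or uploader in owners


files = {"Foo/Bar.pm": {"Foo::Bar"}}
perms = {"Foo::Bar": {"ALICE"}}  # hypothetical PAUSE IDs
print(may_upload("ALICE", "Foo-Bar", files, perms))
print(may_upload("BOB", "Foo-Bar", files, perms))
print(is_well_formed("libwww-perl", {"LWP.pm": {"LWP"}}))
```

The last call shows why existing names such as libwww-perl would need grandfathering: under this rule there is no module that could make them well formed.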

I welcome some thoughts and discussion.  Even if the ultimate
conclusion is (a), I think it would be best to select that
intentionally, not default to it through apathy.

-- David

Notes:

[1] The Metabase backend for CT correctly distinguishes reports by
distfile, but this is not yet reflected downstream in reporting.

[2] See 
http://www.dagolden.com/index.php/308/packages-modules-and-distributions/

-- 
David Golden <x...@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg
