On 3/13/13 2:31 PM, David Golden wrote:
> tl;dr: non-unique distribution names are annoying and create a
> security hole on rt.cpan.org.  Fixing it may not be trivial.

+1

"Distributions", releases of a single project, are largely informal
entities, yet they're a basic part of CPAN's structure.  It would be
good to normalize and formalize them.


> ## Terminology and context ##
> 
> By "distribution", I generally mean the unique path of a CPAN
> distribution in the authors/id/X/XY directory.  I may occasionally
> refer to this as a "distfile" for specific clarity.  The "distribution
> name" is the portion of the basename without version or suffix.
> 
>     distribution: DAGOLDEN/Foo-Bar-1.23.tar.gz
>     author: DAGOLDEN
>     distribution name: Foo-Bar
>     version: 1.23
>     suffix: .tar.gz
> 
> Distributions contain modules (.pm files), which contain packages
> (namespaces declared by "package NAME").  PAUSE indexes packages and
> associates them with a source distribution.  PAUSE has a system of
> permissions for packages and ensures that distributions are unique.

I'd suggest that what you describe is a *release* of a distribution.

Here's how BackPAN::Index does it...

A distribution has a name and a list of releases.

    distribution:
        id: Foo-Bar
        releases:
            - DAGOLDEN/Foo-Bar-1.23.tar.gz
            - DAGOLDEN/Foo-Bar-1.22.tar.gz
            - MORBO/Foo-Bar-1.00.tar.gz
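
As a rough illustration of that grouping, here is a minimal Python sketch.
It is not BackPAN::Index's actual code: the regex, `parse_release` and
`group_by_distribution` are hypothetical names, and it assumes every
release path looks like AUTHOR/Name-Version.suffix, which real CPAN
filenames often do not.

```python
import re
from collections import defaultdict

# Assumed shape: AUTHOR/Dist-Name-1.23.tar.gz. Real distfiles are messier.
RELEASE_RE = re.compile(
    r"^(?P<author>[A-Z][A-Z0-9-]*)/"
    r"(?P<name>.+)-(?P<version>v?\d[\d._]*)"
    r"(?P<suffix>\.tar\.gz|\.tgz|\.zip)$"
)

def parse_release(path):
    """Split a release path into author, name, version and suffix."""
    m = RELEASE_RE.match(path)
    if m is None:
        raise ValueError(f"unrecognized release path: {path}")
    return m.groupdict()

def group_by_distribution(paths):
    """Map each distribution name to the release paths that claim it."""
    dists = defaultdict(list)
    for path in paths:
        dists[parse_release(path)["name"]].append(path)
    return dict(dists)
```

Note that the grouping keys on the name alone, so releases by different
authors land in the same distribution.  That is exactly the ambiguity
under discussion.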

It is effectively "the project" and may make more sense to call it
"project" to avoid ambiguity over "distribution".

Releases have a file (which is the same as the identifier, but does not
have to be), an author (really "releaser"), a version and a
distribution.  They have other stuff, but this is enough to get the
basic release vs distribution relationship.

    release:
        id: DAGOLDEN/Foo-Bar-1.23.tar.gz
        releaser: DAGOLDEN
        version: 1.23
        distribution: Foo-Bar
        provides:
            "Foo::Bar": 1.23
            "Foo::Bar::Baz": 1.23

Currently the release contains most of the meta information about the
distribution such as mailing list, stability, contact info and version
control.  It may make sense to move the formal project meta data into
the distribution, but keep the existing way of updating it: include it
with the latest release.  Effectively, most of the project meta data is
aliased to the latest release.
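
A small sketch of that aliasing, in Python for brevity.  The class and
field names are made up for illustration, and "latest" is assumed to mean
the most recently added release rather than the highest version number.

```python
from dataclasses import dataclass, field

@dataclass
class Release:
    id: str            # e.g. "DAGOLDEN/Foo-Bar-1.23.tar.gz"
    releaser: str
    version: str
    distribution: str  # name of the distribution this release belongs to
    provides: dict = field(default_factory=dict)
    meta: dict = field(default_factory=dict)  # mailing list, stability, ...

@dataclass
class Distribution:
    id: str
    releases: list = field(default_factory=list)  # oldest first (assumed)

    def add_release(self, release):
        if release.distribution != self.id:
            raise ValueError(f"{release.id} does not belong to {self.id}")
        self.releases.append(release)

    @property
    def meta(self):
        """Project meta data is whatever the latest release shipped."""
        return self.releases[-1].meta if self.releases else {}
```

Consumers ask the distribution for its meta data and never need to know
which release supplied it.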


> I observed the following after PAUSE accepted the distributions and
> indexed the packages:
> 
> (1) metacpan.org and search.cpan.org incorrectly linked my
> distributions to BinGOs'.  E.g. they both believe the latest
> "Acme-CPAN-Testers-UNKNOWN" is mine, though their contents and primary
> maintainers are completely different.
> 
> (2) rt.cpan.org treated both distributions as having the same RT
> queue. I gained administrative access to BinGOs' existing queues.
> 
> (3) cpantesters.org treated both distributions as one for the purpose
> of aggregating test reports[1]

These are good observations about how (not) easy it is to get permission
information out of CPAN/PAUSE.


> ## Solutions ##
> 
> Here's where I start brainstorming.  If we can get some good
> discussion on this list, then maybe we could finalize a plan at the QA
> hackathon, which will have a number of the relevant
> maintainers/administrators attending.
> 
> (a) We could do nothing; we've lived with it and can continue to live
> with it and will police any incidents on a one-off basis

In several projects (Gitpan and BackPAN::Index being two) I've found
that working out what a "distribution" is means either a lot of effort
to get it right or living with a large number of broken distribution
lists.  Working with CPAN would be much easier if discovering a
distribution and its releases were easy and reliable.

Which is to say, this is not just a security problem.  The cost of our
messy concept of distributions is a barrier to doing interesting things
with CPAN.


> (b) We could extend PAUSE's permission system to distribution names as
> well, so that distribution names would have primary/co-maint rights
> just as packages do.  This would not fix any existing duplicates, but
> would prevent future infractions.  It means changing a lot of PAUSE
> code, but would allow RT, search sites and CPAN Testers (CT) to pretty
> much remain as is.

+1

IMO this is a necessary piece of missing CPAN meta data which everyone
else has to piece together again and again.

We could also retroactively fix duplicates, once and for all, as they
are reported.


> (c)  We could restrict PAUSE to allow only "well formed" distribution
> names[2] -- ones matching a module name inside containing a package of
> a corresponding name.  E.g. "Foo-Bar", containing "Foo/Bar.pm" with
> package "Foo::Bar".  The existing package permissions system becomes
> the chokepoint to restrict abuse.

-1  I don't think this is necessary if B is in place, and B is a much
better solution.

We've always had a policy of being very liberal with what we allow.  Not
everything is a Perl library, and if it isn't, PAUSE will not try to
index it.  I'm ok with that.

If you have something which falls outside the normal structure (say, for
some reason Foo-Bar-X.YZ.tar.gz doesn't have a lib/Foo/Bar.pm), the meta
data would be trusted.  If it says it's release X.YZ of the Foo-Bar
distribution, then it is.  The permissions system in B protects the rest,
and a permissions/distribution API lets external sites query it.
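
To sketch what B might look like, here is a toy Python model.  Nothing in
it is PAUSE code: `DistributionPermissions` and its methods are
hypothetical, and the first-come rule for unclaimed names is an assumption
borrowed from how package permissions are generally understood to work.

```python
class DistributionPermissions:
    """Toy model: primary and co-maint rights keyed by distribution name."""

    def __init__(self):
        self._primary = {}   # distribution name -> primary maintainer
        self._comaints = {}  # distribution name -> set of co-maintainers

    def primary(self, dist_name):
        """Query side: what an external site such as RT would ask."""
        return self._primary.get(dist_name)

    def may_release(self, author, dist_name):
        owner = self._primary.get(dist_name)
        if owner is None:
            return True  # unclaimed name: first come, first served (assumed)
        return author == owner or author in self._comaints.get(dist_name, set())

    def record_release(self, author, dist_name):
        """Accept a release whose meta data claims dist_name, or refuse it."""
        if not self.may_release(author, dist_name):
            raise PermissionError(f"{author} may not release {dist_name}")
        self._primary.setdefault(dist_name, author)

    def add_comaint(self, granter, dist_name, comaint):
        if self._primary.get(dist_name) != granter:
            raise PermissionError(f"{granter} is not primary on {dist_name}")
        self._comaints.setdefault(dist_name, set()).add(comaint)
```

The check never looks inside the tarball.  It only compares the claimed
distribution name against the permissions table, which is why C would not
be needed on top of it.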


> Existing distributions with
> non-conforming names (e.g. libwww-perl) either change for their next
> release or get grandfathered somehow.

I'm happy to grandfather in existing packages, especially major ones
like libwww-perl where people have learned to look for
libwww-perl-X.YZ.tar.gz and not LWP-X.YZ.tar.gz.


> (d) RT, search sites and CT stop using distribution name as a key and
> revert either to package names or to distfile in some fashion.  This
> is not a trivial amount of code change and -- in the case of RT --
> might make RT much more complicated and less useful.

-1

The distribution name is still a good identifier and I'd rather see the
distribution meta problem solved.


> (e) We could develop a new, unique way to identify collections of
> related packages.  This could be based on some combination of
> distribution name and the name of an authorized packages it contains,
> or perhaps just on the name of a "primary" package.  RT and the search
> sites would need to migrate to a new data model and probably change
> their HTTP routes to match.

-1

Sounds complicated and unnecessary.  It's hard for humans to express.
The set of authorized packages and who did the release change from
release to release.  Colons give some filesystems (OS X) indigestion.
