[proposal] repository URI format

Leo Simons 8 Mar 2003 14:35:47 -0000

Hi all,

just read what's in the archive until now. I've summarized (well, not
really summarized, more elongated) the discussions up till now, added
my own thoughts, done some research, and then I came to a conclusion. I
suggest y'all rip this apart, put it together again, (it's in
wiki-compatible format :D) and then someone tallies a vote on whatever
list is appropriate.

cheers,

- Leo

        = THE URI FORMAT FOR A SOFTWARE ARTIFACT REPOSITORY =

= Conclusion =

I'll provide my conclusion first, as this is rather a lot of text :D

When the following is known, the URI for any software distribution
artifact is uniquely specified:

* <FQDN>         - fully qualified domain name of the repository as
                   defined in the URL spec
* <protocol>     - <scheme> as defined in the URI spec
* <base>         - base directory on the machine identified by <FQDN>
                   (probably relative to documentroot), preferably
                   consisting of lowercase letters, dashes and slashes
* <organisation> - the inverse of the domain name of the organisation
                   that produces the artifact
* <project>      - the division/group within the organisation that
                   produces the artifact, preferably consisting of
                   lowercase letters, dashes and slashes, with a website
                   at http://<project>.<organisation>/
* <name>         - the name of the artifact (unique within the
                   <project>), preferably consisting of lowercase
                   letters and dashes
* <type>         - the filetype of the artifact, consisting of whatever
                   part of the artifact filename normally identifies the
                   filetype
* <version>      - the version of the artifact, consisting of any set of
                   characters allowed in a URI, augmented with any
                   information about software or hardware platform
                   requirements if not normally part of the version,
                   preferably consisting of numbers, letters, dashes and
                   points

and for various reasons detailed below I think the URI should be
composed based on the above as follows:

<protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
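
A minimal sketch of that composition in Python, for illustration only. The
function name build_uri and its parameter names are made up for this
example; it assumes <artifact> = <name>-<version>.<type> (the option argued
for further down) and that <base> is either empty or ends with a slash.

```python
def build_uri(protocol, fqdn, base, organisation, project, name, type_, version):
    """Compose a repository URI from its parts (illustrative sketch only)."""
    # Assumption: <artifact> = <name>-<version>.<type>
    artifact = f"{name}-{version}.{type_}"
    # Assumption: <base> is "" or ends in "/", so it is concatenated directly
    return (f"{protocol}://{fqdn}/{base}{organisation}/{project}/"
            f"{name}/{type_}s/{artifact}")


uri = build_uri("http", "www.apache.org", "dist/repository/",
                "org.apache", "jakarta", "commons-logging", "jar", "1.0")
print(uri)
```

With these inputs the result should be the same commons-logging URI that is
worked out by hand later in this text.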

= Goal of the Discussion =

The goal of this part of the discussion is to define additional
constraints/ guidelines above and beyond the URI specification to make
it possible to uniquely and unambiguously define the location of a
software distribution artifact, when the following are known:

* the /FQDN/ of the repository (ie repo.apache.org)
* the /protocol/ used to access the repository (ie http)
* the /base/ directory of the repository (ie /dist/repository or /)
* the /organisation/ that produces the artifact
* the /project/ within the organisation that produces the artifact
* the /name/ of the artifact
* the distribution /type/ of the artifact
* the /version/ of the artifact

= Requirements =

* The URI should be stable
* The URI should be easy to generate by humans and machines when
  the above listed items are known
* The URI should be unique based on the uniqueness of the above
  listed items, ie:

if ! otherURI.FQDN == thisURI.FQDN
   return false
elseif ! otherURI.protocol == thisURI.protocol
   return false
elseif ! otherURI.basedir == thisURI.basedir
   return false
elseif ! otherURI.organisation == thisURI.organisation
   return false
elseif ! otherURI.project == thisURI.project
   return false
elseif ! otherURI.name == thisURI.name
   return false
elseif ! otherURI.version == thisURI.version
   return false
else
   return true

* the part of the URI not containing FQDN, protocol and basedir should
  be common across repositories, ie it is desirable that an artifact
  identified by

** the organisation that produces the artifact
** the project within the organisation that produces the artifact
** the name of the artifact
** the version of the artifact

  can be found on any repository by substituting the repository FQDN,
  protocol and basedir from the current URI
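
The last requirement amounts to swapping a repository-specific prefix while
leaving the rest of the URI alone. A rough Python sketch, under the
assumption that the prefix is exactly <protocol>://<FQDN>/<base>; the
function name relocate and the mirror host mirror.example.org are
hypothetical.

```python
def relocate(uri, old_prefix, new_prefix):
    """Point a URI at another repository by swapping protocol, FQDN and base."""
    if not uri.startswith(old_prefix):
        raise ValueError("URI does not live under " + old_prefix)
    # Everything after the prefix (organisation, project, name, ...) is kept
    return new_prefix + uri[len(old_prefix):]


original = ("http://www.apache.org/dist/repository/"
            "org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar")
# mirror.example.org is a made-up mirror with an empty base directory
mirrored = relocate(original,
                    "http://www.apache.org/dist/repository/",
                    "ftp://mirror.example.org/")
print(mirrored)
```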

= Proposals =

== Base Identification conventions ==

base will often be "", but in the case of mirrors mirroring many
repositories (ie ibiblio), that might be impractical, in which case
I suggest the base is whatever maps to a directory on the filesystem
the repository is using (ie whatever ext2/3, fat32, etc. accept
as a directory identifier).

== Organisation Identification conventions ==

It has been suggested that the identification of the organisation is
done by reverse domain names, ie "org.apache", "org.sun" and "com.ibm".

It has also been suggested that the organisation is not identified
separately (ie as is current practice on http://www.ibiblio.org/maven/).

== Project Identification conventions ==

It has been suggested that the identification of a project is done by
lowercase letters separated by dashes, ie jakarta-commons.

I have seen no suggestions as to how the apache project structure should
map into the project names in the repository, IOW, is the project part
of commons-logging.jar to be "jakarta", "jakarta-commons", or
"jakarta-commons-logging"? My suggestion is that the project structure
mapping is based on top-level-projects (ie *.apache.org), so the answer
to that question is "jakarta".

In the context of sourceforge, the project identification would map
similarly, ie the convention of ${projectname}.${host}.org would lead to
project names of "jboss", "jedit", etc. Hence this sounds like a smart
mapping to me.
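
A small Python sketch of this mapping, deriving <organisation> and <project>
from a project website host. The function name split_project_host is
hypothetical, and the sketch assumes the simple case where the first label
is the project and the remaining labels form the organisation's domain.

```python
def split_project_host(host):
    """Split '<project>.<organisation domain>' into (organisation, project)."""
    labels = host.lower().split(".")
    if len(labels) < 3:
        raise ValueError("expected <project>.<domain>.<tld>: " + host)
    # Assumption: the first label names the top-level project
    project = labels[0]
    # The organisation is the remaining domain with its labels reversed
    organisation = ".".join(reversed(labels[1:]))
    return organisation, project


print(split_project_host("jakarta.apache.org"))
```

For jakarta.apache.org this gives the organisation "org.apache" and the
project "jakarta", matching the answer suggested above.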

== Artifact Naming conventions ==

It has been suggested that the name of the artifact is to be determined
by the project providing the artifact, so that the "jakarta" project
determines what artifact name it will associate with the subsubproject
http://jakarta.apache.org/commons/logging. Of course, a project could
choose to delegate such a choice to a subproject or subsubproject; I
suggest we do not try and define who makes the artifact name choice
within a project :D

It has been suggested that the name of the artifact is to be comprised
of lowercase letters separated by dashes, ie commons-logging.

== Versioning conventions ==

I have seen no suggestions with regard to versioning. I assume everyone
agrees that the format of a version is determined by a project, though
the recommended practice is that a version is comprised of numbers
separated by dashes and dots, and optionally containing lowercase
letters identifying part of the development cycle, ie

* 1.0
* 1.0a
* 1.0-alpha
* 1.0-alpha-1
* 08032003
* 03082003
* 2003-03-08
* SNAPSHOT-03.08.2003

are all acceptable, and the choice is made to conform to the version
numbering used by whoever supplies the artifact.
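
As a loose illustration, the recommended practice could be checked with a
regular expression. The pattern below is only an assumption about what
"numbers, letters, dashes and dots" means in practice; it also allows
uppercase letters, since the SNAPSHOT example above uses them. The names
VERSION_PATTERN and looks_like_version are made up.

```python
import re

# Assumption: runs of digits/letters, separated by single dashes or dots
VERSION_PATTERN = re.compile(r"[0-9A-Za-z]+(?:[.-][0-9A-Za-z]+)*")


def looks_like_version(version):
    """Return True if the whole string follows the recommended practice."""
    return VERSION_PATTERN.fullmatch(version) is not None


examples = ["1.0", "1.0a", "1.0-alpha", "1.0-alpha-1", "08032003",
            "03082003", "2003-03-08", "SNAPSHOT-03.08.2003"]
print(all(looks_like_version(v) for v in examples))
```

This is a recommendation check only; a project remains free to pick any
version string that is valid in a URI.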

== Distribution type conventions ==

It has been suggested that a distribution type is defined by its
three-letter acronym, in lowercase, ie:

jar
war
ear
rpm
tgz
zip

I have not seen other suggestions. I myself suggest a distribution type
is identified by whatever filename component normally represents the
distribution type for a given artifact distribution, ie common types
would be:

jar
war
rpm
tar.gz
tgz
zip

where the use of tar.gz versus the use of tgz depends on the convention
used by the authoritative distributor of the artifact (ie for apache
httpd, the files are provided as .tar.gz, so the distribution type is
tar.gz and not tgz).
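
A possible way to derive the distribution type from a distributor's
filename, sketched in Python. The list of multi-part suffixes is an
illustrative assumption (tar.bz2 is not mentioned in this discussion), and
the name distribution_type and the example filenames are hypothetical.

```python
# Assumption: an illustrative, incomplete list of multi-part suffixes
COMPOUND_TYPES = ("tar.gz", "tar.bz2")


def distribution_type(filename):
    """Return the part of a filename that identifies its distribution type."""
    for compound in COMPOUND_TYPES:
        if filename.endswith("." + compound):
            return compound
    # Otherwise fall back to whatever follows the last dot
    stem, dot, suffix = filename.rpartition(".")
    if not dot or not stem or not suffix:
        raise ValueError("no type suffix in " + filename)
    return suffix


print(distribution_type("example-1.0.tar.gz"))
print(distribution_type("example-1.0.tgz"))
```

Note that tar.gz and tgz stay distinct here, following whatever the
authoritative distributor uses.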

== The URI format ==

(refer to http://www.ietf.org/rfc/rfc2396.txt; note we can make the
assumption <protocol> == <scheme>)

Adopting the convention <thing> to identify the parts of the URI, I
have seen the following suggestions:

* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<version>/<artifact>
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
* <protocol>://<FQDN>/<base><name>/<type>s/<artifact>

where proposals for the format of <artifact> can be any of

* <artifact> = <name>-<version>.<type>
* <artifact> = <name>.<type>
* <artifact> = ANY_VALID_URI_CHARACTERS
* <artifact> = <name>-<version>.<type> | <name>.<type>

=== The current maven repository format ===

Maven uses two different setups:

<artifact> = <name>-<version>.<type>
<protocol>://<FQDN>/<base><name>/<type>s/<artifact>

if <type> == jar and

<artifact> = <name>-<version>.<type>
<protocol>://<FQDN>/<base><name>/distributions/<artifact>

if <type> == zip || <type> == tar.gz

I don't think it provides other <type>s in its repo atm.
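
The two setups described above can be written down as one function. This is
a sketch of the layout as described in this mail, not of any actual Maven
code; the function name maven_uri and the example values are assumptions.

```python
def maven_uri(protocol, fqdn, base, name, version, type_):
    """Build a URI following the maven repository layout described above."""
    artifact = f"{name}-{version}.{type_}"
    if type_ == "jar":
        directory = f"{type_}s"
    elif type_ in ("zip", "tar.gz"):
        directory = "distributions"
    else:
        # Other types are not covered by the description above
        raise ValueError("no known maven layout for type " + type_)
    return f"{protocol}://{fqdn}/{base}{name}/{directory}/{artifact}"


print(maven_uri("http", "www.ibiblio.org", "maven/",
                "commons-logging", "1.0", "jar"))
```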

=== How to choose a format ===

I think we should start with taking into account

if ! otherURI.FQDN == thisURI.FQDN
   return false
elseif ! otherURI.protocol == thisURI.protocol
   return false
elseif ! otherURI.basedir == thisURI.basedir
   return false
elseif ! otherURI.organisation == thisURI.organisation
   return false
elseif ! otherURI.project == thisURI.project
   return false
elseif ! otherURI.name == thisURI.name
   return false
elseif ! otherURI.version == thisURI.version
   return false
else
   return true

And once we have that settled, we should choose a layout which does
not duplicate information, in order to keep the URI short, ie I cannot
see why it is a good idea to specify (.*)<version>(.*)<version>(.*) for
putting the version in the URI.

The next choice is between

* <artifact> = <name>-<version>.<type>
* <artifact> = <name>.<type>
* <artifact> = ANY_VALID_URI_CHARACTERS
* <artifact> = <name>-<version>.<type> | <name>.<type>

and when that is settled we can determine the rest of the URI.

Note that the choice of <artifact> is important, as this is what most
applications will provide as the normal name for the user to save the
files.

=== My case for <artifact> ===

The advantage of ANY_VALID_URI_CHARACTERS is that it reduces the need
for renaming of files when included in the repository: one can just use
the same filename as provided by the original artifact distributor.

The big disadvantage is that this doesn't satisfy the requirement that a
URI should be identified as detailed above: you need to know <artifact>
in addition to all the other information. While this is easily solved
using metainformation or introspection (in the case of machines), I
think it makes a URI much harder to guess for a human, and is hence
inconvenient.

This argument also applies to <name>-<version>.<type> | <name>.<type>,
though less so because you have to guess from only two possibilities.
However, you still need to guess, defeating the "U" in URI.

So I suggest we choose either

* <artifact> = <name>-<version>.<type>

or

* <artifact> = <name>.<type>

where my preference is for the former based on the dominant practice in
distribution repository setup (re: maven, rpm, apt, ports, cpan, pear).

=== My case for the entire URI ===

==== Common Ground ====

I think everyone agrees that the first part of the URI needs to be

<protocol>://<FQDN>/<base>

so let's start from that. What follows is based on the principle that the
URI should be as short as possible and simple to remember, and contain no
duplicate information, and on the assumption that

* <artifact> = <name>-<version>.<type>

==== My Preference ====

My preference is for

* <protocol>://<FQDN>/<base><organisation>/<project>/<artifact>

so the below information

FQDN = www.apache.org
protocol = http
base = dist/repository/
organisation = org.apache
project = jakarta
name = commons-logging
type = jar
version = 1.0

results in a URI of

http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar

==== Coping with filesystem limits ====

however, the potential danger here is that the project "jakarta" might
distribute 100s of files (which it does), resulting in a very long list
of files contained in the "jakarta" directory on the server, resulting
in too much output when visiting

http://www.apache.org/dist/repository/org.apache/jakarta/

with a normal browser (a problem common when browsing RPM repositories,
for example). To avoid that, I suggest we make the URI a bit longer by
repeating the <name> and <type> elements:

* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>

resulting in

http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar

the choice of <name> as a repetition element is I think accepted by all.
The rationale is that a user visiting

http://www.apache.org/dist/repository/org.apache/jakarta

will know what project he is looking for, but not necessarily what
version ("just give me the latest") or what type ("I'll take whatever
you got, my tool can decompress anything").

The choice of one of

* <type>
* <type>s
* <type>/<version>
* <version>/<type>
* <version>/<type>s
* <version>

is less easy. I somewhat doubt that using either <type> or <version>
will result in very long lists of files in a single directory, so I
can't think of much of an argument for choosing between those. Using
both of them is ruled out, I'd say, by wanting a short URI.

So, <version> or <type>? Based on looking at the setup used by rpm and
maven, I think the most common practice is <type>s, so I suggest we
go with that.

= We forgot something: architecture, os, language! =

Since we're mostly java developers, we don't need to worry about
architecture. However, for a general convention, we should take into
account other languages, like C and C++, which often result in specific
binaries. Even for java, there often are windows and linux-specific
versions (though I know of no java package for 386 as opposed to 686
architecture).

Architecture can be split into operating system and hardware platform,
though there is often some or a lot of overlap. Let's call the hardware
platform "architecture", and the operating system "os".

Then there's the case of languages: many software packages are not
multi-lingual, and specific versions are provided for many different
languages.

I suggest we wrap "architecture", "os" and "language" into "version",
allowing distributors to figure out for themselves how to differentiate
between the various options. This makes life easier for java developers
and doesn't change the mess for other developers.
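
One way a distributor might fold these into <version>, sketched in Python.
The ordering (os, then architecture, then language), the dash separator,
the function name qualified_version and the example values are all
assumptions; the proposal deliberately leaves this choice to distributors.

```python
def qualified_version(version, os_name=None, architecture=None, language=None):
    """Append optional platform and language qualifiers to a version string."""
    parts = [version]
    # Assumption: qualifiers are appended in a fixed order, joined by dashes
    for qualifier in (os_name, architecture, language):
        if qualifier:
            parts.append(qualifier)
    return "-".join(parts)


print(qualified_version("1.0"))
print(qualified_version("1.0", os_name="linux", architecture="i686"))
```

A plain java artifact keeps its bare version, which is the "makes life
easier for java developers" part.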

I couldn't find a common pattern anyway. Many linux vendors separate on
language early (and then there's this dumb "en" directory with no
friends, as everyone uses english anyways), don't separate on os (being
all about a single os after all), and separate on architecture after
having separated on type. But even here things are inconsistent:
just look at the language packs for KDE in a subsubsubdirectory of
/en/ in the case of RedHat.

Apache HTTPD does not separate on language, but separates binaries early
on, then includes the architecture as part of the version.

So there's no lesson to learn from prior art other than that it is a bit
messy :D
