Hi all,
just read what's in the archive until now. I've summarized (well, not really summarized, more elongated) the discussions up till now, added my own thoughts, done some research, and then came to a conclusion. I suggest y'all rip this apart, put it together again (it's in wiki-compatible format :D), and then someone tallies a vote on whatever list is appropriate.
cheers,
- Leo
= THE URI FORMAT FOR A SOFTWARE ARTIFACT REPOSITORY =
= Conclusion =
I'll provide my conclusion first, as this is rather a lot of text :D
When the following is known, the URI for any software distribution artifact is uniquely specified:
* <FQDN> - fully qualified domain name of the repository as
defined in the URL spec
* <protocol> - <scheme> as defined in the URI spec
* <base> - base directory on the machine identified by <FQDN>
(probably relative to documentroot), preferably
consisting of lowercase letters, dashes and slashes
* <organisation> - the inverse of the domain name of the organisation
that produces the artifact
* <project> - the division/group within the organisation that
produces the artifact, preferably consisting of
lowercase letters, dashes and slashes, with a website
at http://<project>.<organisation>/
* <name> - the name of the artifact (unique within the
<project>), preferably consisting of lowercase
letters and dashes
* <type> - the filetype of the artifact, consisting of whatever
part of the artifact filename normally identifies the
filetype
* <version> - the version of the artifact, consisting of any set of
characters allowed in a URI, augmented with any
information about software or hardware platform
requirements if not normally part of the version,
preferably consisting of numbers, letters, dashes and
points

and for various reasons detailed below I think the URI should be composed based on the above as follows:
<protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
= Goal of the Discussion =
The goal of this part of the discussion is to define additional constraints/guidelines above and beyond the URI specification to make it possible to uniquely and unambiguously define the location of a software distribution artifact, when the following are known:
* the /FQDN/ of the repository (eg repo.apache.org)
* the /protocol/ used to access the repository (eg http)
* the /base/ directory of the repository (eg /dist/repository or /)
* the /organisation/ that produces the artifact
* the /project/ within the organisation that produces the artifact
* the /name/ of the artifact
* the distribution /type/ of the artifact
* the /version/ of the artifact
= Requirements =
* The URI should be stable
* The URI should be easy to generate by humans and machines when the above listed items are known
* The URI should be unique based on the uniqueness of the above listed items, ie:
if ! otherURI.FQDN == thisURI.FQDN return false
elseif ! otherURI.protocol == thisURI.protocol return false
elseif ! otherURI.basedir == thisURI.basedir return false
elseif ! otherURI.organisation == thisURI.organisation return false
elseif ! otherURI.project == thisURI.project return false
elseif ! otherURI.name == thisURI.name return false
elseif ! otherURI.version == thisURI.version return false
else return true
* the part of the URI not containing FQDN, protocol and basedir should be common across repositories, ie it is desirable that an artifact identified by
** the organisation that produces the artifact
** the project within the organisation that produces the artifact
** the name of the artifact
** the version of the artifact
can be found on any repository by substituting the repository FQDN, protocol and basedir from the current URI
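The substitution described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Repository` class, the `relocate` function and the host mirror.example.org are made up for the example and are not part of any proposal on the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repository:
    """The repository-specific part of a URI (hypothetical helper type)."""
    protocol: str  # eg "http"
    fqdn: str      # eg "www.apache.org"
    base: str      # "" or a directory ending in "/", eg "dist/repository/"

    def prefix(self) -> str:
        return f"{self.protocol}://{self.fqdn}/{self.base}"


def relocate(uri: str, current: Repository, target: Repository) -> str:
    """Swap the repository-specific prefix of `uri` for that of `target`."""
    prefix = current.prefix()
    if not uri.startswith(prefix):
        raise ValueError(f"{uri!r} does not live under {prefix!r}")
    common = uri[len(prefix):]  # the part that is common across repositories
    return target.prefix() + common


apache = Repository("http", "www.apache.org", "dist/repository/")
mirror = Repository("http", "mirror.example.org", "")  # hypothetical mirror
print(relocate(
    "http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar",
    apache,
    mirror,
))
```

Nothing after the prefix is touched, which is the whole point of keeping that part common across repositories.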
= Proposals =
== Base Identification conventions ==
base will often be "", but in the case of mirrors mirroring many repositories (eg ibiblio), that might be impractical, in which case I suggest the base is whatever maps to a directory on the filesystem the repository is using (ie whatever ext2/3, fat32, etc. accept as a directory identifier).
== Organisation Identification conventions ==
It has been suggested that the identification of the organisation is done by reverse domain names, eg "org.apache", "com.sun" and "com.ibm".
It has also been suggested that the organisation is not identified separately (ie as is current practice on http://www.ibiblio.org/maven/).
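For the reverse domain name suggestion, a minimal sketch of the conversion follows. The function name `organisation_id` is hypothetical, and lowercasing the result is my own assumption rather than something that was proposed.

```python
def organisation_id(domain: str) -> str:
    """Turn an organisation's domain name into its reversed form."""
    labels = domain.strip(".").lower().split(".")
    return ".".join(reversed(labels))


print(organisation_id("apache.org"))
print(organisation_id("ibm.com"))
```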
== Project Identification conventions ==
It has been suggested that the identification of a project is done by lowercase letters separated by dashes, eg jakarta-commons.
I have seen no suggestions as to how the apache project structure should map into the project names in the repository, IOW, is the project part of commons-logging.jar to be "jakarta", "jakarta-commons", or "jakarta-commons-logging"? My suggestion is that the project structure mapping is based on top-level-projects (ie *.apache.org), so the answer to that question is "jakarta".
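To illustrate the top-level-project mapping, here is a rough sketch that derives the project name from a project website host. The function `project_from_host` is hypothetical, and it assumes the caller already knows the organisation's own domain (here apache.org).

```python
def project_from_host(host: str, organisation_domain: str) -> str:
    """Return the top-level project for a host below the organisation's domain."""
    host = host.lower().rstrip(".")
    suffix = "." + organisation_domain.lower().strip(".")
    if not host.endswith(suffix):
        raise ValueError(f"{host!r} is not a subdomain of {organisation_domain!r}")
    subdomain = host[: -len(suffix)]
    # Only the label directly below the organisation's domain names the project.
    return subdomain.split(".")[-1]


print(project_from_host("jakarta.apache.org", "apache.org"))
```

Anything deeper than the top-level project (commons, commons-logging) is deliberately ignored, matching the answer "jakarta" above.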
In the context of sourceforge, the project identification would map
similarly, ie the convention of ${projectname}.${host}.org would lead to
project names of "jboss", "jedit", etc. Hence this sounds like a smart
mapping to me.
== Artifact Naming conventions ==
It has been suggested that the name of the artifact is to be determined by the project providing the artifact, so that the "jakarta" project determines what artifact name it will associate with the subsubproject http://jakarta.apache.org/commons/logging. Of course, a project could choose to delegate such a choice to a subproject or subsubproject; I suggest we do not try and define who makes the artifact name choice within a project :D
It has been suggested that the name of the artifact is to be comprised of lowercase letters separated by dashes, eg commons-logging.
== Versioning conventions ==
I have seen no suggestions with regard to versioning. I assume everyone agrees that the format of a version is determined by a project, though the recommended practice is that a version is comprised of numbers separated by dashes and dots, optionally containing lowercase letters identifying part of the development cycle, eg
* 1.0
* 1.0a
* 1.0-alpha
* 1.0-alpha-1
* 08032003
* 03082003
* 2003-03-08
* SNAPSHOT-03.08.2003
are all acceptable, and the choice is made to conform to the versioning number used by whomever supplies the artifact.
== Distribution type conventions ==
It has been suggested that a distribution type is defined by its three-letter acronym, in lowercase, eg:
jar war ear rpm tgz zip
I have not seen other suggestions. I myself suggest a distribution type is identified by whatever filename component normally represents the distribution type for a given artifact distribution, eg common types would be:
jar war rpm tar.gz tgz zip
where the use of tar.gz versus the use of tgz depends on the convention used by the authoritative distributor of the artifact (eg for apache httpd, the files are provided as .tar.gz, so the distribution type is tar.gz and not tgz).
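A small sketch of how a tool might read the distribution type off the distributor's own filename, so that tar.gz stays tar.gz. The `KNOWN_TYPES` tuple just collects the types named above, and both it and the function name are assumptions for the example; the httpd version number is illustrative only.

```python
# Types mentioned in this discussion; a real tool would need a longer list.
KNOWN_TYPES = ("jar", "war", "ear", "rpm", "tar.gz", "tgz", "zip")


def distribution_type(filename: str) -> str:
    """Return the longest known type that the filename ends with."""
    matches = [t for t in KNOWN_TYPES if filename.endswith("." + t)]
    if not matches:
        raise ValueError(f"no known distribution type in {filename!r}")
    return max(matches, key=len)


print(distribution_type("httpd-2.0.47.tar.gz"))
print(distribution_type("commons-logging-1.0.jar"))
```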
== The URI format ==
(refer to http://www.ietf.org/rfc/rfc2396.txt; note we can make the assumption <protocol> == <scheme>)
Adopting the convention <thing> to identify the parts of the URI, I have seen the following suggestions:
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<version>/<artifact>
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
* <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
where proposals for the format of <artifact> can be any of
* <artifact> = <name>-<version>.<type>
* <artifact> = <name>.<type>
* <artifact> = ANY_VALID_URI_CHARACTERS
* <artifact> = <name>-<version>.<type> | <name>.<type>
=== The current maven repository format ===
Maven uses two different setups:
<artifact> = <name>-<version>.<type>
<protocol>://<FQDN>/<base><name>/<type>s/<artifact>
if <type> == jar and
<artifact> = <name>-<version>.<type>
<protocol>://<FQDN>/<base><name>/distributions/<artifact>
if <type> == zip || <type> == tar.gz
I don't think it provides other <type>s in its repo atm.
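The two setups above could be written down roughly as follows. This is a sketch of the layout as I have described it here, not code taken from maven; the function name `maven_uri` and the use of "maven/" as the base on ibiblio are assumptions for the example.

```python
def maven_uri(protocol: str, fqdn: str, base: str,
              name: str, version: str, type_: str) -> str:
    """Build a URI following the two maven setups described above."""
    artifact = f"{name}-{version}.{type_}"
    if type_ == "jar":
        directory = f"{type_}s"        # <type>s, ie "jars"
    elif type_ in ("zip", "tar.gz"):
        directory = "distributions"
    else:
        raise ValueError(f"no known maven layout for type {type_!r}")
    return f"{protocol}://{fqdn}/{base}{name}/{directory}/{artifact}"


print(maven_uri("http", "www.ibiblio.org", "maven/",
                "commons-logging", "1.0", "jar"))
```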
=== How to choose a format ===
I think we should start with taking into account
if ! otherURI.FQDN == thisURI.FQDN return false
elseif ! otherURI.protocol == thisURI.protocol return false
elseif ! otherURI.basedir == thisURI.basedir return false
elseif ! otherURI.organisation == thisURI.organisation return false
elseif ! otherURI.project == thisURI.project return false
elseif ! otherURI.name == thisURI.name return false
elseif ! otherURI.version == thisURI.version return false
else return true
And once we have that settled, we should choose a layout which does not duplicate information, in order to keep the URI short, ie I cannot see why it is a good idea to specify (.*)<version>(.*)<version>(.*) for putting the version in the URI.
The next choice is between
* <artifact> = <name>-<version>.<type>
* <artifact> = <name>.<type>
* <artifact> = ANY_VALID_URI_CHARACTERS
* <artifact> = <name>-<version>.<type> | <name>.<type>
and when that is settled we can determine the rest of the URI.
Note that the choice of <artifact> is important, as this is what most applications will provide as the normal name for the user to save the files.
=== My case for <artifact> ===
The advantage of ANY_VALID_URI_CHARACTERS is that it reduces the need for renaming of files when included in the repository: one can just use the same filename as provided by the original artifact distributor.
The big disadvantage is that this doesn't satisfy the requirement, detailed under Requirements, that a URI be determined by the listed items alone: you need to know <artifact> in addition to all the other information. While this is easily solved using metainformation or introspection (in the case of machines), I think it makes a URI much harder to guess for a human, and is hence inconvenient.
This argument also applies to <name>-<version>.<type> | <name>.<type>, though less so because you have to guess from only two possibilities. However, you still need to guess, defeating the "U" in URI.
So I suggest we choose either
* <artifact> = <name>-<version>.<type>
or
* <artifact> = <name>.<type>
where my preference is for the former based on the dominant practice in distribution repository setup (re: maven, rpm, apt, ports, cpan, pear).
=== My case for the entire URI ===
==== Common Ground ====
I think everyone agrees that the first part of the URI needs to be
<protocol>://<FQDN>/<base>
so let's start from that, together with the principle that the URI should be as short as possible, simple to remember and free of duplicate information, and the assumption that
* <artifact> = <name>-<version>.<type>
==== My Preference ====
My preference is for
* <protocol>://<FQDN>/<base><organisation>/<project>/<artifact>
so the below information
FQDN = www.apache.org
protocol = http
base = dist/repository/
organisation = org.apache
project = jakarta
name = commons-logging
version = 1.0
type = jar
results in a URI of
http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar
==== Coping with filesystem limits ====
However, the potential danger here is that the project "jakarta" might distribute 100s of files (which it does), resulting in a very long list of files in the "jakarta" directory on the server, and too much output when visiting
http://www.apache.org/dist/repository/org.apache/jakarta/
with a normal browser (a problem common when browsing RPM repositories, for example). To avoid that, I suggest we make the URI a bit longer by repeating the <name> and <type> elements:
* <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
resulting in
http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar
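For anyone who prefers code to angle brackets, the longer format can be sketched like this, assuming <artifact> = <name>-<version>.<type> and a base that is either "" or ends in a slash. The function name `artifact_uri` is made up for the example.

```python
def artifact_uri(protocol: str, fqdn: str, base: str, organisation: str,
                 project: str, name: str, version: str, type_: str) -> str:
    """Compose <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>."""
    artifact = f"{name}-{version}.{type_}"
    return (f"{protocol}://{fqdn}/{base}{organisation}/{project}/"
            f"{name}/{type_}s/{artifact}")


print(artifact_uri(
    protocol="http",
    fqdn="www.apache.org",
    base="dist/repository/",
    organisation="org.apache",
    project="jakarta",
    name="commons-logging",
    version="1.0",
    type_="jar",
))
```

With the values from the example above this prints the same URI as shown.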
The choice of <name> as a repetition element is, I think, accepted by all. The rationale is that a user visiting
http://www.apache.org/dist/repository/org.apache/jakarta
will know what project he is looking for, but not necessarily what
version ("just give me the latest") or what type ("I'll take whatever
you got, my tool can decompress anything").

The choice of one of
* <type>
* <type>s
* <type>/<version>
* <version>/<type>
* <version>/<type>s
* <version>
is less easy. I somewhat doubt that using either <type> or <version> will result in very long lists of files in a single directory, so I can't think of much of an argument for choosing between those, while I'd say that rules using both of them out, for reasons of wanting a short URI.
So, <version> or <type>? Based on looking at the setup used by rpm and maven, I think the most common practice is <type>s, so I suggest we go with that.
= We forgot something: architecture, os, language! =
Since we're mostly java developers, we don't need to worry about architecture. However, for a general convention, we should take into account other languages, like C and C++, which often result in specific binaries. Even for java, there often are windows and linux-specific versions (though I know of no java package for 386 as opposed to 686 architecture).
Architecture can be split into operating system and hardware platform, though there is often some or a lot of overlap. Let's call the hardware platform "architecture", and the operating system "os".
Then there's the case of languages: many software packages are not multi-lingual, and specific versions are provided for many different languages.
I suggest we wrap "architecture", "os" and "language" into "version", allowing distributors to figure out for themselves how to differentiate between the various options. This makes life easier for java developers and doesn't change the mess for other developers.
I couldn't find a common pattern anyway. Many linux vendors separate on language early (and then there's this dumb "en" directory with no friends, as everyone uses english anyways), don't separate on os (being all about a single os after all), and separate on architecture after having separated on type. But even here things are inconsistent: just look at the language packs for KDE in a subsubsubdirectory of /en/ in the case of RedHat.
Apache HTTPD does not separate on language, but separates binaries early on, then includes the architecture as part of the version.
So there's no lesson to learn from prior art other than that it is a bit messy :D
