On Sat, 8 Mar 2003, Andrew C. Oliver wrote:

> I'd like to take this opportunity to contrast my approach. "stick some
> XML descriptors on a webserver wherever you like and point at existing
> files without renaming/moving them wherever they might be". The
> "virtual" repository.
And each project should define its own XML tags, and avoid duplication
at all costs. A second round of XML descriptors should be stuck on a
webserver wherever you like, to describe where the first set is located
and their XML schema.

Sorry - I couldn't resist.

Costin

>
> Leo Simons wrote:
>
> > Hi all,
> >
> > just read what's in the archive until now. I've summarized (well, not
> > really summarized, more elongated) the discussions up till now, added
> > my own thoughts, done some research, and then I came to a conclusion. I
> > suggest y'all rip this apart, put it together again, (it's in
> > wiki-compatible format :D) and then someone tallies a vote on whatever
> > list is appropriate.
> >
> > cheers,
> >
> > - Leo
> >
> > = THE URI FORMAT FOR A SOFTWARE ARTIFACT REPOSITORY =
> >
> > = Conclusion =
> >
> > I'll provide my conclusion first, as this is rather a lot of text :D
> >
> > When the following is known, the URI for any software distribution
> > artifact is uniquely specified:
> >
> >  * <FQDN>         - fully qualified domain name of the repository as
> >                     defined in the URL spec
> >  * <protocol>     - <scheme> as defined in the URI spec
> >  * <base>         - base directory on the machine identified by <FQDN>
> >                     (probably relative to documentroot), preferably
> >                     consisting of lowercase letters, dashes and slashes
> >  * <organisation> - the inverse of the domain name of the organisation
> >                     that produces the artifact
> >  * <project>      - the division/group within the organisation that
> >                     produces the artifact, preferably consisting of
> >                     lowercase letters, dashes and slashes, with a website
> >                     at http://<project>.<organisation>/
> >  * <name>         - the name of the artifact (unique within the
> >                     <project>), preferably consisting of lowercase
> >                     letters and dashes
> >  * <type>         - the filetype of the artifact, consisting of whatever
> >                     part of the artifact filename normally identifies the
> >                     filetype
> >  * <version>      - the version of the artifact, consisting of any set of
> >                     characters allowed in a URI, augmented with any
> >                     information about software or hardware platform
> >                     requirements if not normally part of the version,
> >                     preferably consisting of numbers, letters, dashes and
> >                     points
> >
> > and for various reasons detailed below I think the URI should be
> > composed based on the above as follows:
> >
> > <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> >
> > = Goal of the Discussion =
> >
> > The goal of this part of the discussion is to define additional
> > constraints/guidelines above and beyond the URI specification to make
> > it possible to uniquely and unambiguously define the location of a
> > software distribution artifact, when the following are known:
> >
> >  * the /FQDN/ of the repository (ie repo.apache.org)
> >  * the /protocol/ used to access the repository (ie http)
> >  * the /base/ directory of the repository (ie /dist/repository or /)
> >  * the /organisation/ that produces the artifact
> >  * the /project/ within the organisation that produces the artifact
> >  * the /name/ of the artifact
> >  * the distribution /type/ of the artifact
> >  * the /version/ of the artifact
> >
> > = Requirements =
> >
> >  * The URI should be stable
> >  * The URI should be easy to generate by humans and machines when
> >    the above listed items are known
> >  * The URI should be unique based on the uniqueness of the above
> >    listed items, ie:
> >
> >      if ! otherURI.FQDN == thisURI.FQDN
> >          return false
> >      elseif ! otherURI.protocol == thisURI.protocol
> >          return false
> >      elseif ! otherURI.basedir == thisURI.basedir
> >          return false
> >      elseif ! otherURI.organisation == thisURI.organisation
> >          return false
> >      elseif ! otherURI.project == thisURI.project
> >          return false
> >      elseif ! otherURI.name == thisURI.name
> >          return false
> >      elseif ! otherURI.version == thisURI.version
> >          return false
> >      else
> >          return true
> >
> >  * the part of the URI not containing FQDN, protocol and basedir should
> >    be common across repositories, ie it is desirable that an artifact
> >    identified by
> >
> >    ** the organisation that produces the artifact
> >    ** the project within the organisation that produces the artifact
> >    ** the name of the artifact
> >    ** the version of the artifact
> >
> >    can be found on any repository by substituting the repository FQDN,
> >    protocol and basedir from the current URI
> >
> > = Proposals =
> >
> > == Base Identification conventions ==
> >
> > base will often be "", but in the case of mirrors mirroring many
> > repositories (ie ibiblio), that might be impractical, in which case
> > I suggest the base is whatever maps to a directory on the filesystem
> > the repository is using (ie whatever ext2/3, fat32, whatever accepts
> > as a directory identifier).
> >
> > == Organisation Identification conventions ==
> >
> > It has been suggested that the identification of the organisation is
> > done by reverse domain names, ie "org.apache", "org.sun" and "com.ibm".
> >
> > It has also been suggested that the organisation is not identified
> > separately (ie as is current practice on http://www.ibiblio.org/maven/).
> >
> > == Project Identification conventions ==
> >
> > It has been suggested that the identification of a project is done by
> > lowercase letters separated by dashes, ie jakarta-commons.
> >
> > I have seen no suggestions as to how the apache project structure should
> > map into the project names in the repository, IOW, is the project part
> > of commons-logging.jar to be "jakarta", "jakarta-commons", or
> > "jakarta-commons-logging"? My suggestion is that the project structure
> > mapping is based on top-level-projects (ie *.apache.org), so the answer
> > to that question is "jakarta".
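For what it's worth, the two conventions above (reverse domain name for the organisation, top-level project for the project) can be sketched in a few lines. This is only an illustration: the function name split_project_host is made up here, and it assumes the project website host always has the shape <project>.<organisation domain>, which nobody in the thread has actually agreed on.

```python
def split_project_host(host: str) -> tuple[str, str]:
    """Split a project website host into (organisation, project).

    Assumption (not settled in the thread): the host looks like
    <project>.<organisation domain>, e.g. jakarta.apache.org.
    """
    labels = host.lower().strip(".").split(".")
    if len(labels) < 3:
        raise ValueError(f"expected <project>.<domain>, got {host!r}")
    # The first label is the top-level project.
    project = labels[0]
    # Reverse the remaining labels: apache.org -> org.apache
    organisation = ".".join(reversed(labels[1:]))
    return organisation, project


print(split_project_host("jakarta.apache.org"))  # ('org.apache', 'jakarta')
```

Note that a host like www.apache.org would come out with project "www", so a real tool would need a list of exceptions or an explicit descriptor.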
> >
> > In the context of sourceforge, the project identification would map
> > similarly, ie the convention of ${projectname}.${host}.org would lead to
> > project names of "jboss", "jedit", etc. Hence this sounds like a smart
> > mapping to me.
> >
> > == Artifact Naming conventions ==
> >
> > It has been suggested that the name of the artifact is to be determined
> > by the project providing the artifact, so that the "jakarta" project
> > determines what artifact name it will associate with the subsubproject
> > http://jakarta.apache.org/commons/logging. Of course, a project could
> > choose to delegate such a choice to a subproject or subsubproject; I
> > suggest we do not try and define who makes the artifact name choice
> > within a project :D
> >
> > It has been suggested that the name of the artifact is to be comprised
> > of lowercase letters separated by dashes, ie commons-logging.
> >
> > == Versioning conventions ==
> >
> > I have seen no suggestions with regard to versioning. I assume everyone
> > agrees that the format of a version is determined by a project, though
> > the recommended practice is that a version is comprised of numbers
> > separated by dashes and dots, and optionally containing lowercase
> > letters identifying part of the development cycle, ie
> >
> >  * 1.0
> >  * 1.0a
> >  * 1.0-alpha
> >  * 1.0-alpha-1
> >  * 08032003
> >  * 03082003
> >  * 2003-03-08
> >  * SNAPSHOT-03.08.2003
> >
> > are all acceptable, and the choice is made to conform to the versioning
> > number used by whoever supplies the artifact.
> >
> > == Distribution type conventions ==
> >
> > It has been suggested that a distribution type is defined by its
> > three-letter acronym, in lowercase, ie:
> >
> >   jar
> >   war
> >   ear
> >   rpm
> >   tgz
> >   zip
> >
> > I have not seen other suggestions.
> > I myself suggest a distribution type
> > is identified by whatever filename component normally represents the
> > distribution type for a given artifact distribution, ie common types
> > would be:
> >
> >   jar
> >   war
> >   rpm
> >   tar.gz
> >   tgz
> >   zip
> >
> > where the use of tar.gz versus the use of tgz depends on the convention
> > used by the authoritative distributor of the artifact (ie for apache
> > httpd, the files are provided as .tar.gz, so the distribution type is
> > tar.gz and not tgz).
> >
> > == The URI format ==
> >
> > (refer to http://www.ietf.org/rfc/rfc2396.txt; note we can make the
> > assumption <protocol> == <scheme>)
> >
> > Adopting the convention <thing> to identify the parts of the URI, I
> > have seen the following suggestions:
> >
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<version>/<artifact>
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
> >  * <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
> >
> > where proposals for the format of <artifact> can be any of
> >
> >  * <artifact> = <name>-<version>.<type>
> >  * <artifact> = <name>.<type>
> >  * <artifact> = ANY_VALID_URI_CHARACTERS
> >  * <artifact> = <name>-<version>.<type> | <name>.<type>
> >
> > === The current maven repository format ===
> >
> > Maven uses two different setups:
> >
> >   <artifact> = <name>-<version>.<type>
> >   <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
> >
> > if <type> == jar and
> >
> >   <artifact> = <name>-<version>.<type>
> >   <protocol>://<FQDN>/<base><name>/distributions/<artifact>
> >
> > if <type> == zip || <type> == tar.gz
> >
> > I don't think it provides other <type>s in its repo atm.
> >
> > === How to choose a format ===
> >
> > I think we should start with taking into account
> >
> >      if ! otherURI.FQDN == thisURI.FQDN
> >          return false
> >      elseif ! otherURI.protocol == thisURI.protocol
> >          return false
> >      elseif ! otherURI.basedir == thisURI.basedir
> >          return false
> >      elseif ! otherURI.organisation == thisURI.organisation
> >          return false
> >      elseif ! otherURI.project == thisURI.project
> >          return false
> >      elseif ! otherURI.name == thisURI.name
> >          return false
> >      elseif ! otherURI.version == thisURI.version
> >          return false
> >      else
> >          return true
> >
> > And once we have that settled, we should choose a layout which does
> > not duplicate information, in order to keep the URI short, ie I cannot
> > see why it is a good idea to specify (.*)<version>(.*)<version>(.*) for
> > putting the version in the URI.
> >
> > The next choice is between
> >
> >  * <artifact> = <name>-<version>.<type>
> >  * <artifact> = <name>.<type>
> >  * <artifact> = ANY_VALID_URI_CHARACTERS
> >  * <artifact> = <name>-<version>.<type> | <name>.<type>
> >
> > and when that is settled we can determine the rest of the URI.
> >
> > Note that the choice of <artifact> is important, as this is what most
> > applications will provide as the normal name for the user to save the
> > files.
> >
> > === My case for <artifact> ===
> >
> > The advantage of ANY_VALID_URI_CHARACTERS is that it reduces the need
> > for renaming of files when included in the repository: one can just use
> > the same filename as provided by the original artifact distributor.
> >
> > The big disadvantage is that this doesn't satisfy the requirement that a
> > URI should be identified as detailed above: you need to know <artifact>
> > in addition to all the other information. While this is easily solved
> > using metainformation or introspection (in the case of machines), I
> > think it makes a URI much harder to guess for a human, and is hence
> > inconvenient.
> >
> > This argument also applies to <name>-<version>.<type> | <name>.<type>,
> > though less so because you have to guess from only two possibilities.
> > However, you still need to guess, defeating the "U" in URI.
> >
> > So I suggest we choose either
> >
> >  * <artifact> = <name>-<version>.<type>
> >
> > or
> >
> >  * <artifact> = <name>.<type>
> >
> > where my preference is for the former based on the dominant practice in
> > distribution repository setup (re: maven, rpm, apt, ports, cpan, pear).
> >
> > === My case for the entire URI ===
> >
> > ==== Common Ground ====
> >
> > I think everyone agrees that the first part of the URI needs to be
> >
> >   <protocol>://<FQDN>/<base>
> >
> > so let's start from that. Based on the principle that the URI should be
> > as short as possible and simple to remember, and contain no duplicate
> > information, and the assumption that
> >
> >  * <artifact> = <name>-<version>.<type>
> >
> > ==== My Preference ====
> >
> > My preference is for
> >
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<artifact>
> >
> > so the below information
> >
> >   FQDN         = www.apache.org
> >   protocol     = http
> >   base         = dist/repository/
> >   organisation = org.apache
> >   project      = jakarta
> >   name         = commons-logging
> >   version      = 1.0
> >   type         = jar
> >
> > results in a URI of
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar
> >
> > ==== Coping with filesystem limits ====
> >
> > however, the potential danger here is that the project "jakarta" might
> > distribute 100s of files (which it does), resulting in a very long list
> > of files contained in the "jakarta" directory on the server, resulting
> > in too much output when visiting
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/
> >
> > with a normal browser (a problem common when browsing RPM repositories,
> > for example).
> > To avoid that, I suggest we make the URI a bit longer by
> > repeating the <name> and <type> elements:
> >
> >  * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> >
> > resulting in
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar
> >
> > the choice of <name> as a repetition element is I think accepted by all.
> > The rationale is that a user visiting
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta
> >
> > will know what project he is looking for, but not necessarily what
> > version ("just give me the latest") or what type ("I'll take whatever
> > you got, my tool can decompress anything").
> >
> > The choice of one of
> >
> >  * <type>
> >  * <type>s
> >  * <type>/<version>
> >  * <version>/<type>
> >  * <version>/<type>s
> >  * <version>
> >
> > is less easy. I somewhat doubt that using either <type> or <version>
> > will result in very long lists of files in a single directory, so I
> > can't think of much of an argument for choosing between those; I'd
> > say it does rule out using both of them, since we want a short URI.
> >
> > So, <version> or <type>? Based on looking at the setup used by rpm and
> > maven, I think the most common practice is <type>s, so I suggest we
> > go with that.
> >
> > = We forgot something: architecture, os, language! =
> >
> > Since we're mostly java developers, we don't need to worry about
> > architecture. However, for a general convention, we should take into
> > account other languages, like C and C++, which often result in specific
> > binaries. Even for java, there often are windows and linux-specific
> > versions (though I know of no java package for 386 as opposed to 686
> > architecture).
> >
> > Architecture can be split into operating system and hardware platform,
> > though there is often some or a lot of overlap.
> > Let's call the hardware
> > platform "architecture", and the operating system "os".
> >
> > Then there's the case of languages: many software packages are not
> > multi-lingual, and specific versions are provided for many different
> > languages.
> >
> > I suggest we wrap "architecture", "os" and "language" into "version",
> > allowing distributors to figure out for themselves how to differentiate
> > between the various options. This makes life easier for java developers
> > and doesn't change the mess for other developers.
> >
> > I couldn't find a common pattern anyway. Many linux vendors separate on
> > language early (and then there's this dumb "en" directory with no
> > friends, as everyone uses english anyways), don't separate on os (being
> > all about a single os after all), and separate on architecture after
> > having separated on type. But even here things are inconsistent:
> > just look at the language packs for KDE in a subsubsubdirectory of
> > /en/ in the case of RedHat.
> >
> > Apache HTTPD does not separate on language, but separates binaries early
> > on, then includes the architecture as part of the version.
> >
> > So there's no lesson to learn from prior art other than that it is a bit
> > messy :D
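To make the proposed layout concrete, here is a rough sketch of how a tool might build the URI from the listed items. It assumes the <name>/<type>s/<artifact> layout and <artifact> = <name>-<version>.<type> from the conclusion. The class name ArtifactId, its field names, and the host mirror.example.org are all invented for illustration and are not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactId:
    """Repository-independent identity of an artifact (hypothetical type)."""
    organisation: str  # reversed domain name, e.g. "org.apache"
    project: str       # e.g. "jakarta"
    name: str          # e.g. "commons-logging"
    version: str       # e.g. "1.0"
    dist_type: str     # e.g. "jar" or "tar.gz"

    @property
    def filename(self) -> str:
        # <artifact> = <name>-<version>.<type>
        return f"{self.name}-{self.version}.{self.dist_type}"

    def path(self) -> str:
        # The part of the URI that should be common across repositories.
        return (f"{self.organisation}/{self.project}/{self.name}/"
                f"{self.dist_type}s/{self.filename}")

    def uri(self, protocol: str, fqdn: str, base: str = "") -> str:
        # <protocol>://<FQDN>/<base><organisation>/<project>/...
        base = base.strip("/")
        prefix = f"{base}/" if base else ""
        return f"{protocol}://{fqdn}/{prefix}{self.path()}"


logging_jar = ArtifactId("org.apache", "jakarta", "commons-logging", "1.0", "jar")
print(logging_jar.uri("http", "www.apache.org", "dist/repository/"))
# The same artifact on a (made-up) mirror only swaps protocol, FQDN and base.
print(logging_jar.uri("http", "mirror.example.org"))
```

Because the dataclass is frozen and compares field by field, two ArtifactId values are equal exactly when organisation, project, name, version and type all match, which is roughly the repository-independent half of the uniqueness check in the Requirements section.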