Re: [proposal] repository URI format

Costin Manolache 8 Mar 2003 16:22:06 -0000

On Sat, 8 Mar 2003, Andrew C. Oliver wrote:

> I'd like to take this opportunity to contrast my approach.  "stick some 
> XML descriptors on a webserver wherever you like and point at existing 
> files without renaming/moving them wherever they might be"..  The 
> "virtual" repository.


And each project should define its own XML tags, and avoid duplication at 
all costs. 

A second round of XML descriptors should be stuck on a webserver wherever 
you like to describe where the first set is located and their XML schema.

Sorry - I couldn't resist.

Costin


> 
> Leo Simons wrote:
> 
> > Hi all,
> >
> > just read what's in the archive until now. I've summarized (well, not
> > really summarized, more elongated) the discussions up till now, added
> > my own thoughts, done some research, and then I came to a conclusion. I
> > suggest y'all rip this apart, put it together again, (it's in
> > wiki-compatible format :D) and then someone tallies a vote on whatever
> > list is appropriate.
> >
> > cheers,
> >
> > - Leo
> >
> >         = THE URI FORMAT FOR A SOFTWARE ARTIFACT REPOSITORY =
> >
> > = Conclusion =
> >
> > I'll provide my conclusion first, as this is rather a lot of text :D
> >
> > When the following is known, the URI for any software distribution
> > artifact is uniquely specified:
> >
> > * <FQDN>         - fully qualified domain name of the repository as
> >                    defined in the URL spec
> > * <protocol>     - <scheme> as defined in the URI spec
> > * <base>         - base directory on the machine identified by <FQDN>
> >                    (probably relative to documentroot), preferably
> >                    consisting of lowercase letters, dashes and slashes
> > * <organisation> - the inverse of the domain name of the organisation
> >                    that produces the artifact
> > * <project>      - the division/group within the organisation that
> >                    produces the artifact, preferably consisting of
> >                    lowercase letters, dashes and slashes, with a website
> >                    at http://<project>.<organisation>/
> > * <name>         - the name of the artifact (unique within the
> >                    <project>), preferably consisting of lowercase
> >                    letters and dashes
> > * <type>         - the filetype of the artifact, consisting of whatever
> >                    part of the artifact filename normally identifies the
> >                    filetype
> > * <version>      - the version of the artifact, consisting of any set of
> >                    characters allowed in a URI, augmented with any
> >                    information about software or hardware platform
> >                    requirements if not normally part of the version,
> >                    preferably consisting of numbers, letters, dashes and
> >                    points
> >
> > and for various reasons detailed below I think the URI should be
> > composed based on the above as follows:
> >
> > <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> >
> > = Goal of the Discussion =
> >
> > The goal of this part of the discussion is to define additional
> > constraints/guidelines above and beyond the URI specification to make
> > it possible to uniquely and unambiguously define the location of a
> > software distribution artifact, when the following are known:
> >
> > * the /FQDN/ of the repository (ie repo.apache.org)
> > * the /protocol/ used to access the repository (ie http)
> > * the /base/ directory of the repository (ie /dist/repository or /)
> > * the /organisation/ that produces the artifact
> > * the /project/ within the organisation that produces the artifact
> > * the /name/ of the artifact
> > * the distribution /type/ of the artifact
> > * the /version/ of the artifact
> >
> > = Requirements =
> >
> > * The URI should be stable
> > * The URI should be easy to generate by humans and machines when
> >   the above listed items are known
> > * The URI should be unique based on the uniqueness of the above
> >   listed items, ie:
> >
> > if ! otherURI.FQDN == thisURI.FQDN
> >    return false
> > elseif ! otherURI.protocol == thisURI.protocol
> >    return false
> > elseif ! otherURI.basedir == thisURI.basedir
> >    return false
> > elseif ! otherURI.organisation == thisURI.organisation
> >    return false
> > elseif ! otherURI.project == thisURI.project
> >    return false
> > elseif ! otherURI.name == thisURI.name
> >    return false
> > elseif ! otherURI.version == thisURI.version
> >    return false
> > else
> >    return true
> >
> > * the part of the URI not containing FQDN, protocol and basedir should
> >   be common across repositories, ie it is desirable that an artifact
> >   identified by
> >
> > ** the organisation that produces the artifact
> > ** the project within the organisation that produces the artifact
> > ** the name of the artifact
> > ** the version of the artifact
> >
> >   can be found on any repository by substituting the repository FQDN,
> >   protocol and basedir from the current URI
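A minimal Python sketch of the uniqueness and mirror-substitution requirements above. The class name `ArtifactId`, the helper `on_mirror`, the field names and the host `mirror.example.org` are assumptions made for illustration and are not part of the proposal. The comparison covers the same items as the if/elseif chain, which leaves out the distribution type.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArtifactId:
    """Hypothetical record of the items the requirements treat as known."""
    fqdn: str
    protocol: str
    basedir: str
    organisation: str
    project: str
    name: str
    version: str


def on_mirror(artifact, fqdn, protocol, basedir):
    """Swap only the repository-specific parts; everything else stays common."""
    return replace(artifact, fqdn=fqdn, protocol=protocol, basedir=basedir)


origin = ArtifactId("repo.apache.org", "http", "/dist/repository/",
                    "org.apache", "jakarta", "commons-logging", "1.0")
mirror = on_mirror(origin, "mirror.example.org", "http", "/")

# Dataclass equality compares field by field, like the if/elseif chain:
# two identifiers are equal only if every listed item matches.
same = origin == ArtifactId("repo.apache.org", "http", "/dist/repository/",
                            "org.apache", "jakarta", "commons-logging", "1.0")
```

Here `same` is true, while `origin == mirror` is false because the repository-specific fields differ even though organisation, project, name and version are shared.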
> >
> > = Proposals =
> >
> > == Base Identification conventions ==
> >
> > base will often be "", but in the case of mirrors mirroring many
> > repositories (ie ibiblio), that might be impractical, in which case
> > I suggest the base is whatever maps to a directory on the filesystem
> > the repository is using (ie whatever ext2/3, fat32, whatever accepts
> > as a directory identifier).
> >
> > == Organisation Identification conventions ==
> >
> > It has been suggested that the identification of the organisation is
> > done by reverse domain names, ie "org.apache", "org.sun" and "com.ibm".
> >
> > It has also been suggested that the organisation is not identified
> > separately (ie as is current practice on http://www.ibiblio.org/maven/).
> >
> > == Project Identification conventions ==
> >
> > It has been suggested that the identification of a project is done by
> > lowercase letters separated by dashes, ie jakarta-commons.
> >
> > I have seen no suggestions as to how the apache project structure should
> > map into the project names in the repository, IOW, is the project part
> > of commons-logging.jar to be "jakarta", "jakarta-commons", or
> > "jakarta-commons-logging"? My suggestion is that the project structure
> > mapping is based on top-level-projects (ie *.apache.org), so the answer
> > to that question is "jakarta".
> >
> > In the context of sourceforge, the project identification would map
> > similarly, ie the convention of ${projectname}.${host}.org would lead to
> > project names of "jboss", "jedit", etc. Hence this sounds like a smart
> > mapping to me.
> >
> > == Artifact Naming conventions ==
> >
> > It has been suggested that the name of the artifact is to be determined
> > by the project providing the artifact, so that the "jakarta" project
> > determines what artifact name it will associate with the subsubproject
> > http://jakarta.apache.org/commons/logging. Of course, a project could
> > choose to delegate such a choice to a subproject or subsubproject; I
> > suggest we do not try and define who makes the artifact name choice
> > within a project :D
> >
> > It has been suggested that the name of the artifact is to be comprised
> > of lowercase letters separated by dashes, ie commons-logging.
> >
> > == Versioning conventions ==
> >
> > I have seen no suggestions with regard to versioning. I assume everyone
> > agrees that the format of a version is determined by a project, though
> > the recommended practice is that a version is comprised of numbers
> > separated by dashes and dots, and optionally containing lowercase
> > letters identifying part of the development cycle, ie
> >
> > * 1.0
> > * 1.0a
> > * 1.0-alpha
> > * 1.0-alpha-1
> > * 08032003
> > * 03082003
> > * 2003-03-08
> > * SNAPSHOT-03.08.2003
> >
> > are all acceptable, and the choice is made to conform to the version
> > numbering used by whoever supplies the artifact.
> >
> > == Distribution type conventions ==
> >
> > It has been suggested that a distribution type is defined by its
> > three-letter acronym, in lowercase, ie:
> >
> > jar
> > war
> > ear
> > rpm
> > tgz
> > zip
> >
> > I have not seen other suggestions. I myself suggest a distribution type
> > is identified by whatever filename component normally represents the
> > distribution type for a given artifact distribution, ie common types
> > would be:
> >
> > jar
> > war
> > rpm
> > tar.gz
> > tgz
> > zip
> >
> > where the use of tar.gz versus the use of tgz depends on the convention
> > used by the authoritative distributor of the artifact (ie for apache
> > httpd, the files are provided as .tar.gz, so the distribution type is
> > tar.gz and not tgz).
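A rough Python sketch of reading the distribution type off a filename, as suggested above. The function name `distribution_type`, the `KNOWN_TYPES` tuple and the example filenames are assumptions for illustration; the proposal itself does not fix a list of types.

```python
# Hypothetical list of known distribution types. "tar.gz" is listed so that a
# two-part suffix is recognised as a whole instead of being reduced to "gz".
KNOWN_TYPES = ("tar.gz", "jar", "war", "ear", "rpm", "tgz", "zip")


def distribution_type(filename):
    """Return the filename component that identifies the distribution type."""
    for known in KNOWN_TYPES:
        if filename.endswith("." + known):
            return known
    # Unknown type: fall back to whatever follows the last dot, if anything.
    _, dot, suffix = filename.rpartition(".")
    return suffix if dot else ""
```

With this sketch a file distributed as `.tar.gz` keeps the type `tar.gz`, and one distributed as `.tgz` keeps `tgz`, so the authoritative distributor's convention is preserved.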
> >
> > == The URI format ==
> >
> > (refer to http://www.ietf.org/rfc/rfc2396.txt; note we can make the
> > assumption <protocol> == <scheme>)
> >
> > Adopting the convention <thing> to identify the parts of the URI, I
> > have seen the following suggestions:
> >
> > * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<version>/<artifact>
> > * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> > * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<artifact>
> > * <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
> >
> > where proposals for the format of <artifact> can be any of
> >
> > * <artifact> = <name>-<version>.<type>
> > * <artifact> = <name>.<type>
> > * <artifact> = ANY_VALID_URI_CHARACTERS
> > * <artifact> = <name>-<version>.<type> | <name>.<type>
> >
> > === The current maven repository format ===
> >
> > Maven uses two different setups:
> >
> > <artifact> = <name>-<version>.<type>
> > <protocol>://<FQDN>/<base><name>/<type>s/<artifact>
> >
> > if <type> == jar and
> >
> > <artifact> = <name>-<version>.<type>
> > <protocol>://<FQDN>/<base><name>/distributions/<artifact>
> >
> > if <type> == zip || <type> == tar.gz
> >
> > I don't think it provides other <type>s in its repo atm.
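The two Maven setups described above can be sketched in Python as follows. The function name `maven_style_uri` is hypothetical and is not a Maven API; the host and base are taken from the ibiblio URL mentioned earlier in this thread, and the version 1.0 is used only as an illustration.

```python
def maven_style_uri(protocol, fqdn, base, name, version, type_):
    """Compose a URI following the two layouts described for the Maven repo."""
    artifact = f"{name}-{version}.{type_}"
    if type_ == "jar":
        directory = f"{type_}s"          # <name>/<type>s/<artifact>
    elif type_ in ("zip", "tar.gz"):
        directory = "distributions"      # <name>/distributions/<artifact>
    else:
        # The description above covers no other types, so refuse to guess.
        raise ValueError(f"no known layout for type {type_!r}")
    return f"{protocol}://{fqdn}/{base}{name}/{directory}/{artifact}"


jar_uri = maven_style_uri("http", "www.ibiblio.org", "maven/",
                          "commons-logging", "1.0", "jar")
```

Under these assumptions `jar_uri` is `http://www.ibiblio.org/maven/commons-logging/jars/commons-logging-1.0.jar`.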
> >
> > === How to choose a format ===
> >
> > I think we should start with taking into account
> >
> > if ! otherURI.FQDN == thisURI.FQDN
> >    return false
> > elseif ! otherURI.protocol == thisURI.protocol
> >    return false
> > elseif ! otherURI.basedir == thisURI.basedir
> >    return false
> > elseif ! otherURI.organisation == thisURI.organisation
> >    return false
> > elseif ! otherURI.project == thisURI.project
> >    return false
> > elseif ! otherURI.name == thisURI.name
> >    return false
> > elseif ! otherURI.version == thisURI.version
> >    return false
> > else
> >    return true
> >
> > And once we have that settled, we should choose a layout which does
> > not duplicate information, in order to keep the URI short, ie I cannot
> > see why it is a good idea to specify (.*)<version>(.*)<version>(.*) for
> > putting the version in the URI.
> >
> > The next choice is between
> >
> > * <artifact> = <name>-<version>.<type>
> > * <artifact> = <name>.<type>
> > * <artifact> = ANY_VALID_URI_CHARACTERS
> > * <artifact> = <name>-<version>.<type> | <name>.<type>
> >
> > and when that is settled we can determine the rest of the URI.
> >
> > Note that the choice of <artifact> is important, as this is what most
> > applications will provide as the normal name for the user to save the
> > files.
> >
> > === My case for <artifact> ===
> >
> > The advantage of ANY_VALID_URI_CHARACTERS is that it reduces the need
> > for renaming of files when included in the repository: one can just use
> > the same filename as provided by the original artifact distributor.
> >
> > The big disadvantage is that this doesn't satisfy the requirement that a
> > URI should be identified as detailed above: you need to know <artifact>
> > in addition to all the other information. While this is easily solved
> > using metainformation or introspection (in the case of machines), I
> > think it makes a URI much harder to guess for a human, and is hence
> > inconvenient.
> >
> > This argument also applies to <name>-<version>.<type> | <name>.<type>,
> > though less so because you have to guess from only two possibilities.
> > However, you still need to guess, defeating the "U" in URI.
> >
> > So I suggest we choose either
> >
> > * <artifact> = <name>-<version>.<type>
> >
> > or
> >
> > * <artifact> = <name>.<type>
> >
> > where my preference is for the former based on the dominant practice in
> > distribution repository setup (re: maven, rpm, apt, ports, cpan, pear).
> >
> > === My case for the entire URI ===
> >
> > ==== Common Ground ====
> >
> > I think everyone agrees that the first part of the URI needs to be
> >
> > <protocol>://<FQDN>/<base>
> >
> > so let's start from that. Based on the principle that the URI should be
> > as short as possible and simple to remember, and contain no duplicate
> > information, and the assumption that
> >
> > * <artifact> = <name>-<version>.<type>
> >
> > ==== My Preference ====
> >
> > My preference is for
> >
> > * <protocol>://<FQDN>/<base><organisation>/<project>/<artifact>
> >
> > so the below information
> >
> > FQDN = www.apache.org
> > protocol = http
> > base = dist/repository/
> > organisation = org.apache
> > project = jakarta
> > name = commons-logging
> > type = jar
> > version = 1.0
> >
> > results in a URI of
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging-1.0.jar
> >
> > ==== Coping with filesystem limits ====
> >
> > however, the potential danger here is that the project "jakarta" might
> > distribute 100s of files (which it does), resulting in a very long list
> > of files contained in the "jakarta" directory on the server, resulting
> > in too much output when visiting
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/
> >
> > with a normal browser (a problem common when browsing RPM repositories,
> > for example). To avoid that, I suggest we make the URI a bit longer by
> > repeating the <name> and <type> elements:
> >
> > * <protocol>://<FQDN>/<base><organisation>/<project>/<name>/<type>s/<artifact>
> >
> > resulting in
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar
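The composition above can be sketched in Python. The function name `repository_uri` and its parameter names are assumptions for illustration; the values are the ones from the example, and `<artifact>` is taken to be `<name>-<version>.<type>`.

```python
def repository_uri(protocol, fqdn, base, organisation, project,
                   name, version, type_):
    """Compose the proposed URI from its known parts.

    Assumes <artifact> = <name>-<version>.<type> and that <base> is either
    empty or ends with a slash, as in the example ("dist/repository/").
    """
    artifact = f"{name}-{version}.{type_}"
    return (f"{protocol}://{fqdn}/{base}{organisation}/{project}/"
            f"{name}/{type_}s/{artifact}")


uri = repository_uri("http", "www.apache.org", "dist/repository/",
                     "org.apache", "jakarta", "commons-logging", "1.0", "jar")
```

With these inputs `uri` is `http://www.apache.org/dist/repository/org.apache/jakarta/commons-logging/jars/commons-logging-1.0.jar`, the same as the example URI.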
> >
> > the choice of <name> as a repetition element is I think accepted by all.
> > The rationale is that a user visiting
> >
> > http://www.apache.org/dist/repository/org.apache/jakarta
> >
> > will know what project he is looking for, but not necessarily what
> > version ("just give me the latest") or what type ("I'll take whatever
> > you got, my tool can decompress anything").
> >
> > The choice of one of
> >
> > * <type>
> > * <type>s
> > * <type>/<version>
> > * <version>/<type>
> > * <version>/<type>s
> > * <version>
> >
> > is less easy. I somewhat doubt that using either <type> or <version>
> > will result in very long lists of files in a single directory, so I
> > can't think of much of an argument for choosing between those. I would
> > rule out the options that use both of them, though, because we want a
> > short URI.
> >
> > So, <version> or <type>? Based on looking at the setup used by rpm and
> > maven, I think the most common practice is <type>s, so I suggest we
> > go with that.
> >
> > = We forgot something: architecture, os, language! =
> >
> > Since we're mostly java developers, we don't need to worry about
> > architecture. However, for a general convention, we should take into
> > account other languages, like C and C++, which often result in specific
> > binaries. Even for java, there often are windows and linux-specific
> > versions (though I know of no java package for 386 as opposed to 686
> > architecture).
> >
> > Architecture can be split into operating system and hardware platform,
> > though there is often some or a lot of overlap. Let's call the hardware
> > platform "architecture", and the operating system "os".
> >
> > Then there's the case of languages: many software packages are not
> > multi-lingual, and specific versions are provided for many different
> > languages.
> >
> > I suggest we wrap "architecture", "os" and "language" into "version",
> > allowing distributors to figure out for themselves how to differentiate
> > between the various options. This makes life easier for java developers
> > and doesn't change the mess for other developers.
> >
> > I couldn't find a common pattern anyway. Many linux vendors separate on
> > language early (and then there's this dumb "en" directory with no
> > friends, as everyone uses english anyways), don't separate on os (being
> > all about a single os after all), and separate on architecture after
> > having separated on type. But even here things are inconsistent:
> > just look at the language packs for KDE in a subsubsubdirectory of
> > /en/ in the case of RedHat.
> >
> > Apache HTTPD does not separate on language, but separates binaries early
> > on, then includes the architecture as part of the version.
> >
> > So there's no lesson to learn from prior art other than that it is a bit
> > messy :D
> >
> >
> 
> 
>
