URI/URL Syntax -- little nits to be aware of

Adam R. B. Jack 9 Nov 2003 18:03:00 -0000

I know URI syntax is dragging on (and I don't know if we are coming to
consensus or going round and round) but I hope folks are still open to
this stuff, because IMHO the URI/URL syntax may be *the only critical
thing* we need to determine/document for the repository to be at a
satisfactory phase 1.


As I believe Roy wrote -- we must build machine parsability into the
specification. I feel the URI and the resource/file names need to be machine
parsable so the directories/HTML are metadata in themselves for simple/smart
tools.

I'd like to throw out a few thoughts/experiences based upon attempting to
write/maintain Ruper (Ruper1 via regular expressions, Ruper2 via specialized
code) [parsing filenames/URIs] and Version (again, the latter) [parsing
version formats]. Note: nothing in what I am about to say advocates a code
base; these are just gotchas we ran into that we all ought to be aware of.

1) I have angst over the version in the URI (as a 'directory') only because
of the likely need for symbolic links for 'latest'. I think this is a burden
on publishing tools, and leads to errors (what if two tools were publishing
at the same time, can symbolic links be created remotely, etc.). That said,
read on...
2) Version in the filename has its issues also -- e.g.
"jakarta-servlet-api-4-1.1" -- is that version 4 or 1? (It is 1.1 of
jakarta-servlet-api-4.)
3) Some folks like to use _ not - for such separators. Some also like to use
periods in resource names. Both make resource parsing hard.
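
To make the ambiguity in (2) concrete, here is a minimal sketch (not Ruper
code; the function names are made up for illustration) that lists every way a
hyphen-separated name could be cut into name + version, keeping only the cuts
where the version part starts with a digit. It assumes "-" is the separator,
which per (3) is itself not a safe assumption.

```python
def candidate_splits(stem, separator="-"):
    """Return every (name, version) pair obtainable by cutting at one separator."""
    parts = stem.split(separator)
    return [
        (separator.join(parts[:i]), separator.join(parts[i:]))
        for i in range(1, len(parts))
    ]


def plausible_splits(stem, separator="-"):
    """Keep only the cuts whose version part starts with a digit (a heuristic)."""
    return [
        (name, version)
        for name, version in candidate_splits(stem, separator)
        if version[:1].isdigit()
    ]


if __name__ == "__main__":
    for name, version in plausible_splits("jakarta-servlet-api-4-1.1"):
        print(f"name={name!r} version={version!r}")
    # prints two candidates: version '4-1.1' and version '1.1'
```

Two plausible answers for one filename is exactly the problem: without a
convention, a tool has to guess.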

If we wish to parse we either need some convention or separator -- or we
need to better define the version namespace. Also, whether version is in the
filename or the directory, how does one 'understand' the version? Is
1.1-SNAPSHOT "better" than 1.1, and is either "better" than "-alpha"? If we
want to process versions we certainly need some sort of specification. [Note:
metadata in each group could define the version specification/standard,
etc.]
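
As a strawman of what such a specification buys us, the sketch below turns a
version string into a sortable key. The grammar (dotted numbers plus one
optional "-qualifier") and the qualifier ranking are purely my assumptions
for illustration; nothing here has been agreed, and real version strings are
messier.

```python
import re

# Assumed ranking, lowest first; a plain release (no qualifier) ranks highest.
QUALIFIER_RANK = {"alpha": 0, "beta": 1, "rc": 2, "snapshot": 3, "": 4}

VERSION_PATTERN = re.compile(r"(\d+(?:\.\d+)*)(?:-([A-Za-z]+))?")


def version_key(version):
    """Return a tuple that sorts versions under the assumed rules above."""
    match = VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"unparsable version: {version!r}")
    numbers = tuple(int(n) for n in match.group(1).split("."))
    qualifier = (match.group(2) or "").lower()
    if qualifier not in QUALIFIER_RANK:
        raise ValueError(f"unknown qualifier in version: {version!r}")
    return (numbers, QUALIFIER_RANK[qualifier])


if __name__ == "__main__":
    versions = ["1.1", "1.1-SNAPSHOT", "1.1-alpha", "1.0"]
    print(sorted(versions, key=version_key))
    # ['1.0', '1.1-alpha', '1.1-SNAPSHOT', '1.1']
```

Note that a bare "-alpha" is rejected outright under this grammar, which is
the point: the tool can only order what the specification defines.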

BTW: With code specifically trying to "sniff out the right stuff" Ruper2 is
currently able to process all but 35 of the couple of thousand artefacts
on Maven's Ibiblio repository. Those 35 have resource name formats that
break parsing. Maybe we apply an 80/20 rule, but it seems a real shame not
to have 100%.

BTW: The same parsing issues arise for anything at the end of the filename,
e.g. -src or -docs. How does one know those aren't some version attribute
(like -snapshot or -beta)?
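
One way out is reserved word lists, sketched below. The two sets are
assumptions made up for this example (not an agreed standard); the point is
that a tool can only tell a content suffix from a version attribute if
something enumerates them.

```python
# Assumed reserved words, for illustration only.
ASSUMED_CLASSIFIERS = {"src", "docs"}
ASSUMED_QUALIFIERS = {"snapshot", "alpha", "beta"}


def classify_trailing_token(stem):
    """Classify the last hyphen-separated token of a filename stem."""
    token = stem.rsplit("-", 1)[-1].lower()
    if token in ASSUMED_CLASSIFIERS:
        return "classifier"
    if token in ASSUMED_QUALIFIERS:
        return "version-qualifier"
    return "unknown"


if __name__ == "__main__":
    for stem in ("foo-1.1-src", "foo-1.1-beta", "foo-1.1-bin"):
        print(stem, "->", classify_trailing_token(stem))
```

Anything that comes back "unknown" (here "-bin") is where a sniffing parser
has to start guessing.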

I don't know what folks' views are, but I could see us having to break every
part of the URI down and define/document "best practices" or a "standard" in
order to ensure the URIs are parsable. As such, I believe we ought document
a URI and URL specification (on Wiki would be my preference, but if nobody
else volunteers to be secretary, I'd take that one.) I do feel strongly that
the syntax must be completely computer readable w/o additional metadata (at
least most of the time.)

[Sorry I dropped off these threads for a while, all the cross posting caused
duplicates and the apparent bulk of e-mail overwhelmed me. I've finally
worked through most of it. Are we safe to just post to repository@ now?]

     ------------------------------------------------------------------

Separately, on resource URIs and resource URLs. I am game for a first cut
solution where URI = URL, i.e. every entity in a repository is uniquely
identifiable, and that identification happens to match its location. That
said, I wonder if we want a URI syntax that is explicit (everything
separated into 'directories') so a user can easily express what they want,
yet a potentially different URL (so repository managers can maintain things
more easily.) I suspect a separation like this could benefit both groups.
Just a thought...
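
To show the kind of separation I mean, here is a toy mapping from an explicit
URI to a flatter URL. Both layouts are hypothetical (the four-part
group/name/version/type URI, the one-directory-per-group URL, and the
example host are all assumptions for this sketch), so treat it as an
illustration, not a proposal.

```python
def uri_to_url(uri, base_url):
    """Map 'group/name/version/type' to '<base>/<group>/<name>-<version>.<type>'."""
    parts = uri.strip("/").split("/")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected group/name/version/type, got: {uri!r}")
    group, name, version, kind = parts
    return f"{base_url.rstrip('/')}/{group}/{name}-{version}.{kind}"


if __name__ == "__main__":
    print(uri_to_url("jakarta/servlet-api/1.1/jar", "http://repo.example.org/"))
    # http://repo.example.org/jakarta/servlet-api-1.1.jar
```

The user-facing URI never has the name/version ambiguity, while the
repository manager stays free to change the URL side of the mapping.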

Also, to provide benefit to the users we probably need abstractions such as
"latest", or groupings such as "all artefacts". Do we work those into a URI?
Into a URL?

Making a user come get the "jars" and then come get the "src" or "xml
resources" (if there are such things) seems rude. A user ought be able to
say type="all", and get all of them. My experience has been that grouping
those in one directory is probably easiest for the clients (since they don't
need metadata to make associations.) As such, this pushes one towards one
directory per group w/ all versions/types in there -- so long as the
filename is parsable. [I won't lie to you, I don't know what the right
solution is, sometimes separating is good, sometimes together is good. I
lean towards the latter.]

Finally, I suspect there will be "other stuff" (other URLs) within a
repository that do not map to a resource URI (e.g. metadata files). I
suspect we have to be able to programmatically exclude those without
metadata (e.g. dot files or all files ending with .xml are excluded, or
... ).
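
For what it's worth, the two example rules above are trivial to apply from the
name alone, which is why I like them. A minimal sketch, assuming exactly
those two rules (dot files and *.xml) and nothing more:

```python
from pathlib import PurePosixPath


def is_resource(path):
    """Return False for names excluded by the two assumed rules, else True."""
    name = PurePosixPath(path).name
    return not (name.startswith(".") or name.lower().endswith(".xml"))


if __name__ == "__main__":
    listing = ["group/foo-1.1.jar", "group/.metadata", "group/project.xml"]
    print([p for p in listing if is_resource(p)])
    # ['group/foo-1.1.jar']
```

The obvious catch: a rule like "all .xml is excluded" means no artefact may
itself be an XML file, so the exclusion rules need the same care as the rest
of the syntax.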

Just some random thoughts...

regards,

Adam
--
Experience Sybase Technology...
http://www.try.sybase.com
