On 12/2/06, Mike Schinkel <[EMAIL PROTECTED]> wrote:
A couple points on this subject. I have recently been doing a *lot* of
research in the area of URLs/URIs and having discussions with numerous
people on REST-discuss and www-TAG lists so I feel I'm pretty well-versed on
this subject now.

Although it is possible to infer an ISBN or maybe even a DOI from a URL, it
is considered "Bad Practice" unless the "URI Authority" (i.e. owner of the
website) specifically documented the structure of the URL and gave a
reasonably trustworthy guarantee that it will not change.

References:

1.) "Architecture of the World Wide Web, Volume One" section 2.5 on "URI
Opacity" [1]:

        Good practice: URI opacity
        Agents making use of URIs SHOULD NOT attempt to infer properties of
        the referenced resource.

2.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
metadata" [2]

        Constraint: Web software MUST NOT depend on the correctness of
        metadata inferred from a URI, except when the encoding of such
        metadata is documented by applicable standards and specifications.

3.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
metadata" [3]

        The principal conclusions of this finding are:

        * Assignment authorities may publish specifications detailing the
        structure and semantics of the URIs they assign. Other users of
        those URIs may use such specifications to infer information about
        resources identified by URI assigned by that authority.

        * People and software using URIs assigned outside of their own
        authority should make as few inferences as possible about a
        resource based on its URI. The more dependencies a piece of
        software has on particular constraints and inferences, the more
        fragile it becomes to change and the lower its generic utility.

In the case of Jon Udell's LibraryLookup, which has been referenced as an
example:

        Data point: ISBNs are already being reliably extracted from URLs:

http://weblog.infoworld.com/udell/stories/2002/12/11/librarylookup.html

Jon's work has been derided by purists as doing something it shouldn't, i.e.
"peeking" into URLs when they should remain opaque. Personally, I don't see
what Jon did as such a bad thing. Jon's script interfaces with a human only,
and if Amazon ever changes their URLs his script just won't work and the
user will figure that out. In the meantime, by breaking the rules he's
offering pretty useful functionality that he couldn't get otherwise. And
even if Amazon does change their URLs and his script breaks, which is highly
unlikely given their affiliate program, Jon can just update his script and
then anyone who has a broken script can search for Jon's new version (unless
Amazon eliminates the ISBN from the URL, which I would highly doubt).
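
To make the "peeking" concrete, here is a minimal Python sketch of the kind
of inference being discussed. It is not Jon's script. The URL shape (an
ISBN-10 appearing as its own path segment) and the example.com addresses are
assumptions made purely for illustration, and the function names are
hypothetical. The check-digit test only reduces false positives; it does
nothing about the real risk, which is that an undocumented URL structure can
change at any time.

```python
import re
from urllib.parse import urlparse

# Assumed shape: a path segment of nine digits followed by a digit or X.
# This is an illustrative guess, not a documented URL format of any site.
ISBN10_SEGMENT = re.compile(r"^[0-9]{9}[0-9Xx]$")


def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 check digit: weights 10 down to 1, sum divisible by 11."""
    if not ISBN10_SEGMENT.match(isbn):
        return False
    total = 0
    for weight, char in zip(range(10, 0, -1), isbn.upper()):
        total += weight * (10 if char == "X" else int(char))
    return total % 11 == 0


def guess_isbn_from_url(url: str):
    """Return the first path segment that passes the ISBN-10 check, else None."""
    for segment in urlparse(url).path.split("/"):
        if isbn10_is_valid(segment):
            return segment.upper()
    return None


# Hypothetical URLs; the second simulates a site that changed its URL scheme.
print(guess_isbn_from_url("http://shop.example.com/books/0395363411/details"))
print(guess_isbn_from_url("http://shop.example.com/item?id=42"))
```

When the guess fails, the function returns None, which is the "script just
won't work and the user will figure that out" case described above.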

However, advocating the use of non-document metadata in a URL for a
Microformat citation is a completely different matter. Jon Udell is one
author using it for one app (LibraryLookup), whose users can later get
updates if required. A Microformat is different: authors will mark up untold
amounts of HTML content, much of which will never get updated for future
revisions, and that requires a very high bar for immutability. IOW, we should
ensure that we have a *guarantee* that the format of the URL will *never*
change, or we shouldn't use it. Yes, we *could* still parse the old format,
but we'd have to keep adding parsers, some of which might eventually fail
because of ambiguity.

At the moment, the only immutable reference for an ISBN is a URN from RFC
3187[4]. For example:

        URN:ISBN:0-395-36341-1

This doesn't dereference in a browser (IE7, for example), but one day
it might. But we can be sure it is definitely immutable.
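
As a rough sketch of how such a URN could be produced safely, the Python
below validates an ISBN-10 check digit and then prefixes the identifier in
the same style as the example above. The function name isbn10_to_urn is
hypothetical, and the sketch handles only 10-digit ISBNs; it is an
illustration under those assumptions, not a complete RFC 3187
implementation.

```python
import re


def isbn10_to_urn(isbn: str) -> str:
    """Validate a (possibly hyphenated) ISBN-10 and return it as URN:ISBN:..."""
    cleaned = isbn.strip()
    # Drop hyphens and spaces only for validation; the URN keeps the original form.
    digits = cleaned.replace("-", "").replace(" ", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        raise ValueError(f"not an ISBN-10: {isbn!r}")
    # ISBN-10 check: weights 10 down to 1, with X standing for 10 in the last place.
    total = sum(
        weight * (10 if char == "X" else int(char))
        for weight, char in zip(range(10, 0, -1), digits)
    )
    if total % 11 != 0:
        raise ValueError(f"bad ISBN-10 check digit: {isbn!r}")
    return "URN:ISBN:" + cleaned


print(isbn10_to_urn("0-395-36341-1"))
```

The point of validating first is that an identifier meant to be immutable
should not be minted from a mistyped number.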

As for resolving DOIs, they are new to me and I've not done enough research
to determine if there is an immutable resolvable source for DOIs.  This
article[5] and these websites ([6] & [7]) might be helpful there.

As an aside, please don't take this as me being unsupportive.  On the
contrary, I am a strong advocate of getting website owners to put metadata in
their URLs and to document that metadata. However, until we have solid
sources of URLs with documented metadata, we should probably all play
smartly by the rules as specified by the W3C, at least IMO.

-Mike Schinkel
http://www.mikeschinkel.com/blogs/
http://www.welldesignedurls.org/

[1] http://www.w3.org/TR/webarch/#uri-opacity
[2] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html
[3] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html#N1023D
[4] http://www.ietf.org/rfc/rfc3187.txt
[5] http://www.dlib.org/dlib/june98/06powell.html
[6] http://www.handle.net/
[7] http://www.doi.org/


Mike, thanks for all the detail. I definitely learned some things.

In the context of my original proposal to add a "URL" field to the
microformat, I now feel like I need to separate that proposal from one
of the statements I made in it:

"I also suggest that in the case of identifiers like a DOI or ISBN
which can be represented as a parameter in a link to doi.org or some
other resolver, that the format encourage using a URL field for those
identifiers and not include separate fields for each such identifier.
In other words, I think that class="url uid" is sufficient to encode
DOI/ISBN/etc., and we shouldn't add a separate DOI class, a separate
ISBN class, and so on."

To be clear - I still think that *if* it is possible to mark up a DOI
or ISBN as a link without obscuring the DOI, then that's a positive
thing. It sounds like it's just more complicated than I thought to do
that. So maybe the format doesn't need to mention those in connection
with the URL field.

I do think that a URL field (class="url") should be included, to
represent a link to a copy of the cited work, and if we want to mark
up one or more identifiers, we can use a separate class (I suggest
"uid") to do so. If we're lucky and there's a good way to merge them,
then use class="url uid".
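
As a rough illustration of how a consumer might read such markup, here is a
small Python sketch using the standard library's html.parser. The sample
markup is hypothetical: the enclosing "citation" class, the example.org
addresses, and the resolver-style link are assumptions for illustration,
not part of any agreed format. The sketch also assumes both classes appear
on <a> elements, which the proposal above does not require.

```python
from html.parser import HTMLParser


class CitationLinkParser(HTMLParser):
    """Collect href values from <a> elements carrying the 'url' or 'uid' class."""

    def __init__(self):
        super().__init__()
        self.urls = []  # links to a copy of the cited work
        self.uids = []  # links that also serve as an identifier

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # The class attribute is a space-separated list, so split before testing.
        classes = (attr_map.get("class") or "").split()
        if "url" in classes:
            self.urls.append(href)
        if "uid" in classes:
            self.uids.append(href)


# Hypothetical markup: one plain link, and one link that doubles as an identifier.
SAMPLE = """
<span class="citation">
  <a class="url" href="http://example.org/papers/some-paper.pdf">Some Paper</a>
  <a class="url uid" href="http://resolver.example.org/isbn/0395363411">ISBN</a>
</span>
"""

parser = CitationLinkParser()
parser.feed(SAMPLE)
print(parser.urls)
print(parser.uids)
```

Note that the parser never looks inside the href to recover the ISBN; it only
records which links the author marked as identifiers.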

I'd like to get feedback on whether or not the list likes the idea of
a URL field as outlined above - separate from the issue of URNs and
metadata recovery.

The use case I'm focused on is here:
http://microformats.org/wiki/citation-brainstorming#Acquiring_reference_information_from_the_web

Thanks,
-mike

--
Michael McCracken
UCSD CSE PhD Candidate
research: http://www.cse.ucsd.edu/~mmccrack/
misc: http://michael-mccracken.net/wp/
_______________________________________________
microformats-discuss mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-discuss
