On 12/2/06, Mike Schinkel <[EMAIL PROTECTED]> wrote:
A couple of points on this subject. I have recently been doing a *lot* of research in the area of URLs/URIs and having discussions with numerous people on the REST-discuss and www-TAG lists, so I feel I'm pretty well versed on this subject now.

Although it is possible to infer an ISBN or maybe even a DOI from a URL, it is considered "Bad Practice" unless the "URI Authority" (i.e. the owner of the website) has specifically documented the structure of the URL and given a reasonably trustworthy guarantee that it will not change. References:

1.) "Architecture of the World Wide Web, Volume One" section 2.5 on "URI Opacity" [1]:

    Good practice: URI opacity
    Agents making use of URIs SHOULD NOT attempt to infer properties of the referenced resource.

2.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI metadata" [2]

    Constraint: Web software MUST NOT depend on the correctness of metadata inferred from a URI, except when the encoding of such metadata is documented by applicable standards and specifications.

3.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI metadata" [2]

    The principal conclusions of this finding are:

    * Assignment authorities may publish specifications detailing the structure and semantics of the URIs they assign. Other users of those URIs may use such specifications to infer information about resources identified by URI assigned by that authority.

    * People and software using URIs assigned outside of their own authority should make as few inferences as possible about a resource based on its URI. The more dependencies a piece of software has on particular constraints and inferences, the more fragile it becomes to change and the lower its generic utility.

In the case of Jon Udell's LibraryLookup, which has been referenced as an example:

    Data point: ISBNs are already being reliably extracted from URLs: http://weblog.infoworld.com/udell/stories/2002/12/11/librarylookup.html

Jon's work has been derided by purists as doing something it shouldn't, i.e. "peeking" into URLs when they should remain opaque. Personally, I don't see what Jon did as such a bad thing. Jon's script interfaces with a human only, and if Amazon ever changes their URLs his script just won't work and the user will figure that out. In the meantime, by breaking the rules he's offering pretty useful functionality that he couldn't get otherwise. And even if Amazon does change their URLs and his script breaks, which is highly unlikely given their affiliate program, Jon can just update his script, and then anyone who has a broken script can search for Jon's new version (unless Amazon eliminates the ISBN from the URL, which I would highly doubt.)

However, advocating the use of non-document metadata in a URL for a Microformat citation is a completely different matter. Rather than one author (Jon Udell) using it for one app (LibraryLookup) whose users can later get updates if required, we would be advocating it for a Microformat where authors will mark up untold HTML content, much of which will never get updated for future revisions. That requires a very high bar for immutability. IOW, we should ensure that we have a *guarantee* that the format of the URL will *never* change, or we shouldn't use it. Yes, we *could* still parse the old format, but we'd have to continue adding parsers, some of which might eventually fail for ambiguity.

At the moment, the only immutable reference for an ISBN is a URN from RFC 3187 [4]. For example:

    URN:ISBN:0-395-36341-1

This doesn't dereference in a browser (if used in IE7, for example), but one day it might. But we can be sure it is definitely immutable.

As for resolving DOIs, they are new to me and I've not done enough research to determine if there is an immutable resolvable source for DOIs. This article [5] and these websites ([6] & [7]) might be helpful there.

As an aside, please don't take this as me being unsupportive. On the contrary, I am a strong advocate of getting website owners to put metadata in their URLs and to document that metadata. 
However, until we have solid sources of URLs with documented metadata, we should probably all play smartly by the rules as specified by the W3C, at least IMO.

-Mike Schinkel
http://www.mikeschinkel.com/blogs/
http://www.welldesignedurls.org/

[1] http://www.w3.org/TR/webarch/#uri-opacity
[2] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html
[3] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html#N1023D
[4] http://www.ietf.org/rfc/rfc3187.txt
[5] http://www.dlib.org/dlib/june98/06powell.html
[6] http://www.handle.net/
[7] http://www.doi.org/
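
For readers who want to see what consuming the URN form from RFC 3187 mentioned above might look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not anything proposed in the thread: the function names `parse_isbn_urn` and `is_valid_isbn10` are hypothetical, the regular expression is a simplification rather than the full RFC 3187 grammar, and only 10-digit ISBNs are handled.

```python
import re

# Simplified pattern (an assumption, not the full RFC 3187 grammar):
# "urn:isbn:" in any letter case, followed by digits, hyphens, or X.
URN_ISBN = re.compile(r"^urn:isbn:([0-9Xx-]+)$", re.IGNORECASE)


def is_valid_isbn10(isbn):
    """Check the ISBN-10 check digit: the weighted sum must be divisible by 11."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    # Weights run from 10 down to 1; a trailing "X" stands for the value 10.
    total = sum(
        (10 - i) * (10 if ch == "X" else int(ch))
        for i, ch in enumerate(isbn)
    )
    return total % 11 == 0


def parse_isbn_urn(urn):
    """Return the bare ISBN-10 string from a urn:isbn value, or None."""
    match = URN_ISBN.match(urn.strip())
    if not match:
        return None
    isbn = match.group(1).replace("-", "").upper()
    return isbn if is_valid_isbn10(isbn) else None


if __name__ == "__main__":
    print(parse_isbn_urn("URN:ISBN:0-395-36341-1"))
```

Unlike pulling an ISBN out of a vendor's URL path, this relies only on a structure that a published specification documents, which is the distinction the quoted W3C constraint draws.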
Mike, thanks for all the detail. I definitely learned some things.

In the context of my original proposal to add a "URL" field to the microformat, I now feel like I need to separate that proposal from one of the statements I made in it:

"I also suggest that in the case of identifiers like a DOI or ISBN which can be represented as a parameter in a link to doi.org or some other resolver, that the format encourage using a URL field for those identifiers and not include separate fields for each such identifier. In other words, I think that class="url uid" is sufficient to encode DOI/ISBN/etc., and we shouldn't add a separate DOI class, a separate ISBN class, and so on."

To be clear - I still think that *if* it is possible to mark up a DOI or ISBN as a link without obscuring the DOI, then that's a positive thing. It sounds like it's just more complicated than I thought to do that. So maybe the format doesn't need to mention those in connection with the URL field.

I do think that a URL field (class="url") should be included, to represent a link to a copy of the cited work, and if we want to mark up one or more identifiers, we can use a separate class (I suggest "uid") to do so. If we're lucky and there's a good way to merge them, then use class="url uid".

I'd like to get feedback on whether or not the list likes the idea of a URL field as outlined above - separate from the issue of URNs and metadata recovery.

The use case I'm focused on is here:
http://microformats.org/wiki/citation-brainstorming#Acquiring_reference_information_from_the_web

Thanks,
-mike

--
Michael McCracken
UCSD CSE PhD Candidate
research: http://www.cse.ucsd.edu/~mmccrack/
misc: http://michael-mccracken.net/wp/
_______________________________________________
microformats-discuss mailing list
[email protected]
http://microformats.org/mailman/listinfo/microformats-discuss
