BigBlueHat opened a new issue #50: Document Identity Determination?
URL: https://github.com/apache/incubator-annotator/issues/50

Curious to get thoughts from everyone on whether having document identification determination code would be useful for this project.

By "document identification determination" I mean the process of sorting out which one (or more!) identifiers should be stored as the target. For instance:

```http
GET /?utm_source=twitter&utm_medium=social HTTP/1.1
Host: example.com
```

```html
<html>
<head>
  <base href="http://cdn.example.com/">
  <link rel="canonical" href="http://www.example.com/">
  <link rel="latest-version" href="index.html">
  <link rel="working-copy" href="newer.html">
  <link rel="ogp:url" href="https://www.example.com/">
  <link rel="schema:url" href="https://www.example.com/index.html">
</head>
```

The `?utm_`-prefixed query params are typical marketing-bot tracking thingies.
The `canonical` rel is from https://tools.ietf.org/html/rfc6596
The `latest-version` and `working-copy` rels are from https://tools.ietf.org/html/rfc5829
The `ogp:url` is from http://ogp.me/
The `schema:url` is from http://schema.org/

At some level all (or most) of these are the same (presumably 😉). However, determining their "sameness" is outside of the scope of an annotation tool (I'd reckon), but storing the right one (or more) is mandatory for the annotation to make sense.

What I'm wondering is if we should provide a basic retrieval mechanism for determining the existence and *potential* value of them to the annotation. At the very least it would be handy to get back a list of all stated identifiers for the current document.

A real-world scenario (which I just tripped over): W3C Editor's Draft specs with GitHub URLs (or hosted locally) have their future Technical Report (TR) URLs set as the `rel="canonical"` (which ReSpec injects after the page loads). Consequently, annotating the [Verifiable Claims Data Model](https://w3c.github.io/vc-data-model/) is hampered if *only* the `canonical` URL is stored (because it's not yet hit TR).

It's that "other" part of annotation creation that's so fun. 😁

💭's?
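To make the "list of all stated identifiers" idea concrete, here is a minimal sketch of what such a retrieval helper might look like. Nothing here is existing annotator API: `collectIdentifiers`, `LinkRecord`, `StatedIdentifier`, and `IDENTIFIER_RELS` are hypothetical names, and the rel list is just the set from the example above. It assumes relative `href`s resolve against `<base href>` when one is present, and it deliberately does not judge "sameness" or strip the `utm_` params; it only reports what the document states.

```typescript
// Hypothetical shape: the two attributes of a <link> element we care about.
interface LinkRecord {
  rel: string;
  href: string;
}

// One identifier the document states about itself, plus where it came from.
interface StatedIdentifier {
  source: string; // "location", or the rel token that supplied the URL
  url: string; // absolute URL
}

// Assumption: only the link relations used in the example above are collected.
const IDENTIFIER_RELS: string[] = [
  "canonical",
  "latest-version",
  "working-copy",
  "ogp:url",
  "schema:url",
];

function collectIdentifiers(
  location: string,
  links: LinkRecord[],
  baseHref?: string,
): StatedIdentifier[] {
  // Relative hrefs resolve against <base href> if given, else the document URL.
  const base = new URL(baseHref ?? location, location).href;
  const found: StatedIdentifier[] = [
    { source: "location", url: new URL(location).href },
  ];
  for (const link of links) {
    // rel is a space-separated list of tokens, compared case-insensitively.
    const tokens = link.rel.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
    for (const rel of tokens) {
      if (IDENTIFIER_RELS.indexOf(rel) === -1) continue;
      try {
        found.push({ source: rel, url: new URL(link.href, base).href });
      } catch {
        // Skip hrefs that cannot be parsed as URLs.
      }
    }
  }
  return found;
}

// Usage with the values from the example above.
const stated = collectIdentifiers(
  "http://example.com/?utm_source=twitter&utm_medium=social",
  [
    { rel: "canonical", href: "http://www.example.com/" },
    { rel: "latest-version", href: "index.html" },
    { rel: "working-copy", href: "newer.html" },
    { rel: "ogp:url", href: "https://www.example.com/" },
    { rel: "schema:url", href: "https://www.example.com/index.html" },
  ],
  "http://cdn.example.com/",
);
console.log(stated);
```

In a browser the `links` array could be built from `document.querySelectorAll("link[rel][href]")`, with `document.location.href` as the first argument. Keeping the function free of DOM access is only a choice made for this sketch, so that it can run outside a page; deciding which of the returned identifiers to store is left to the caller.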