[GitHub] [incubator-annotator] Treora opened a new issue #75: Support TextPositionSelector (in the dom package)

GitBox Sun, 03 May 2020 09:55:13 -0700


Treora opened a new issue #75:
URL: https://github.com/apache/incubator-annotator/issues/75

We should support TextPositionSelector, following its
[specification](https://www.w3.org/TR/2017/REC-annotation-model-20170223/#text-position-selector).

Although it looks simple, there may be challenges in ensuring we count
characters correctly. From the spec (in the TextQuoteSelector section, which
the TextPositionSelector section then refers to):

> The selection of the text MUST be in terms of unicode code points (the
"character number"), not in terms of code units (that number expressed using a
selected data type). Selections SHOULD NOT start or end in the middle of a
grapheme cluster. The selection MUST be based on the logical order of the text,
rather than the visual order, especially for bidirectional text. For more
information about the character model of text used on the web, see charmod.
>
> The text MUST be normalized before recording in the Annotation. Thus
HTML/XML tags SHOULD be removed, and character entities SHOULD be replaced with
the character that they encode.

The referenced ‘charmod’ (Character Model for the WWW) has [a
section](https://www.w3.org/TR/charmod/#sec-stringIndexing) on string indexing
that may be relevant.
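
To illustrate the code point versus code unit distinction quoted above, here is a minimal sketch in TypeScript. The helper name `codePointLength` is made up for this example and is not part of any existing package; it only shows that a JavaScript string's `length` counts UTF-16 code units, so characters outside the Basic Multilingual Plane would be counted twice unless surrogate pairs are handled.

```typescript
// Hypothetical helper: count Unicode code points instead of UTF-16 code units.
function codePointLength(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes a single code point.
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate, it belongs to the same code point
      }
    }
    count++;
  }
  return count;
}

// "a", U+1F600 (written as a surrogate pair), "b"
const sample = "a\uD83D\uDE00b";
console.log(sample.length);           // 4 (UTF-16 code units)
console.log(codePointLength(sample)); // 3 (code points)
```

Note that this does nothing about grapheme clusters, which the spec says selections should not split; that would need separate handling.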

What still confuses me a little is what constitutes the exact text of a DOM.
Given that normalisation should (why not *must*?) remove html tags, I suppose
this assumes we deal with the source html.

What then to do with comments: are those text, or are their `<!-- ... -->`
parts to be removed? In the latter case, would the document’s total text equal
the
[textContent](https://developer.mozilla.org/en-US/docs/Web/API/Node/textContent)
of all children of the Document? (one may think
`document.documentElement.textContent`, but that excludes whitespace and
comments outside the `<html>` element)
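
As a rough sketch of the "textContent of all children of the Document" idea, the function below concatenates the `textContent` of each child node. The names `NodeLike`, `DocumentLike` and `documentText` are assumptions made for this example; the interfaces are kept minimal so that a real `Document` fits them structurally, while the snippet itself can also run outside a browser.

```typescript
// Minimal structural types (hypothetical), so the sketch does not require a DOM.
interface NodeLike {
  textContent: string | null;
}
interface DocumentLike {
  childNodes: ArrayLike<NodeLike>;
}

// Concatenate the textContent of every direct child of the document.
function documentText(doc: DocumentLike): string {
  let text = "";
  for (let i = 0; i < doc.childNodes.length; i++) {
    const content = doc.childNodes[i].textContent;
    // A doctype node reports null; treat that as contributing no text.
    if (content !== null) {
      text += content;
    }
  }
  return text;
}

// Stand-in for a parsed document: a doctype, a top-level comment, and <html>.
const fakeDocument: DocumentLike = {
  childNodes: [
    { textContent: null },
    { textContent: " top-level comment " },
    { textContent: "Hello world" },
  ],
};
console.log(JSON.stringify(documentText(fakeDocument)));
```

One caveat with this approach: the `textContent` of a comment node is the comment's data, whereas the `textContent` of an element skips comments among its descendants. Top-level comments would therefore be counted while nested ones would not, so this sketch probably does not answer the question consistently either way.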

Possibly more problematic, can one even access the source html accurately
enough through the DOM? Might a source parser have modified whitespace, thus
leading to miscounts? I am not even talking about executed scripts that may
modify the DOM too; I suppose we have to disregard that scenario.

Of course there are implementations already whose approach and behaviour we
could copy, but it may be good to do the exercise of implementing based on the
spec to ensure that it matches up, also to help detect discrepancies between
implementations and spots where the spec may need to be improved/updated.

Any differences between implementations would likely result in misanchored
annotations, so doing this imprecisely seems of little value, unless its use is
explicitly limited to, e.g., selector refinement within text nodes, which
could be a strategy to take.
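
For the limited case of refinement within a single text node, a sketch of the needed conversion could look as follows. It maps an offset expressed in code points (as the spec requires) to an offset in UTF-16 code units (as DOM APIs expect). The function name `codePointToCodeUnitOffset` is hypothetical and this is not how any existing implementation is known to do it.

```typescript
// Hypothetical helper: convert a code point offset into a UTF-16 code unit offset
// within one string, e.g. the data of a single text node.
function codePointToCodeUnitOffset(text: string, codePointOffset: number): number {
  if (codePointOffset < 0) {
    throw new RangeError("Offset must not be negative");
  }
  let units = 0;  // position in UTF-16 code units
  let points = 0; // number of code points consumed so far
  while (points < codePointOffset && units < text.length) {
    const unit = text.charCodeAt(units);
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff &&
      units + 1 < text.length &&
      text.charCodeAt(units + 1) >= 0xdc00 &&
      text.charCodeAt(units + 1) <= 0xdfff;
    units += isPair ? 2 : 1; // a surrogate pair is one code point, two code units
    points++;
  }
  if (points < codePointOffset) {
    throw new RangeError("Offset exceeds the number of code points in the text");
  }
  return units;
}

// "a", U+1F600 (a surrogate pair), "b": code point offset 2 points just before "b".
console.log(codePointToCodeUnitOffset("a\uD83D\uDE00b", 2)); // 3
```

In a browser, the resulting number could be passed as the offset to `Range.setStart(textNode, offset)`, since offsets into a text node's data are counted in code units.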

@tilgovi (or others): what are your thoughts about this, and about the
implementation as it is done in
[dom-anchor-text-position](https://github.com/tilgovi/dom-anchor-text-position/),
in Hypothesis, or elsewhere?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
