Re: DOM Iteration (was Re: Just a simple example?)

Sasha Goodman Mon, 15 May 2017 12:19:50 -0700

Here is a demo of simple annotation, thanks to Benjamin:

https://predict-r.github.io/annotation-model/



On Fri, May 12, 2017 at 12:19 PM Sasha Goodman <[email protected]>
wrote:

> I would be delighted if my efforts were useful in this project!!!
> Regarding that code, if any parts are used it would make my week. The class
> structure is sorta self-documented by the standard, and combined with
> builders, the classes can accommodate a variety of motives.
>
> Highlighting is the most common motive now (correct me if I'm wrong). My
> gut-feeling is that to get the support and time of hard core annotators,
> the code needs to accommodate the idiosyncrasies of highlighting first. For
> example, if there are thousands of highlights on a page, an annotation
> builder might iterate/walk the document just once and fill in the thousands
> of highlights in one pass. Also, a highlighting app would probably need to
> modify the source document by inserting spans and such.
>
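A minimal sketch of the one-pass idea, assuming the page's text nodes are represented only by their lengths and that highlights are sorted, non-overlapping character ranges over the concatenated text. The name `planHighlights` and the `{ id, start, end }` shape are hypothetical and not taken from any library mentioned in this thread.

```javascript
// Hypothetical helper (illustration only).
// chunkLengths: lengths of the text nodes, in document order.
// highlights:   [{ id, start, end }] offsets into the concatenated text,
//               assumed sorted by start and non-overlapping.
// Returns one array per chunk holding the { id, from, to } slices that
// fall inside that chunk, with offsets local to the chunk.
function planHighlights(chunkLengths, highlights) {
  const plan = chunkLengths.map(() => []);
  let h = 0;          // first highlight that is not fully consumed yet
  let chunkStart = 0; // global offset where the current chunk begins

  for (let c = 0; c < chunkLengths.length && h < highlights.length; c++) {
    const chunkEnd = chunkStart + chunkLengths[c];
    // Consume every highlight that overlaps this chunk.
    while (h < highlights.length && highlights[h].start < chunkEnd) {
      const { id, start, end } = highlights[h];
      const from = Math.max(start, chunkStart) - chunkStart;
      const to = Math.min(end, chunkEnd) - chunkStart;
      if (to > from) plan[c].push({ id, from, to });
      if (end > chunkEnd) break; // this highlight continues in the next chunk
      h++;
    }
    chunkStart = chunkEnd;
  }
  return plan;
}

// Three text nodes of 5, 3 and 4 characters, and three highlights.
const plan = planHighlights(
  [5, 3, 4],
  [
    { id: 'a', start: 1, end: 3 },
    { id: 'b', start: 4, end: 9 },
    { id: 'c', start: 10, end: 12 },
  ]
);
console.log(JSON.stringify(plan));
```

Both the chunk index and the highlight index only move forward, so the work grows with the number of text nodes plus the number of highlights rather than their product. Overlapping highlights would need a different bookkeeping scheme.
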
> If Randall needs familiar code for node iteration, tree walking, range
> splitting, string similarity and normalization, that's cool! Custom code,
> *especially* polyfill-type implementations, could smooth over browser
> idiosyncrasies. Also, I saw a jsPerf microbenchmark that put custom
> walkers on par with the native browser-based ones.
>
> On a personal note, I do archival work and did not initially see the value
> in modifying the source document by inserting spans (however, a highlight
> app would need that). The main reason I'm excited about annotation is its
> value for labeling data for text analysis and machine learning. A lot of
> the advancements in machine learning are because of large bodies of data
> that have been tagged. The most common examples are usually of images that
> have regions selected and then labeled, but annotation could also help turn
> semi-structured text into more structured text data (e.g. for labeling
> parts of government documents). For archival work on mostly static
> documents, there does not seem to be a need to modify the source document. On
> the other hand, for dynamically changing documents, inserting spans with
> unique IDs seems appropriate because it's more robust to document changes.
> Yet, it is also vulnerable to turf battles with other extensions and the
> page's own JavaScript, so I hope it's not a requirement of the Apache
> library but rather a feature.
>
>
> On Thu, May 11, 2017 at 1:43 PM Benjamin Young <[email protected]>
> wrote:
>
>> Exciting to see this conversation happening. ^_^
>>
>>
>> Randall, how feasible would it be to bring your libraries (even via
>> copy/paste) into the Apache Annotator repo soon? I believe (according to
>> GitHub) you're the author/owner of 90%+ of the code in them and are
>> (consequently) able to do that if you believe that's the right step.
>>
>>
>> Sasha, your classes modeled around the selector and a "builder" sound
>> very similar to the hopes I wrote up in
>> https://cwiki.apache.org/confluence/display/ANNO/Planning
>>
>>
>> I'd very much like to combine these efforts in some way.
>>
>>
>> Additionally--and the thing driving me personally at the moment--I have
>> to present on Apache Annotator next Wednesday!
>>
>> https://apachecon2017.sched.com/event/AbBW
>>
>>
>> Consequently, I'd very much love it if we (collectively) could build a
>> demo together! There's plenty to talk about wrt annotation, community
>> building, the Web Annotation Data Model & Protocol, as well as why we (those
>> of us who are here, at least) have chosen to start collaborating at the ASF.
>>
>>
>> At any rate, I plan to be coding on all the things leading up to
>> Wednesday, so I'd be most grateful for any help, input, pointers, and code
>> (hehe) that anyone wants to toss in ahead of my codez. Let's code together!
>>
>>
>> Thanks, all!
>>
>> Benjamin
>>
>> --
>>
>> http://bigbluehat.com/
>>
>> http://linkedin.com/in/benjaminyoung
>>
>> ________________________________
>> From: Randall Leeds <[email protected]>
>> Sent: Thursday, May 11, 2017 3:34:24 PM
>> To: [email protected]
>> Subject: DOM Iteration (was Re: Just a simple example?)
>>
>> Great to see you here, Sasha!
>>
>> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <[email protected]>
>> wrote:
>>
>> >
>> > P.S. This afternoon I streamlined the TextQuoteSelector and
>> > TextPositionSelector to work (in principle) consistently with Randall
>> > Leeds's implementation that used NodeIterator and textContent.
>> >
>> >
>> Neat :).
>>
>> I think my takeaway from the simple example thread, and something of which
>> many of us were likely already well aware, is that there's a desire for a
>> good highlighter implementation. A way to highlight text is often the
>> first example people want to see.
>>
>> While I hope to see experimentation with implementations that try to limit
>> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
>> still the easiest to understand. In this approach, the actual wrapping is
>> easy. The difficult part is iteration.
>>
>> Now, some quick background on node iteration.
>>
>> I chose to use NodeIterator rather than TreeWalker for my dom-seek library
>> because it meant that the seek function could be stateless, support
>> seeking forward and backward, and still be able to return the number of
>> characters consumed by a seek. The desire to know whether to include the
>> current node's content in the seek count is fulfilled by NodeIterator's
>> "pointerBeforeReferenceNode". Essentially, a NodeIterator stores a point
>> before or after a node, rather than simply a current node.
>>
>> However, using NodeIterator to traverse a Range is not really great. Since
>> its referenceNode is read-only, the best that can be done is to start with
>> the commonAncestorContainer of the Range. Range has compareNode,
>> comparePoint, and isPointInRange. I have no idea how expensive these are.
>> Iterating all the nodes under the commonAncestorContainer doesn't feel
>> great to begin with. TreeWalker might be more appropriate since its
>> currentNode could be set to startContainer directly. TreeWalker also
>> appears to have consistent platform support.
>>
>> All of this is complicated by the Range being able to point to offsets
>> within text nodes. For the purposes of highlighting with wrapper elements
>> it's necessary to split the boundary nodes. I think there are probably a
>> number of libraries for this, but I propose we write one under our repo.
>>
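A minimal sketch of boundary splitting, assuming a standard DOM `Range` and using `Text.splitText()`. The name `splitBoundaries` is hypothetical, and this is one possible shape for such a utility rather than the library proposed above.

```javascript
const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE

// Hypothetical helper: splits the text nodes at the range's boundaries so
// that the range begins and ends on text node edges.
function splitBoundaries(range) {
  const { startContainer, startOffset, endContainer, endOffset } = range;

  // Split the end first. Splitting the start first would shorten the node
  // and invalidate endOffset when both boundaries share one text node.
  if (
    endContainer.nodeType === TEXT_NODE &&
    endOffset > 0 &&
    endOffset < endContainer.length
  ) {
    endContainer.splitText(endOffset);
  }

  if (
    startContainer.nodeType === TEXT_NODE &&
    startOffset > 0 &&
    startOffset < startContainer.length
  ) {
    // splitText() returns the new node holding the text after the offset.
    const tail = startContainer.splitText(startOffset);
    range.setStart(tail, 0);
  }
}
```

In a browser, a live Range adjusts its end boundary by itself when the start split moves text into a new node, so only the start needs to be reset explicitly. Boundaries that sit in element nodes are left untouched by this sketch.
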
>> We might also find that normalizing the endpoints of a Range in some
>> fashion is a helpful prerequisite. There is a library I found that does
>> this, but I found its algorithm terribly confusing. I put time into
>> rewriting it without dependencies. Despite some initial excitement, the
>> author never fully vetted and accepted my pull request:
>> https://github.com/webmodules/range-normalize/pull/2
>>
>> In conclusion, I think there'd be value in bringing some functional
>> utilities into Apache Annotator for dealing with iteration, range
>> splitting, and range normalization, with the goal of providing a very
>> succinct and simple highlighter that looks like this:
>>
>> ```
>> for (const node of textNodes(range)) {
>>   const mark = document.createElement('mark');
>>   node.replaceWith(mark);
>>   mark.appendChild(node);
>> }
>> ```
>>
>> Some care needs to be taken that whatever iteration we use is not
>> invalidated by the replacement of the text node with its wrapper.
>>
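One way the `textNodes` helper used in the loop quoted above might be written, as a sketch only: it uses a TreeWalker started at the range's startContainer and collects the nodes into an array before anything is mutated. The names `textNodes` and `highlight` are hypothetical, and the sketch assumes the range boundaries have already been split so that whole text nodes can be wrapped.

```javascript
const TEXT_NODE = 3; // numeric value of Node.TEXT_NODE
const SHOW_TEXT = 4; // numeric value of NodeFilter.SHOW_TEXT

// Hypothetical helper: returns a snapshot array of the text nodes that
// intersect `range`, in document order.
function textNodes(range) {
  const root = range.commonAncestorContainer;
  const doc = root.ownerDocument || root; // a Document has no ownerDocument
  const walker = doc.createTreeWalker(root, SHOW_TEXT);

  // Begin at the start container instead of walking from the root.
  let node = range.startContainer;
  walker.currentNode = node;
  // nextNode() never returns the current node, so a text start node is
  // kept as the first candidate; otherwise advance to the next text node.
  if (node.nodeType !== TEXT_NODE) node = walker.nextNode();

  const nodes = [];
  while (node) {
    if (range.intersectsNode(node)) {
      nodes.push(node);
    } else if (nodes.length > 0) {
      break; // first miss after a hit: we are past the end of the range
    }
    node = walker.nextNode();
  }
  return nodes;
}

// Wraps every collected text node in a <mark>. The walk has finished by
// the time any node is replaced, so the mutation cannot disturb it.
function highlight(range) {
  for (const node of textNodes(range)) {
    const mark = node.ownerDocument.createElement('mark');
    node.replaceWith(mark);
    mark.appendChild(node);
  }
}
```

Returning an array costs some memory compared with a lazy generator, but it sidesteps the question of how a live traversal reacts to nodes being moved underneath it.
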
>> The fact that a simple example like this is hard to produce is evidence of
>> the underlying complexity described in the above paragraphs. When I see
>> people wanting a simple highlighter what I hear is that they actually need
>> simple abstractions upon which to build a highlighter. The highlighter
>> itself should be easy. Often, highlighters that projects provide are not
>> shipped standalone or don't do exactly what the author needs (use spans
>> instead of marks, add a particular class, coalesce overlapping highlights
>> or not, etc). There is lots of room to do different things but being able
>> to simply get the nodes to be highlighted is the prerequisite task that
>> contains most of the complexity.
>>
>> That's all (and probably way too much) for now. Finding all the tools for
>> all these things is a pain enough that I think we should have a
>> comprehensive set of such utilities in Apache Annotator, even if that
>> risks looking like a bit of NIH syndrome.
>>
>> Unless anyone objects, I think I'll aim to ship libraries for these:
>> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
>> - Tree walking (might not need a library if support is good)
>> - Range splitting
>> - Range normalization (see my pull request reference, above)
>> - Range iterating
>> - Text distance (https://github.com/tilgovi/dom-seek)
>>
>> If anyone wants to start on any of the above, you're welcome to depend on
>> libraries that are outside Apache Annotator. In the case of libraries that
>> I've written, there is value in bringing them into Apache Annotator
>> because they are all written in ES6 but not packaged to be consumed as ES6.
>> Bringing them inside our repo means better code deduplication by tree
>> shaking in tools like rollup and webpack. They could be packaged as ES6
>> where they are, but if I'm going to spend time improving the packaging I
>> would rather just toss out the packaging and get the benefits of the
>> monorepo having all that build/test boilerplate done once for all of them.
>>
>
