Here is a demo of simple annotation, thanks to Benjamin: https://predict-r.github.io/annotation-model/
On Fri, May 12, 2017 at 12:19 PM Sasha Goodman <[email protected]> wrote:
> I would be delighted if my efforts were useful in this project!!!
> Regarding that code, if any parts are used it would make my week. The class
> structure is sorta self-documented by the standard, and combined with
> builders the classes can accommodate a variety of motives.
>
> Highlighting is the most common motive now (correct me if I'm wrong). My
> gut feeling is that to get the support and time of hard-core annotators,
> the code needs to accommodate the idiosyncrasies of highlighting first. For
> example, if there are thousands of highlights on a page, an annotation
> builder might iterate/walk the document just once and fill in the thousands
> of highlights in one pass. Also, a highlighting app would probably need to
> modify the source document by inserting spans and such.
>
> If Randall needs familiar code for node iteration, tree walking, range
> splitting, string similarity and normalization, that's cool! Custom code,
> *especially* Polyfill type implementations, could smooth over browser
> idiosyncrasies. Also, I saw a Jsperf.com microbenchmark that put custom
> walkers on par with the native browser based ones.
>
> On a personal note, I do archival work and did not initially see the value
> in modifying the source document by inserting spans (however, a highlight
> app would need that). The main reason I'm excited about annotation is its
> value for labeling data for text analysis and machine learning. A lot of
> the advancements in machine learning are because of large bodies of data
> that have been tagged. The most common examples are usually of images that
> have regions selected and then labeled, but annotation could also help turn
> semi-structured text into more structured text data (e.g. for labeling
> parts of government documents). For archival work on mostly static
> documents, there does not seem to be a need to modify the source document.
> On the other hand, for dynamically changing documents, inserting spans with
> unique IDs seems appropriate because it's more robust to document changes.
> Yet, it is also vulnerable to turf battles with other extensions and the
> page's own JavaScript, so I hope it's not a requirement of the Apache
> library but rather a feature.
>
>
> On Thu, May 11, 2017 at 1:43 PM Benjamin Young <[email protected]>
> wrote:
>
>> Exciting to see this conversation happening. ^_^
>>
>> Randall, how feasible would it be to bring (soon) your libraries (even
>> via copy/paste) into the Apache Annotator repo? I believe (according to
>> GitHub) you're author/owner of 90%+ of the code in them, and (consequently)
>> able to do that if you believe that's the right step.
>>
>> Sasha, your classes modeled around the selector and a "builder" sound
>> very similar to the hopes I wrote up in
>> https://cwiki.apache.org/confluence/display/ANNO/Planning
>>
>> I'd very much like to combine these efforts in some way.
>>
>> Additionally--and the thing driving me personally at the moment--I have
>> to present on Apache Annotator next Wednesday!
>>
>> https://apachecon2017.sched.com/event/AbBW
>>
>> Consequently, I'd very much love it if we (collectively) could build a
>> demo together! There's plenty to talk about wrt annotation, community
>> building, Web Annotation Data Model & Protocol, as well as why we (those of
>> us that are here, at least) have chosen to start collaborating at the ASF.
>>
>> At any rate, I plan to be coding on all the things leading up to
>> Wednesday, so any help, input, pointers, and code (hehe) that anyone wants
>> to toss in ahead of my codez, I'd be most grateful to code together!
>>
>> Thanks, all!
>>
>> Benjamin
>>
>> --
>>
>> http://bigbluehat.com/
>>
>> http://linkedin.com/in/benjaminyoung
>>
>> ________________________________
>> From: Randall Leeds <[email protected]>
>> Sent: Thursday, May 11, 2017 3:34:24 PM
>> To: [email protected]
>> Subject: DOM Iteration (was Re: Just a simple example?)
>>
>> Great to see you here, Sasha!
>>
>> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <[email protected]>
>> wrote:
>>
>> >
>> > P.S. This afternoon I streamlined the TextQuoteSelector and
>> > TextPositionSelector to work (in principle) consistently with Randall
>> > Leeds's implementation that used NodeIterator and textContent.
>> >
>> >
>> Neat :).
>>
>> I think my takeaway from the simple example thread, and something of which
>> many of us were likely already well aware, is that there's a desire for a
>> good highlighter implementation. A way to highlight text is often the first
>> example people want to see.
>>
>> While I hope to see experimentation with implementations that try to limit
>> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
>> still the easiest to understand. In this approach, the actual wrapping is
>> easy. The difficult part is iteration.
>>
>> Now, some quick background on node iteration.
>>
>> I chose to use NodeIterator rather than TreeWalker for my dom-seek library
>> because it meant that the seek function could be stateless, support seeking
>> forward and backward, and still be able to return the number of characters
>> consumed by a seek. The desire to know whether to include the current
>> node's content in the seek count is fulfilled by NodeIterator's
>> "pointerBeforeReferenceNode". Essentially, a NodeIterator stores a point
>> before or after a node, rather than simply a current node.
>>
>> However, using NodeIterator to traverse a Range is not really great. Since
>> its referenceNode is read-only, the best that can be done is to start with
>> the commonAncestorContainer of the Range.
>> Range has compareNode,
>> comparePoint, and isPointInRange. I have no idea how expensive these are.
>> Iterating all the nodes under the commonAncestorContainer doesn't feel
>> great to begin with. TreeWalker might be more appropriate since its
>> currentNode could be set to startContainer directly. TreeWalker also
>> appears to have consistent platform support.
>>
>> All of this is complicated by the Range being able to point to offsets
>> within text nodes. For the purposes of highlighting with wrapper elements
>> it's necessary to split the boundary nodes. I think there are probably a
>> number of libraries for this, but I propose we write one under our repo.
>>
>> We might also find that normalizing the endpoints of a Range in some
>> fashion is a helpful prerequisite. There is a library I found that does
>> this, but I found its algorithm terribly confusing. I put time into
>> rewriting it without dependencies. Despite some initial excitement, the
>> author never fully vetted and accepted my pull request:
>> https://github.com/webmodules/range-normalize/pull/2
>>
>> In conclusion, I think there'd be value in bringing some functional
>> utilities into Apache Annotator for dealing with iteration, range
>> splitting, and range normalization, with the goal of providing a very
>> succinct and simple highlighter that looks like this:
>>
>> ```
>> for (const node of textNodes(range)) {
>>   const mark = document.createElement('mark');
>>   node.replaceWith(mark);
>>   mark.appendChild(node);
>> }
>> ```
>>
>> Some care needs to be taken that whatever iteration we use is not
>> invalidated by the replacement of the text node with its wrapper.
>>
>> The fact that a simple example like this is hard to produce is evidence of
>> the underlying complexity described in the above paragraphs. When I see
>> people wanting a simple highlighter what I hear is that they actually need
>> simple abstractions upon which to build a highlighter. The highlighter
>> itself should be easy.
>> Often, highlighters that projects provide are not
>> shipped standalone or don't do exactly what the author needs (use spans
>> instead of marks, add a particular class, coalesce overlapping highlights
>> or not, etc). There is lots of room to do different things but being able
>> to simply get the nodes to be highlighted is the prerequisite task that
>> contains most of the complexity.
>>
>> That's all (and probably way too much) for now. Finding all the tools for
>> all these things is a pain enough that I think we should have a
>> comprehensive set of such utilities in Apache Annotator, even if that risks
>> looking like a bit of NIH syndrome.
>>
>> Unless anyone objects, I think I'll aim to ship libraries for these:
>> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
>> - Tree walking (might not need a library if support is good)
>> - Range splitting
>> - Range normalization (see my pull request reference, above)
>> - Range iterating
>> - Text distance (https://github.com/tilgovi/dom-seek)
>>
>> If anyone wants to start on any of the above, you're welcome to depend on
>> libraries that are outside Apache Annotator. In the case of libraries that
>> I've written, there is value to bringing them into Apache Annotator because
>> they are all written in ES6 but not packaged to be consumed as ES6.
>> Bringing them inside our repo means better code deduplication by tree
>> shaking in tools like rollup and webpack. They could be packaged as ES6
>> where they are, but if I'm going to spend time improving the packaging I
>> would rather just toss out the packaging and get the benefits of the
>> monorepo having all that build/test boilerplate done once for all of them.
>>
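
To make the "Range splitting" and "Text distance" items in the quoted list a bit more concrete, here is a rough sketch of the bookkeeping involved. It is only an illustration under some assumptions: `splitPoints` and `chunks` are made-up names, not anything from dom-seek or Apache Annotator, and the array of strings stands in for the text nodes a TreeWalker would visit in document order. Given start and end character offsets (as in a TextPositionSelector), it walks the chunks once and reports which slice of each one falls inside the span, which is where boundary text nodes would need to be split before wrapping.

```javascript
// Hypothetical helper for illustration only; not part of any existing library.
// `chunks` stands in for the text of each text node, in document order.
// `start` and `end` are character offsets into the concatenated text
// (end is exclusive), as in a TextPositionSelector.
function splitPoints(chunks, start, end) {
  const pieces = [];
  let consumed = 0; // characters seen before the current chunk
  for (let index = 0; index < chunks.length; index++) {
    const length = chunks[index].length;
    // Clamp the span to this chunk's local coordinates.
    const from = Math.max(start - consumed, 0);
    const to = Math.min(end - consumed, length);
    if (from < to) {
      pieces.push({ index, from, to });
    }
    consumed += length;
    if (consumed >= end) break; // no later chunk can overlap the span
  }
  return pieces;
}

const chunks = ['Hello ', 'brave new', ' world'];
const pieces = splitPoints(chunks, 3, 11);
const quote = pieces.map((p) => chunks[p.index].slice(p.from, p.to)).join('');
console.log(quote); // prints "lo brave"
```

In a browser, each `{ index, from, to }` entry would presumably translate into `splitText` calls on the matching text node, and the pieces should be collected before any wrapping happens so the walk is not invalidated by the DOM changes.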
