Sasha, thanks. FYI: the demo works on Chrome, but not on Firefox.
On Tue, May 16, 2017 at 12:49 AM, Sasha Goodman <[email protected]> wrote:
> Here is a demo of simple annotation, thanks to Benjamin:
>
> https://predict-r.github.io/annotation-model/
>
>
> On Fri, May 12, 2017 at 12:19 PM Sasha Goodman <[email protected]>
> wrote:
>
>> I would be delighted if my efforts were useful in this project!!!
>> Regarding that code, if any parts are used it would make my week. The class
>> structure is sorta self-documented by the standard, and combined with
>> builders the classes can accommodate a variety of motives.
>>
>> Highlighting is the most common motive now (correct me if I'm wrong). My
>> gut-feeling is that to get the support and time of hard core annotators,
>> the code needs to accommodate the idiosyncrasies of highlighting first. For
>> example, if there are thousands of highlights on a page, an annotation
>> builder might iterate/walk the document just once and fill in the thousands
>> of highlights in one pass. Also, a highlighting app would probably need to
>> modify the source document by inserting spans and such.
>>
>> If Randall needs familiar code for node iteration, tree walking, range
>> splitting, string similarity and normalization, that's cool! Custom code,
>> *especially* Polyfill type implementations, could smooth over browser
>> idiosyncrasies. Also, I saw a Jsperf.com microbenchmark that put custom
>> walkers on par with the native browser based ones.
>>
>> On a personal note, I do archival work and did not initially see the value
>> in modifying the source document by inserting spans (however, a highlight
>> app would need that). The main reason I'm excited about annotation is its
>> value for labeling data for text analysis and machine learning. A lot of
>> the advancements in machine learning are because of large bodies of data
>> that have been tagged.
>> The most common examples are usually of images that
>> have regions selected and then labeled, but annotation could also help turn
>> semi-structured text into more structured text data (e.g. for labeling
>> parts of government documents). For archival work on mostly static
>> documents, there does not seem to be a need to modify the source document.
>> On the other hand, for dynamically changing documents, inserting spans with
>> unique IDs seems appropriate because it's more robust to document changes.
>> Yet, it is also vulnerable to turf battles with other extensions and the
>> page's own javascript, so I hope it's not a requirement of the Apache
>> library but rather a feature.
>>
>>
>> On Thu, May 11, 2017 at 1:43 PM Benjamin Young <[email protected]>
>> wrote:
>>
>>> Exciting to see this conversation happening. ^_^
>>>
>>>
>>> Randall, how feasible would it be to bring (soon) your libraries (even
>>> via copy/paste) into the Apache Annotator repo? I believe (according to
>>> GitHub) you're author/owner of 90%+ of the code in them, and (consequently)
>>> able to do that if you believe that's the right step.
>>>
>>>
>>> Sasha, your classes modeled around the selector and a "builder" sound
>>> very similar to the hopes I wrote up in
>>> https://cwiki.apache.org/confluence/display/ANNO/Planning
>>>
>>>
>>> I'd very much like to combine these efforts in some way.
>>>
>>>
>>> Additionally--and the thing driving me personally at the moment--I have
>>> to present on Apache Annotator next Wednesday!
>>>
>>> https://apachecon2017.sched.com/event/AbBW
>>>
>>>
>>> Consequently, I'd very much love it if we (collectively) could build a
>>> demo together! There's plenty to talk about wrt annotation, community
>>> building, Web Annotation Data Model & Protocol, as well as why we (those
>>> of us that are here at least) have chosen to start collaborating at the ASF.
>>>
>>>
>>> At any rate, I plan to be coding on all the things leading up to
>>> Wednesday, so any help, input, pointers, and code (hehe) that anyone wants
>>> to toss in ahead of my codez, I'd be most grateful to code together!
>>>
>>>
>>> Thanks, all!
>>>
>>> Benjamin
>>>
>>> --
>>>
>>> http://bigbluehat.com/
>>>
>>> http://linkedin.com/in/benjaminyoung
>>>
>>> ________________________________
>>> From: Randall Leeds <[email protected]>
>>> Sent: Thursday, May 11, 2017 3:34:24 PM
>>> To: [email protected]
>>> Subject: DOM Iteration (was Re: Just a simple example?)
>>>
>>> Great to see you here, Sasha!
>>>
>>> On Wed, May 10, 2017 at 5:39 PM Sasha Goodman <[email protected]>
>>> wrote:
>>>
>>> >
>>> > P.S. This afternoon I streamlined the TextQuoteSelector and
>>> > TextPositionSelector to work (in principle) consistently with Randall
>>> > Leeds's implementation that used NodeIterator and textContents.
>>> >
>>> >
>>> Neat :).
>>>
>>> I think my takeaway from the simple example thread, and something of which
>>> many of us were likely already well aware, is that there's a desire for a
>>> good highlighter implementation. A way to highlight text is often the
>>> first example people want to see.
>>>
>>> While I hope to see experimentation with implementations that try to limit
>>> the impact on the DOM, I think <mark> or <span> wrapping of text nodes is
>>> still the easiest to understand. In this approach, the actual wrapping is
>>> easy. The difficult part is iteration.
>>>
>>> Now, some quick background on node iteration.
>>>
>>> I chose to use NodeIterator rather than TreeWalker for my dom-seek library
>>> because it meant that the seek function could be stateless, support
>>> seeking forward and backward, and still be able to return the number of
>>> characters consumed by a seek. The desire to know whether to include the
>>> current node's content in the seek count is fulfilled by NodeIterator's
>>> "pointerBeforeReferenceNode".
>>> Essentially, a NodeIterator stores a point
>>> before or after a node, rather than simply a current node.
>>>
>>> However, using NodeIterator to traverse a Range is not really great. Since
>>> its referenceNode is read only (there is no settable currentNode as on
>>> TreeWalker), the best that can be done is to start with
>>> the commonAncestorContainer of the Range. Range has compareNode,
>>> comparePoint, and isPointInRange. I have no idea how expensive these are.
>>> Iterating all the nodes under the commonAncestorContainer doesn't feel
>>> great to begin with. TreeWalker might be more appropriate since its
>>> currentNode could be set to startContainer directly. TreeWalker also
>>> appears to have consistent platform support.
>>>
>>> All of this is complicated by the Range being able to point to offsets
>>> within text nodes. For the purposes of highlighting with wrapper elements
>>> it's necessary to split the boundary nodes. I think there are probably a
>>> number of libraries for this, but I propose we write one under our repo.
>>>
>>> We might also find that normalizing the endpoints of a Range in some
>>> fashion is a helpful prerequisite. There is a library I found that does
>>> this, but I found its algorithm terribly confusing. I put time into
>>> rewriting it without dependencies.
>>> Despite some initial excitement, the
>>> author never fully vetted and accepted my pull request:
>>> https://github.com/webmodules/range-normalize/pull/2
>>>
>>> In conclusion, I think there'd be value in bringing some functional
>>> utilities into Apache Annotator for dealing with iteration, range
>>> splitting, and range normalization, with the goal of providing a very
>>> succinct and simple highlighter that looks like this:
>>>
>>> ```
>>> for (const node of textNodes(range)) {
>>>   const mark = document.createElement('mark');
>>>   node.replaceWith(mark);
>>>   mark.appendChild(node);
>>> }
>>> ```
>>>
>>> Some care needs to be taken that whatever iteration we use is not
>>> invalidated by the replacement of the text node with its wrapper.
>>>
>>> The fact that a simple example like this is hard to produce is evidence of
>>> the underlying complexity described in the above paragraphs. When I see
>>> people wanting a simple highlighter what I hear is that they actually need
>>> simple abstractions upon which to build a highlighter. The highlighter
>>> itself should be easy. Often, highlighters that projects provide are not
>>> shipped standalone or don't do exactly what the author needs (use spans
>>> instead of marks, add a particular class, coalesce overlapping highlights
>>> or not, etc). There is lots of room to do different things but being able
>>> to simply get the nodes to be highlighted is the prerequisite task that
>>> contains most of the complexity.
>>>
>>> That's all (and probably way too much) for now. Finding all the tools for
>>> all these things is a pain enough that I think we should have a
>>> comprehensive set of such utilities in Apache Annotator, even if that
>>> risks looking like a bit of NIH syndrome.
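
For anyone following along, here is one rough way the `textNodes(range)` helper used in the loop above might look. This is only a sketch, not code from dom-seek or any other library mentioned in this thread. It assumes both ends of the Range sit inside text nodes, it uses a small hand-written walker (`allTextNodes`, a made-up name) rather than TreeWalker, and it returns an array snapshot so that wrapping the nodes afterwards cannot invalidate the iteration. Boundaries at the very edge of a text node may leave an empty text node behind.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE

// Hypothetical helper: depth-first, document-order walk over the text nodes
// at or below `root`.
function* allTextNodes(root) {
  if (root.nodeType === TEXT_NODE) {
    yield root;
    return;
  }
  for (const child of root.childNodes) {
    yield* allTextNodes(child);
  }
}

// Hypothetical textNodes(range): returns an array (a snapshot, not a live
// iterator) of the text nodes covered by `range`, splitting the boundary
// nodes. Assumption: both boundary points are inside text nodes.
function textNodes(range) {
  const { startContainer, startOffset, endContainer, endOffset } = range;
  if (startContainer.nodeType !== TEXT_NODE ||
      endContainer.nodeType !== TEXT_NODE) {
    throw new TypeError('this sketch expects both boundaries in text nodes');
  }

  // Collect every text node from the start node through the end node.
  const nodes = [];
  let inside = false;
  for (const node of allTextNodes(range.commonAncestorContainer)) {
    if (node === startContainer) inside = true;
    if (inside) nodes.push(node);
    if (node === endContainer) break;
  }

  // Split the end first: endContainer keeps the covered head, and
  // startOffset stays valid even when both boundaries share one node.
  if (endOffset < endContainer.data.length) endContainer.splitText(endOffset);
  // The covered part of the start node is the tail returned by splitText.
  if (startOffset > 0) nodes[0] = startContainer.splitText(startOffset);

  // Drop zero-length pieces produced by boundaries at a node's edge.
  return nodes.filter((node) => node.data.length > 0);
}

// The highlighter from the message above, unchanged apart from the wrapper
// function. It needs a browser `document`.
function highlight(range) {
  for (const node of textNodes(range)) {
    const mark = document.createElement('mark');
    node.replaceWith(mark);
    mark.appendChild(node);
  }
}
```

In a browser, something like `highlight(window.getSelection().getRangeAt(0))` should wrap the selected text, provided the selection starts and ends inside text. A Range whose boundaries are element offsets would first need the kind of normalization discussed above.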
>>>
>>> Unless anyone objects, I think I'll aim to ship libraries for these:
>>> - Node iteration (https://github.com/tilgovi/dom-node-iterator)
>>> - Tree walking (might not need a library if support is good)
>>> - Range splitting
>>> - Range normalization (see my pull request reference, above)
>>> - Range iterating
>>> - Text distance (https://github.com/tilgovi/dom-seek)
>>>
>>> If anyone wants to start on any of the above, you're welcome to depend on
>>> libraries that are outside Apache Annotator. In the case of libraries that
>>> I've written, there is value to bringing them into Apache Annotator
>>> because they are all written in ES6 but not packaged to be consumed as ES6.
>>> Bringing them inside our repo means better code deduplication by tree
>>> shaking in tools like rollup and webpack. They could be packaged as ES6
>>> where they are, but if I'm going to spend time improving the packaging I
>>> would rather just toss out the packaging and get the benefits of the
>>> monorepo having all that build/test boilerplate done once for all of them.
>>>
>>
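
On the idea quoted above of walking the document just once and filling in thousands of highlights in one pass, the core of it can be sketched without any DOM at all. The function below is hypothetical (the names `locate`, `chunks`, and `positions` are made up for illustration, and this is not how dom-seek is implemented). It takes the text of each text node in document order plus a list of ascending character positions, such as the start and end values of TextPositionSelectors, and resolves every position to a node index and local offset with a single forward-only cursor. One assumption to flag: a position that falls exactly on a node boundary resolves to the start of the following node, except at the very end of the text.

```javascript
// Hypothetical one-pass lookup. `chunks` is the text of each text node in
// document order; `positions` are ascending offsets into the joined text.
function locate(chunks, positions) {
  if (chunks.length === 0) {
    throw new RangeError('there is no text to seek through');
  }
  const found = [];
  let index = 0;     // chunk currently under the cursor
  let consumed = 0;  // total length of the chunks before `index`
  for (const position of positions) {
    // The cursor only moves forward, which is what keeps this one pass.
    if (position < consumed) {
      throw new RangeError('positions must be in ascending order');
    }
    // Step over every chunk that ends at or before this position, but
    // never step past the last chunk.
    while (index < chunks.length - 1 &&
           position >= consumed + chunks[index].length) {
      consumed += chunks[index].length;
      index += 1;
    }
    const offset = position - consumed;
    if (offset > chunks[index].length) {
      throw new RangeError('position ' + position + ' is past the end of the text');
    }
    found.push({ index, offset });
  }
  return found;
}
```

The resulting pairs map directly onto `(textNode, offset)` boundary points, so each start/end pair could be turned into a Range and handed to whatever highlighter gets built from the utilities listed above.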
