Treora commented on issue #83: URL: https://github.com/apache/incubator-annotator/issues/83#issuecomment-750943153
> Prior art we could borrow from:
>
> * https://github.com/tilgovi/dom-anchor-text-quote (uses diff-match-patch)
> * https://github.com/robertknight/approx-string-match-js/

Update: @judell kindly informed us that Hypothesis has now switched from the former to the latter. See https://github.com/hypothesis/client/pull/2814 and https://github.com/hypothesis/client/pull/2779

I’d be eager to see how they compare (at least it’s supposed to be much faster now!), what could still be improved, etc.

Some observations from looking at how exactly approx-string-match-js is being used in H:

- A nifty choice is that its new implementation [gives a score using weights][1] such that it is fuzzier when matching the prefix and suffix than when matching the exact quote.
- It also accepts a hint for the position in the text where the quote is expected, and penalises matches that are further away from that position; this elegantly combines the information from a TextQuoteSelector and a TextPositionSelector.
- The score for each matched string (before weighting) seems to be [calculated][2] as 1 minus the number of errors (i.e. its Levenshtein distance) normalised by the string’s length. This makes sense, though I wonder if it might cause e.g. a one-character prefix to have an unfairly heavy influence on the match score (perhaps not significant, but dropping this thought here for later).

CC @robertknight (happy to hear if you have more research notes, experimental results or other relevant tips from your experience developing this!)

[1]: https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L109-L145
[2]: https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L55-L64

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
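
To make the scoring observation in the comment above concrete, here is a rough Python sketch of "1 minus the number of errors, normalised by the string's length", plus a weighted combination of prefix, quote and suffix scores. This is not the Hypothesis code: it compares whole strings with a plain Levenshtein distance rather than doing an approximate substring search, and the names `similarity`, `weighted_score` and the weights in `HYPOTHETICAL_WEIGHTS` are made up purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(expected: str, actual: str) -> float:
    """1 minus the number of errors, normalised by the expected string's length."""
    if not expected:
        return 1.0 if not actual else 0.0
    errors = levenshtein(expected, actual)
    return max(0.0, 1.0 - errors / len(expected))


# Assumption: illustrative weights only, not the values used by Hypothesis.
HYPOTHETICAL_WEIGHTS = {"prefix": 1.0, "quote": 3.0, "suffix": 1.0}


def weighted_score(parts: dict, weights: dict = HYPOTHETICAL_WEIGHTS) -> float:
    """Weighted average of per-part similarities.

    `parts` maps a part name to an (expected, actual) pair of strings.
    """
    total = sum(weights[name] for name in parts)
    return sum(weights[name] * similarity(expected, actual)
               for name, (expected, actual) in parts.items()) / total


if __name__ == "__main__":
    # A single error wipes out the score of a one-character prefix...
    print(similarity("a", "b"))
    # ...but barely dents the score of a longer one.
    print(similarity("the quick brown fox", "the quick brown fax"))
```

Under this normalisation one wrong character costs a one-character prefix its entire score, while the same single error in a 19-character prefix costs about 5%. That is the "unfairly heavy influence" concern raised above; how much it matters in practice depends on the weights actually chosen.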