Treora commented on issue #83:
URL: 
https://github.com/apache/incubator-annotator/issues/83#issuecomment-750943153


   > Prior art we could borrow from:
   > 
   > * https://github.com/tilgovi/dom-anchor-text-quote (uses diff-match-patch)
   > * https://github.com/robertknight/approx-string-match-js/
   
   Update: @judell kindly informed us that Hypothesis has now switched from the 
former to the latter. See https://github.com/hypothesis/client/pull/2814 and 
https://github.com/hypothesis/client/pull/2779
   
   I’d be eager to see how they compare (at least it’s supposed to be much 
faster now!), what could still be improved, etc.
   
   Some observations from looking at how exactly approx-string-match-js is 
being used in Hypothesis:
   
   - A nifty choice is that the new implementation [scores each match using 
weights][1], so that matching is fuzzier for the prefix and suffix than for 
the exact quote.
   - It also accepts a hint for the position in the text where the quote is 
expected, penalising matches that are further away from that position; 
this elegantly enables combining the information from a TextQuoteSelector and 
TextPositionSelector.
   - The score for each matched string (before weighting) seems to be 
[calculated][2] as 1 minus the number of errors (i.e. its Levenshtein distance) 
normalised by the string’s length. This makes sense, though I wonder if it 
might cause e.g. a one-character prefix to have an unfairly heavy influence on 
the match score (perhaps not significant, but dropping this thought here for 
later).
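   
   To make the scoring described in these points concrete, here is a minimal 
sketch of how I read it. It is not the Hypothesis code: the function names, the 
selector shape and the weight values are assumptions chosen for illustration, 
and it uses a plain Levenshtein distance between two strings rather than 
approx-string-match-js's approximate substring search.

```javascript
// Plain Levenshtein distance between two strings (two-row dynamic programming).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 minus the error count normalised by the expected
// string's length. An empty expected string counts as a perfect match.
function similarity(found, expected) {
  if (expected.length === 0) return 1;
  const errors = Math.min(levenshtein(found, expected), expected.length);
  return 1 - errors / expected.length;
}

// Hypothetical weights (assumption): the quote counts more than its context,
// and the position hint mostly acts as a tie-breaker.
const WEIGHTS = { quote: 50, prefix: 20, suffix: 20, position: 2 };

// Score the candidate match text.slice(start, end) against a selector of the
// hypothetical shape { exact, prefix, suffix, hint }.
function scoreCandidate(text, start, end, selector) {
  const { exact, prefix = '', suffix = '', hint } = selector;
  const quoteScore = similarity(text.slice(start, end), exact);
  const prefixScore = similarity(
    text.slice(Math.max(0, start - prefix.length), start),
    prefix
  );
  const suffixScore = similarity(text.slice(end, end + suffix.length), suffix);
  // Matches further away from the hinted position get a lower position score.
  const positionScore =
    typeof hint === 'number' ? 1 - Math.abs(start - hint) / text.length : 1;
  const raw =
    WEIGHTS.quote * quoteScore +
    WEIGHTS.prefix * prefixScore +
    WEIGHTS.suffix * suffixScore +
    WEIGHTS.position * positionScore;
  const max =
    WEIGHTS.quote + WEIGHTS.prefix + WEIGHTS.suffix + WEIGHTS.position;
  return raw / max; // normalised to [0, 1]
}

// Two identical occurrences of the quote; only the position hint separates them.
const sample = 'the cat sat on the mat, then the cat sat on the rug';
const selector = { exact: 'cat sat', prefix: 'the ', suffix: ' on', hint: 30 };
const first = scoreCandidate(sample, 4, 11, selector);
const second = scoreCandidate(sample, 33, 40, selector);
console.log(second > first); // the occurrence nearer the hint wins
```

   Under these assumptions, the one-character-prefix concern is easy to see: 
`similarity('x', 'a')` is 0, so a single wrong character in a one-character 
prefix forfeits the whole prefix weight, whereas one wrong character in a 
20-character prefix only lowers its score to 0.95.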
   
   CC @robertknight (happy to hear if you have more research notes, 
experimental results or other relevant tips from your experience developing 
this!)
   
   [1]: 
https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L109-L145
   [2]: 
https://github.com/hypothesis/client/blob/0c2871ab98e6cf0a2bbfdb4d0aba439a3ba9039a/src/annotator/anchoring/match-quote.js#L55-L64


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

