[GitHub] [incubator-annotator] robertknight edited a comment on issue #83: Fuzzy text quote matching

GitBox Thu, 24 Dec 2020 14:31:37 -0800


robertknight edited a comment on issue #83:
URL: 
https://github.com/apache/incubator-annotator/issues/83#issuecomment-751125582



   > CC @robertknight (happy to hear if you have more research notes, 
experimental results or other relevant tips from your experience developing 
this!)
   
   Aside from the ideas in Hypothesis's technical implementation, which 
you've already posted pointers to, the other Hypothesis resource I would 
suggest using is the datasets of annotations in the "Public" channel. 
Here are some I found useful for testing quote matching performance and 
accuracy:
   
   - The American Yawp project: http://www.americanyawp.com. In particular, the 
early chapters have a lot of public annotations
   - Annotations on Wikipedia: 
https://hypothes.is/search?q=url%3Ahttps%3A%2F%2Fen.wikipedia.org%2F*. In 
particular, check out articles which have many annotations made on older 
versions (say from 2018 or earlier) and have had many edits since then
   
   The new quote matching implementation in Hypothesis has a couple of areas 
where we've noticed matching quality can be improved:
   
   1. It can find spurious matches for short quotes (in particular, those of 
one or two words). In [the PR](https://github.com/hypothesis/client/pull/2779) 
I mention a couple of examples.
   2. When the match is not exact, the alignment can be sub-optimal. 
Looking at public Hypothesis annotations on 
http://www.americanyawp.com/text/01-the-new-world/, for example, you can find 
cases where the Hypothesis client draws highlights that start or end in 
unlikely places (e.g. the middle of a word).
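   
   As a rough illustration of point (1), here is a minimal sketch of 
approximate substring search with an error budget that scales with quote 
length. It is not the Hypothesis implementation: the names `approxSearch` and 
`errorBudget`, and the `ratio` and `minLength` defaults, are assumptions chosen 
only for this example.

```typescript
interface Match {
  end: number; // index in `text` just past the last matched character
  errors: number; // edit distance between the quote and the matched substring
}

// Sellers' dynamic-programming algorithm: for each position in `text`,
// compute the smallest edit distance between `pattern` and any substring
// of `text` ending at that position.
function approxSearch(text: string, pattern: string, maxErrors: number): Match[] {
  const m = pattern.length;
  let col: number[] = Array.from({ length: m + 1 }, (_, i) => i);
  const matches: Match[] = [];
  for (let j = 0; j < text.length; j++) {
    const next: number[] = new Array<number>(m + 1);
    next[0] = 0; // a match may start at any position in the text
    for (let i = 1; i <= m; i++) {
      const cost = pattern[i - 1] === text[j] ? 0 : 1;
      next[i] = Math.min(
        col[i - 1] + cost, // match or substitution
        col[i] + 1, // extra character in the text
        next[i - 1] + 1 // character missing from the text
      );
    }
    col = next;
    if (col[m] <= maxErrors) {
      matches.push({ end: j + 1, errors: col[m] });
    }
  }
  return matches;
}

// Hypothetical policy: very short quotes must match exactly, longer
// quotes may differ by a fraction of their length.
function errorBudget(quote: string, ratio = 0.2, minLength = 10): number {
  return quote.length < minLength ? 0 : Math.floor(quote.length * ratio);
}

const doc = "the cat sat on the mat";
const shortHits = approxSearch(doc, "cot", errorBudget("cot"));
const longQuote = "cat sit on the";
const longHits = approxSearch(doc, longQuote, errorBudget(longQuote));
console.log(shortHits.length, longHits);
```

   With these assumed settings the three-letter quote "cot" gets a budget of 
zero and no longer matches "cat", while the longer quote still tolerates a 
single changed character. A real policy would need tuning against datasets 
like the ones listed above.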
   
   Related to point (1), one of the goals of the new implementation was to try 
to make it easier for other Hypothesis developers and staff to understand how 
exactly the "fuzzy" aspect of "fuzzy matching" works. The thinking is that if 
it is imperfect, then there is value in at least being predictable.
   
   In terms of performance, the new implementation is indeed a lot faster in 
the worst case where there are many selectors that either do not match at all 
or match with significant edits. The actual approximate string matching code is 
pretty well optimized at this point. The lowest-hanging fruit is optimizing the 
extraction of text from the document and mapping between text positions and DOM 
(node, offset) points. If we find that we need significant improvements 
over the current implementation in the future, we'd likely need to do some 
offline processing of the document text before searching for matches.
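   
   To make the text position to (node, offset) mapping concrete, here is a 
minimal sketch that records the start offset of each text node once and then 
resolves positions with a binary search. The `TextIndex` class and `TextLike` 
interface are hypothetical names for this example, not part of Hypothesis or 
incubator-annotator; in a browser the nodes would be DOM `Text` nodes, which 
also expose a `data` string.

```typescript
// Anything with a `data` string, so DOM Text nodes fit structurally.
interface TextLike {
  data: string;
}

class TextIndex<T extends TextLike> {
  readonly text: string;
  private readonly nodes: T[];
  private readonly starts: number[] = [];

  constructor(nodes: T[]) {
    this.nodes = nodes;
    const parts: string[] = [];
    let offset = 0;
    for (const node of nodes) {
      this.starts.push(offset); // offset of this node within the full text
      parts.push(node.data);
      offset += node.data.length;
    }
    this.text = parts.join("");
  }

  // Map a position in `text` to a (node, offset) point.
  pointAt(pos: number): { node: T; offset: number } {
    if (this.nodes.length === 0 || pos < 0 || pos > this.text.length) {
      throw new RangeError(`position ${pos} is outside the indexed text`);
    }
    // Binary search for the last node whose start offset is <= pos.
    let lo = 0;
    let hi = this.starts.length - 1;
    while (lo < hi) {
      const mid = (lo + hi + 1) >> 1;
      if (this.starts[mid] <= pos) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return { node: this.nodes[lo], offset: pos - this.starts[lo] };
  }
}

const nodes = [{ data: "Hello, " }, { data: "fuzzy " }, { data: "world" }];
const index = new TextIndex(nodes);
const point = index.pointAt(9);
console.log(index.text, point.node.data, point.offset);
```

   Building the index costs one pass over the text nodes, after which each 
lookup is logarithmic in the number of nodes. The index has to be rebuilt 
whenever the document text changes.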


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
