On Wed, 2020-11-11 at 18:57 -0500, Graydon Saunders wrote:
> Useful keywords; thank you!

The late Gerald Salton of Cornell (I think Cornell) pioneered a lot of
ideas in text similarity & clustering, using vector cosines - his idea
was to consider each text as a point in an n-dimensional space, where
the dimensions are given by the set of distinct words in the corpus,
and then to measure the hypothetical angle between the lines running
from the origin to any two given texts.
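Here is a minimal sketch of that cosine idea, assuming plain
whitespace tokenization and raw word counts; real systems usually add
stemming, stop-word removal and tf-idf weighting on top of this:

    # Sketch only: each text becomes a word-count vector, and the
    # similarity is the cosine of the angle between the two vectors.
    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        a = Counter(text_a.lower().split())
        b = Counter(text_b.lower().split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm_a = math.sqrt(sum(n * n for n in a.values()))
        norm_b = math.sqrt(sum(n * n for n in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine_similarity("the cat sat on the mat",
                            "the cat lay on the rug"))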

Similarity done this way has a lot of problems, one of which is that
"dictionary.txt" turns out to be "similar" to every other document.

In the past I've done something similar to your problem using an
algorithm like:
  for each text t_i
    for each word w in t_i (in order)
      for each document d in the collection that contains w
        link { from: t_i, to: d, value: 1 }

Then repeat for phrases of two words, three words, four words, where
the value is the square of the number of words in the phrase; then add
up the values for each (t_i, d) pair and take the biggest.
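A rough Python rendering of that, assuming each text is already
tokenized into a list of words and keyed by a document id; the
function and variable names are mine, invented for illustration:

    # Build an inverted index of n-grams, then link each text to every
    # document sharing an n-gram, weighting an n-word phrase by n*n.
    from collections import defaultdict

    def best_matches(texts, max_phrase_len=4):
        # texts: dict mapping doc_id -> list of words
        index = defaultdict(set)
        for doc_id, words in texts.items():
            for n in range(1, max_phrase_len + 1):
                for i in range(len(words) - n + 1):
                    index[tuple(words[i:i + n])].add(doc_id)

        # Accumulate link values between each pair of documents.
        scores = defaultdict(int)
        for doc_id, words in texts.items():
            for n in range(1, max_phrase_len + 1):
                for i in range(len(words) - n + 1):
                    for other in index[tuple(words[i:i + n])]:
                        if other != doc_id:
                            scores[(doc_id, other)] += n * n

        # For each text, keep the neighbour with the biggest total.
        best = {}
        for (src, dst), value in scores.items():
            if src not in best or value > best[src][1]:
                best[src] = (dst, value)
        return best

You can see why it is slow: it touches every n-gram of every text,
and every document that shares it.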

But this is not a fast algorithm.

Faster might be just to take each of your input paragraphs as an "all
words" query - "Candidate similar paragraphs: ..."

Liam
        

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
