On Wed, 2020-11-11 at 18:57 -0500, Graydon Saunders wrote: > Useful keywords; thank you!
The late Gerald Salton of Cornell (I think Cornell) pioneered a lot of
ideas in text similarity & clustering, using vector cosines - his idea
was to consider each text as a point in an n-dimensional space, where
the dimensions are given by the set of distinct words in the corpus,
and then to be able to measure the hypothetical angle between lines
from the origin to any two given texts. Similarity done this way has a
lot of problems, one of which is that "dictionary.txt" turns out to be
"similar" to every other document.

In the past I've done something similar to your problem using an
algorithm like:

    for each text t_i
      for each word w in t_i (in order)
        for each document d in the collection that contains w
          link { from: t_i, to: d, value: 1 }

Then repeat for phrases of two words, three words, four words, where
value is the square of the number of words in the phrase, and then add
the values for each (t_i, d) pair, and take the biggest. But this is
not a fast algorithm.

Faster might be just to take each of your input paragraphs as an "all
words" query - "Candidate similar paragraphs: ..."

Liam

--
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations: http://www.fromoldbooks.org
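P.S. A minimal sketch of the Salton-style vector-cosine idea described above, in Python (the function name and tokenization are my own, not from any particular library): each text becomes a word-count vector over its vocabulary, and similarity is the cosine of the angle between the two vectors.

```python
# Salton-style cosine similarity between two texts, treating each text
# as a word-count vector in the space of distinct words. Illustrative
# sketch only; real systems normalize, weight (e.g. tf-idf), and stem.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)           # shared-word overlap
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

print(cosine("the cat sat", "the cat stood"))  # ~0.667
```

Note how a document that contains every word (the "dictionary.txt" case) gets a nonzero cosine against everything, which is exactly the weakness mentioned above.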
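The linking pass above might be sketched like this (all names are illustrative, not from the original code): every n-gram (n = 1..4) shared between the input text and a document adds n squared to that document's score, and the highest-scoring document wins.

```python
# Hedged sketch of the word/phrase linking algorithm: single words
# score 1 (1 squared), two-word phrases 4, three-word 9, four-word 16.
from collections import defaultdict

def ngrams(words, n):
    """All n-word phrases in a word list, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def best_match(t_i: str, collection: dict) -> str:
    t_words = t_i.lower().split()
    score = defaultdict(int)
    for n in range(1, 5):                    # words, then 2-4 word phrases
        for phrase in ngrams(t_words, n):
            for name, doc in collection.items():
                if phrase in ngrams(doc.lower().split(), n):
                    score[name] += n * n     # value = square of phrase length
    return max(score, key=score.get)

docs = {"a": "the quick brown fox", "b": "a slow green turtle"}
print(best_match("quick brown fox jumps", docs))  # -> "a"
```

As written this rescans every document for every phrase, which is why it is not a fast algorithm; an inverted index would remove the inner scan.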
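The faster "all words query" route could look something like this (again a sketch with made-up names): index the paragraphs once, then rank candidates by how many of the query paragraph's words they contain.

```python
# Inverted index for "all words" paragraph queries: build once, then
# score candidates by shared-word count. Illustrative sketch only.
from collections import defaultdict

def build_index(paragraphs):
    index = defaultdict(set)
    for i, p in enumerate(paragraphs):
        for w in set(p.lower().split()):
            index[w].add(i)             # word -> paragraph numbers
    return index

def candidates(query, index):
    hits = defaultdict(int)
    for w in set(query.lower().split()):
        for i in index.get(w, ()):
            hits[i] += 1
    # paragraphs sharing the most query words first
    return sorted(hits, key=hits.get, reverse=True)

paras = ["the cat sat on the mat", "dogs chase cats", "the mat was red"]
idx = build_index(paras)
print(candidates("the red mat", idx))  # -> [2, 0]
```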