This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3865971 add JaroWinkler description
3865971 is described below
commit 386597174ea38bb1e75dcb23bdc0a7939bf0103d
Author: Paul King <[email protected]>
AuthorDate: Sun Feb 2 09:35:14 2025 +1000
add JaroWinkler description
---
site/src/site/blog/groovy-text-similarity.adoc | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index be88b46..c0bf459 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -95,10 +95,13 @@ There are numerous tutorials that describe various string metric algorithms. We
| The minimum number of "edits" (inserts, deletes, or substitutions) required to convert from one word to another.
Distance between `kitten` and `sitting` is 3 (swap `s` for `k`, swap `i` for `e`, add `g` at end).
Distance between `grounds` and `aground` is 2 (add `a` at start, remove `s` at end).
+https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance[Damerau–Levenshtein distance]
+is a variant that allows transposition of two adjacent letters to count as a single edit.
| https://en.wikipedia.org/wiki/Jaccard_index[Jaccard]
| Defines a ratio between two sample sets. This could be sets of
-characters in a word, or words in a sentence. The ratio is the intersection of sets divided by the union of sets.
+characters in a word, or words in a sentence, or sets of `k` consecutive characters in a phrase.
+The ratio is the _intersection_ of sets divided by the _union_ of sets.
`bear` vs `bare` would be 100%, `pair` vs `pear` would be 60%.
| https://en.wikipedia.org/wiki/Hamming_distance[Hamming]
@@ -110,6 +113,13 @@ Distance between `grounds` and `aground` is 7 (swap all chars since none are in
| The maximum number of characters appearing in order in the two words, not necessarily consecutively.
LCS of `grounds` and `aground` is 6 (`ground`).
LCS of `string` and `single` is 4 (`s`, `i`, `n`, `g`).
+It accounts for insertions and deletions but not substitutions.
+
+| https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance[Jaro–Winkler]
+| This is a metric also measuring edit distance but weights edits to favor
+words with common prefixes.
+JaroWinkler of `ground` and `groudn` (last two letters swapped) is 0.97.
+JaroWinkler of `ground` and `rgound` (first two letters swapped) is 0.94.
|===
@@ -702,9 +712,11 @@ Other referenced sites:
* https://github.com/tdebatty/java-string-similarity
* https://github.com/OpenRefine/OpenRefine
* https://djl.ai/
+* https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec
Related libraries and links:
* https://github.com/EdDuarte/similarity-search-java
* https://github.com/intuit/fuzzy-matcher
* https://www.youtube.com/watch?v=AHlnGId-Y-0
+* https://opensource.com/article/20/12/groovy
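
The worked values quoted in the metrics table of this diff can be checked with a short script. The following is a minimal Python sketch, not part of the commit and not the blog's Groovy code. The function names are hypothetical, Jaccard is taken over sets of characters, and the Jaro–Winkler variant assumes the commonly used scaling factor of 0.1 and a prefix capped at 4 characters.

```python
def levenshtein(a, b):
    """Minimum number of inserts, deletes, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Intersection over union of the character sets (inputs assumed non-empty)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence (not necessarily consecutive)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaro(a, b):
    """Jaro similarity in the range 0.0 to 1.0."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a transposition each.
    a_matched = [c for c, hit in zip(a, a_hit) if hit]
    b_matched = [c for c, hit in zip(b, b_hit) if hit]
    t = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a, b, scaling=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed defaults: 0.1 and 4)."""
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * scaling * (1 - j)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"), levenshtein("grounds", "aground"))
    print(jaccard("bear", "bare"), jaccard("pair", "pear"))
    print(lcs_length("grounds", "aground"), lcs_length("string", "single"))
    print(round(jaro_winkler("ground", "groudn"), 2),
          round(jaro_winkler("ground", "rgound"), 2))
```

Under these assumptions the script reproduces the figures in the table. Both `groudn` and `rgound` have the same Jaro similarity to `ground`, since each contains a single transposition. Only `groudn` shares the four-character prefix `grou` with `ground`, so only it receives the prefix boost and ends up with the higher Jaro–Winkler score.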