This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new d16e2bf  minor tweaks
d16e2bf is described below

commit d16e2bf35ce3cd6dc3da2e3602097a53c0975224
Author: Paul King <[email protected]>
AuthorDate: Tue Feb 18 21:55:56 2025 +1000

    minor tweaks
---
 site/src/site/blog/groovy-text-similarity.adoc | 72 +++++++++++++-------------
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index 419ed5b..1ef8432 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -1,25 +1,24 @@
 = Groovy Text Similarity
 Paul King <paulk-asert|PMC_Member>; James King <jakingy|Contributor>
 :revdate: 2025-02-18T20:30:00+00:00
-:draft: true
 :keywords: groovy, deep learning, apache commons, phonetics, pytorch, tensorflow, codecs, word2vec, djl, deeplearning4j, sts, llm
 :description: This blog looks at processing some algorithms for testing text similarity.
 
 == Introduction
 
-Let's build a wordle-like word guessing game. But instead of telling you how many
-correct and misplaced letters, we'll give you more hints, but slightly less obvious
-ones, to make it a little more challenging!
-
-We won't (directly) even tell you how many letters are in the word,
-but we'll give hints like:
-
-* How close your guess _sounds_ like the hidden word.
-* How close your guess is to the _meaning_ of the hidden word.
-* Instead of correct and misplaced letters, we'll give you some _distance
-and similarity measures_ which will give you clues about how many
-correct letters you have, whether you have the correct letters in order,
-and so forth.
+> Let's build a wordle-like word guessing game. But instead of telling you how many
+> correct and misplaced letters, we'll give you more hints, but slightly less obvious
+> ones, to make it a little more challenging!
+>
+> We won't (directly) even tell you how many letters are in the word,
+> but we'll give hints like:
+>
+> * How close your guess _sounds_ like the hidden word.
+> * How close your guess is to the _meaning_ of the hidden word.
+> * Instead of correct and misplaced letters, we'll give you some _distance
+> and similarity measures_ which will give you clues about how many
+> correct letters you have, whether you have the correct letters in order,
+> and so forth.
 
 So, we're thinking of a game that is a cross between other games.
 Guessing letters of a word like
@@ -36,7 +35,7 @@ Our goals here aren't to polish a production ready version of the game, but to:
 * Show off the latest releases from Apache Commons Text and Apache Commons Codec
 * Give you insight into string-metric similarity algorithms
 * Give you insight into phonetic similarity algorithms
-* Give you insight into semantic textual similarity (STS) algorithms powered by machine learning and deep neural networks using technologies like PyTorch, Tensorflow, and Word2vec
+* Give you insight into semantic textual similarity (STS) algorithms powered by machine learning and deep neural networks using frameworks like PyTorch, TensorFlow, and Word2vec, and technologies like large language models (LLMs) and BERT
 * Highlight how easy it is to play with the above technologies using Apache Groovy
 
 If you are new to Groovy, consider checking out this
@@ -99,7 +98,7 @@ Then we'll look at some libraries for phonetic matching:
 
 Then we'll look at some deep learning options for semantic matching:
 
-* `org.deeplearning4j:deeplearning4j-nlp` for GloVe, ConceptNet, and FastText models
+* `org.deeplearning4j:deeplearning4j-nlp` for GloVe, ConceptNet, and fastText models
 * `ai.djl` with Pytorch for a universal-sentence-encoder model and Tensorflow with an AnglE model
 
 == Simple String Metrics
@@ -630,7 +629,7 @@ are applicable in all contexts (very roughly).
 We'll look at three models which use this approach.
 https://en.wikipedia.org/wiki/Word2vec[Word2vec] by Google Research,
 https://nlp.stanford.edu/projects/glove/[GloVe] by Stanford NLP, and
-https://fasttext.cc/[FastText] by Facebook Research.
+https://fasttext.cc/[fastText] by Facebook Research.
 We'll use
 https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J]
 to load and use these models.
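All three models reduce word similarity to comparing vectors; a method like `similarity` is essentially cosine similarity between two embedding vectors. As a rough self-contained sketch (the 3-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this sketch only (not real model output)
vectors = {
    'cow':    [0.9, 0.8, 0.1],
    'bovine': [0.8, 0.9, 0.2],
    'horse':  [0.6, 0.4, 0.3],
}

sims = {w: round(cosine_similarity(vectors['cow'], v), 2)
        for w, v in vectors.items() if w != 'cow'}
```

With a trained model, semantically close words like 'cow' and 'bovine' end up with nearby vectors, so their cosine similarity is higher than for less related words like 'horse'.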
@@ -650,7 +649,7 @@ model is trained and optimized for greater-than-word length text, such as senten
 https://www.tensorflow.org/[TensorFlow].
 We'll use the https://djl.ai/[Deep Java Library] to load and use both of these models on the JDK.
 
-=== FastText
+=== fastText
 Many Word2Vec libraries can read and write models to a standard file format based on the original https://github.com/tmikolov/word2vec[Google implementation].
 
 For example, we can download some of the pre-trained models from the https://radimrehurek.com/gensim/models/word2vec.html[Gensim Python library], convert them to the Word2Vec format, and then open them with https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J].
@@ -662,10 +661,10 @@ glove_vectors = gensim.downloader.load('fasttext-wiki-news-subwords-300')
 glove_vectors.save_word2vec_format("fasttext-wiki-news-subwords-300.bin", binary=True)
 ----
 
-https://huggingface.co/fse/fasttext-wiki-news-subwords-300[This model] has
+This https://huggingface.co/fse/fasttext-wiki-news-subwords-300[fastText model] has
 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
 
-We simply serialize the https://fasttext.cc/[FastText] model into our Word2Vec representation,
+We simply serialize the https://fasttext.cc/[fastText] model into our Word2Vec representation,
 and can then call methods like `similarity` and `wordsNearest` as shown here:
 
 [source,groovy]
@@ -674,7 +673,7 @@ var modelName = 'fasttext-wiki-news-subwords-300.bin'
 var path = Paths.get(ConceptNet.classLoader.getResource(modelName).toURI()).toFile()
 Word2Vec model = WordVectorSerializer.readWord2VecModel(path)
 String[] words = ['bull', 'calf', 'bovine', 'cattle', 'livestock', 'horse']
-println """GloVe similarity to cow: ${
+println """fastText similarity to cow: ${
     words
         .collectEntries { [it, model.similarity('cow', it)] }
         .sort { -it.value }
@@ -687,7 +686,7 @@ Nearest words in vocab: ${model.wordsNearest('cow', 4)}
 Which gives this output:
 
 ----
-FastText similarity to cow: [bovine:0.72, cattle:0.70, calf:0.67, bull:0.67, livestock:0.61, horse:0.60]
+fastText similarity to cow: [bovine:0.72, cattle:0.70, calf:0.67, bull:0.67, livestock:0.61, horse:0.60]
 Nearest words in vocab: [cows, goat, pig, bovine]
 ----
 
@@ -1177,7 +1176,7 @@ Levenshtein                    Distance: 10, Insert: 0, Delete: 3, Substitute: 7
 Jaccard                        0%
 JaroWinkler                    PREFIX 0% / SUFFIX 0%
 Phonetic                       Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning                        AnglE 45% / Use 21% / ConceptNet 2% / GloVe -4% / FastText 19%
+Meaning                        AnglE 45% / Use 21% / ConceptNet 2% / GloVe -4% / fastText 19%
 ----
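The Soundex codes in the Phonetic row (such as A136) come from Commons Codec in the post itself, but the classic American Soundex algorithm is small enough to sketch. A simplified Python version (not the Commons Codec implementation; it assumes a plain lowercase word and skips edge cases like non-letters):

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
             **dict.fromkeys('dt', '3'), 'l': '4',
             **dict.fromkeys('mn', '5'), 'r': '6'}
    word = word.lower()
    digits = []
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        if ch in 'hw':
            continue  # 'h' and 'w' are ignored and don't separate duplicate codes
        code = codes.get(ch, '')
        if code and code != prev:  # vowels reset prev, allowing repeats later
            digits.append(code)
        prev = code
    return (word[0].upper() + ''.join(digits) + '000')[:4]
```

For example, `soundex('aftershock')` yields `'A136'`, matching the guess scored above; two words that share a code sound roughly alike to this (fairly crude) encoding.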
 
 It looks like we really bombed out, but in fact this is good news. What did we learn:
@@ -1205,7 +1204,7 @@ Levenshtein                    Distance: 6, Insert: 2, Delete: 0, Substitute: 4
 Jaccard                        22%
 JaroWinkler                    PREFIX 56% / SUFFIX 45%
 Phonetic                       Metaphone=FRT 39% / Soundex=F630 0%
-Meaning                        AnglE 64% / Use 41% / ConceptNet 37% / GloVe 31% / FastText 44%
+Meaning                        AnglE 64% / Use 41% / ConceptNet 37% / GloVe 31% / fastText 44%
 ----
 
 What did we learn?
@@ -1233,7 +1232,7 @@ Levenshtein                    Distance: 1, Insert: 0, Delete: 0, Substitute: 1
 Jaccard                        71%
 JaroWinkler                    PREFIX 90% / SUFFIX 96%
 Phonetic                       Metaphone=BTNK 79% / Soundex=B352 75%
-Meaning                        AnglE 52% / Use 35% / ConceptNet 2% / GloVe 4% / FastText 25%
+Meaning                        AnglE 52% / Use 35% / ConceptNet 2% / GloVe 4% / fastText 25%
 ----
 
 We have 6 letters right in a row and 5 of the 6 distinct letters.
@@ -1249,7 +1248,7 @@ Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=PTNK 100% / Soundex=P352 100%
-Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / fastText 100%
 
 Congratulations, you guessed correctly!
 ----
@@ -1264,7 +1263,7 @@ Levenshtein                    Distance: 7, Insert: 4, Delete: 0, Substitute: 3
 Jaccard                        22%
 JaroWinkler                    PREFIX 42% / SUFFIX 46%
 Phonetic                       Metaphone=BL 38% / Soundex=B400 25%
-Meaning                        AnglE 46% / Use 40% / ConceptNet 0% / GloVe 0% / FastText 31%
+Meaning                        AnglE 46% / Use 40% / ConceptNet 0% / GloVe 0% / fastText 31%
 ----
 * Since LCS is 1, [fuchsia]#the letters shared with the hidden word are in the reverse order#.
 * There were 4 inserts and 0 deletes which means [fuchsia]#the hidden word has 8 letters#.
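The Insert/Delete/Substitute breakdown is what makes the length deduction possible: the hidden word's length is always the guess's length plus inserts minus deletes. A rough self-contained Python sketch of Levenshtein distance with an operation-count backtrace (an illustration of the idea, not the Commons Text code):

```python
def levenshtein_ops(guess, hidden):
    """Return (distance, inserts, deletes, substitutes) via dynamic programming."""
    m, n = len(guess), len(hidden)
    # dp[i][j] = edit distance between guess[:i] and hidden[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if guess[i - 1] == hidden[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete from guess
                           dp[i][j - 1] + 1,          # insert into guess
                           dp[i - 1][j - 1] + cost)   # match or substitute
    # Backtrace one optimal path, counting each operation type
    ins = dele = sub = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (guess[i - 1] != hidden[j - 1]):
            sub += guess[i - 1] != hidden[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dele += 1
            i -= 1
    return dp[m][n], ins, dele, sub
```

For example, `levenshtein_ops('kitten', 'sitting')` gives `(3, 1, 0, 2)`. The distance is unique, but several optimal paths can exist, so a different backtrace order may report a different (equally valid) operation mix; the length identity `len(guess) + inserts - deletes == len(hidden)` holds either way.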
@@ -1277,7 +1276,7 @@ Levenshtein                    Distance: 6, Insert: 5, Delete: 0, Substitute: 1
 Jaccard                        25%
 JaroWinkler                    PREFIX 47% / SUFFIX 0%
 Phonetic                       Metaphone=LK 38% / Soundex=L200 0%
-Meaning                        AnglE 50% / Use 18% / ConceptNet 11% / GloVe 13% / FastText 37%
+Meaning                        AnglE 50% / Use 18% / ConceptNet 11% / GloVe 13% / fastText 37%
 ----
 * Jaccard of 2 / 8 tells us [fuchsia]#two of the letters in 'leg' appear in the hidden word#.
 * LCS of 2 tells us that [fuchsia]#they appear in the same order as in the hidden word#.
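Both of those hints are cheap to compute: Jaccard is the ratio of shared distinct letters to all distinct letters across the two words, and LCS counts letters that appear in the same order (not necessarily contiguously). A minimal Python sketch, using 'elephant' as an invented stand-in for the hidden word (chosen only because it also gives 2/8 against 'leg'):

```python
def jaccard(a, b):
    """|shared distinct letters| / |all distinct letters| across both words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def lcs_length(a, b):
    """Length of the longest common subsequence: letters in order, gaps allowed."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(jaccard('leg', 'elephant'))     # {'l','e'} shared out of 8 distinct letters -> 0.25
print(lcs_length('leg', 'elephant'))  # 'l' then 'e' appear in order -> 2
```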
@@ -1294,7 +1293,7 @@ Levenshtein                    Distance: 8, Insert: 0, Delete: 0, Substitute: 8
 Jaccard                        15%
 JaroWinkler                    PREFIX 50% / SUFFIX 50%
 Phonetic                       Metaphone=LNKX 34% / Soundex=L522 0%
-Meaning                        AnglE 46% / Use 12% / ConceptNet -11% / GloVe -4% / FastText 25%
+Meaning                        AnglE 46% / Use 12% / ConceptNet -11% / GloVe -4% / fastText 25%
 ----
 * 8 substitutions means [fuchsia]#none of the letters are in the same spot as 'languish'#.
 
@@ -1307,7 +1306,7 @@ Levenshtein                    Distance: 4, Insert: 0, Delete: 0, Substitute: 4
 Jaccard                        40%
 JaroWinkler                    PREFIX 83% / SUFFIX 75%
 Phonetic                       Metaphone=ELKXN 50% / Soundex=E423 75%
-Meaning                        AnglE 47% / Use 13% / ConceptNet -5% / GloVe -7% / FastText 26%
+Meaning                        AnglE 47% / Use 13% / ConceptNet -5% / GloVe -7% / fastText 26%
 ----
 * Jaccard tells us we have 4 distinct letters shared with the hidden word and yet we have a LCS of 5. [fuchsia]#The duplicate 'E' must be correct and the order of all correct letters must match the hidden word.#
 * Only 4 substitutions means [fuchsia]#8-4=4 letters are in the correct position#.
@@ -1323,7 +1322,7 @@ Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=ELFTR 100% / Soundex=E413 100%
-Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / fastText 100%
 
 Congratulations, you guessed correctly!
 ----
@@ -1338,7 +1337,7 @@ Levenshtein                    Distance: 8, Insert: 0, Delete: 4, Substitute: 4
 Jaccard                        50%
 JaroWinkler                    PREFIX 61% / SUFFIX 49%
 Phonetic                       Metaphone=AFTRXK 33% / Soundex=A136 25%
-Meaning                        AnglE 44% / Use 11% / ConceptNet -7% / GloVe 1% / FastText 15%
+Meaning                        AnglE 44% / Use 11% / ConceptNet -7% / GloVe 1% / fastText 15%
 ----
 
 What do we know?
@@ -1360,7 +1359,7 @@ Levenshtein                    Distance: 4, Insert: 0, Delete: 0, Substitute: 4
 Jaccard                        57%
 JaroWinkler                    PREFIX 67% / SUFFIX 67%
 Phonetic                       Metaphone=KRS 74% / Soundex=C620 75%
-Meaning                        AnglE 51% / Use 12% / ConceptNet 5% / GloVe 23% / FastText 26%
+Meaning                        AnglE 51% / Use 12% / ConceptNet 5% / GloVe 23% / fastText 26%
 ----
 
 This tells us:
@@ -1381,7 +1380,7 @@ Levenshtein                    Distance: 6, Insert: 0, Delete: 0, Substitute: 6
 Jaccard                        67%
 JaroWinkler                    PREFIX 56% / SUFFIX 56%
 Phonetic                       Metaphone=RSTS 61% / Soundex=R232 25%
-Meaning                        AnglE 54% / Use 25% / ConceptNet 18% / GloVe 18% / FastText 31%
+Meaning                        AnglE 54% / Use 25% / ConceptNet 18% / GloVe 18% / fastText 31%
 ----
 
 We learned:
@@ -1401,7 +1400,7 @@ Levenshtein                    Distance: 0, Insert: 0, Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=KRT 100% / Soundex=C630 100%
-Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 100% / fastText 100%
 
 Congratulations, you guessed correctly!
 ----
@@ -1444,6 +1443,9 @@ a chart something like this (some guesses and hints for Round 3 shown):
 
 image:img/gameBubble.png[Game BubleChart,width=70%]
 
+But these are just ideas. A production ready game is for another time.
+We hope you have enjoyed playing along and learning a little more about text similarity.
+
 == Further information [[further_info]]
 
 Source code for this post:
