This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 1da26cb minor tweaks
1da26cb is described below
commit 1da26cbed4e852701685c565b02c816b034ba2ca
Author: Paul King <[email protected]>
AuthorDate: Tue Feb 18 18:59:07 2025 +1000
minor tweaks
---
site/src/site/blog/groovy-text-similarity.adoc | 40 +++++++++++++++-----------
1 file changed, 24 insertions(+), 16 deletions(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index 36aadb7..d262508 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -2,7 +2,7 @@
Paul King <paulk-asert|PMC_Member>; James King <jakingy|Contributor>
:revdate: 2025-02-18T20:30:00+00:00
:draft: true
-:keywords: groovy, deep learning, apache commons, phonetics, pytorch, tensorflow, codecs, word2vec, djl, deeplearning4j
+:keywords: groovy, deep learning, apache commons, phonetics, pytorch, tensorflow, codecs, word2vec, djl, deeplearning4j, sts, llm
:description: This blog looks at processing some algorithms for testing text similarity.
== Introduction
@@ -22,21 +22,22 @@ correct letters you have, whether you have the correct letters in order,
and so forth.
So, we're thinking of a game that is a cross between other games.
-Guessing letters like
+Guessing letters of a word like
https://www.nytimes.com/games/wordle/index.html[Wordle],
-but with less direct clues, sort of like
-https://en.wikipedia.org/wiki/Mastermind_(board_game)[Master Mind], and
-incorporating some of the ideas behind guessing words by semantic meaning like
+but with less direct clues, sort of like how a black key peg in
+https://en.wikipedia.org/wiki/Mastermind_(board_game)[Master Mind] indicates that you
+have one of the colored code pegs in the correct position, but you don't know which one.
+It also will incorporate some of the ideas behind word-guessing games like
https://semantle.com/[Semantle], or
-https://proximity.clevergoat.com/[Proximity].
+https://proximity.clevergoat.com/[Proximity], which also use semantic meaning.
Our goals here aren't to polish a production ready version of the game, but to:
* Show off the latest releases from Apache Commons Text and Apache Commons Codec
* Give you insight into string-metric similarity algorithms
* Give you insight into phonetic similarity algorithms
-* Give you insight into semantic similarity algorithms powered by machine learning and deep neural networks using technologies like PyTorch, Tensorflow, and Word2vec
-* To highlight how easy it is to play with the above technologies using Apache Groovy
+* Give you insight into semantic textual similarity (STS) algorithms powered by machine learning and deep neural networks using technologies like PyTorch, TensorFlow, and Word2vec
+* Highlight how easy it is to play with the above technologies using Apache Groovy
If you are new to Groovy, consider checking out this
https://opensource.com/article/20/12/groovy[Groovy game building tutorial] first.
@@ -867,7 +868,7 @@ queries.each { query ->
----
For each query, we find the 5 closest phrases from the sample phrases.
-When run, the results look like this:
+When run, the results look like this (library logging elided):
----
cow
@@ -923,8 +924,8 @@ bovine (0.39)
=== UAE
The https://djl.ai/[Deep Java Library]
-also has link:++https://www.tensorflow.org/[TensorFlow]++[https://pytorch.org/[Tensorflow\]] integration
-to load and use the https://research.google/pubs/universal-sentence-encoder/[USE] https://www.kaggle.com/models/google/universal-sentence-encoder[model].
+also has link:++https://www.tensorflow.org/++[TensorFlow] integration
+which we'll use to load and exercise Google's https://research.google/pubs/universal-sentence-encoder/[USE] https://www.kaggle.com/models/google/universal-sentence-encoder[model].
The USE model also supports conceptual embeddings, so we'll use the same phrases as we did for AnglE.
@@ -962,7 +963,7 @@ var queryEmbeddings = predictor.predict(queries)
queryEmbeddings.eachWithIndex { s, i ->
println "\n ${queries[i]}"
sampleEmbeddings
- .collect { MathUtil.cosineSimilarity(it, s) }
+ .collect { cosineSimilarity(it, s) }
.withIndex()
.sort { -it.v1 }
.take(5)
@@ -970,7 +971,7 @@ queryEmbeddings.eachWithIndex { s, i ->
}
----
-Here is the output:
+Here is the output (library logging elided):
----
cow
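The ranking in the hunk above comes down to cosine similarity between embedding vectors: for each query embedding, the sample embeddings are scored, sorted, and the top 5 kept. Here is a minimal self-contained sketch of that computation, in plain Java for illustration (the class and method names are illustrative, not the blog's actual helper):

```java
// Illustrative sketch of cosine similarity between two embedding vectors,
// the measure used to rank the closest phrases for each query.
public class CosineSketch {
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // accumulate dot product
            normA += a[i] * a[i]; // and the squared norms of each vector
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] u = {1, 0, 1};
        float[] v = {1, 1, 0};
        System.out.printf("%.2f%n", cosineSimilarity(u, v)); // prints 0.50
    }
}
```

Vectors pointing the same way score 1.0 and orthogonal vectors score 0.0, which is why the scores read naturally as percentages.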
@@ -1140,13 +1141,21 @@ ConceptNet stays around 0% for such words and can even go negative.
* Different models do better in different situations at recognizing similarity, i.e. there is no perfect model that seems to always outperform the others.
-Looking at these results, if we were doing a production ready game, we'd just pick ConceptNet, and
-we'd probably look for an English only model since the multilingual one takes the longest of all 5
+Looking at these results, if we were doing a production-ready game, we'd just pick one model, probably ConceptNet,
+and we'd probably look for an English-only model since the multilingual one takes the longest of all 5
models to load. But given the educational tone of this post, [fuchsia]#we'll include the semantic similarity measure from all 5 models in our game#.
== Playing the game
+The game has a very simple text UI. It runs in your operating system's shell, command, or console window,
+or within your IDE. The game picks a random word of undisclosed length. You are given 30 rounds to guess the hidden word.
+For each round, you can enter one word, and you will be given numerous metrics about how similar
+your guess is to the hidden word. You will be given some hints if you take too long (more on that later).
+
+Let's see what some rounds of play look like, and we'll give you some commentary on
+the thinking behind our guesses in those rounds.
+
=== Round 1
There are lists of long words with unique letters. One that is often useful is `aftershock`.
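Since all 5 models stay in the game, each guess gets one "Meaning" feedback line listing a similarity percentage per model. As a hypothetical sketch of that kind of formatting (the class, method, and map keys here are made up for illustration, not the game's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: joining per-model similarity scores into one
// feedback line, in the spirit of the game's "Meaning" output.
public class MeaningFeedback {
    static String format(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .map(e -> String.format("%s %.0f%%", e.getKey(), e.getValue() * 100))
                .collect(Collectors.joining(" / "));
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>(); // keeps model order
        scores.put("AnglE", 1.0);
        scores.put("USE", 0.87);
        System.out.println(format(scores)); // prints AnglE 100% / USE 87%
    }
}
```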
@@ -1311,7 +1320,6 @@ Meaning AnglE 100% / Use 100% / ConceptNet 100% / GloVe 1
Congratulations, you guessed correctly!
----
-
=== Round 3
----