This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e9b0ef4 expand AnglE section
e9b0ef4 is described below
commit e9b0ef41527a1496556569ac164eb0f4e755b960
Author: Paul King <[email protected]>
AuthorDate: Tue Feb 18 17:01:25 2025 +1000
expand AnglE section
---
site/src/site/blog/groovy-text-similarity.adoc | 111 +++++++++++++++++++------
1 file changed, 86 insertions(+), 25 deletions(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc
b/site/src/site/blog/groovy-text-similarity.adoc
index 21e8e2c..c5010cb 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -99,7 +99,7 @@ Then we'll look at some libraries for phonetic matching:
Then we'll look at some deep learning options for semantic matching:
* `org.deeplearning4j:deeplearning4j-nlp` for GloVe, ConceptNet, and FastText
models
-* `ai.djl` with Pytorch for a universal-sentence-encoder model and Tensorflow
with an Angle model
+* `ai.djl` with PyTorch for a universal-sentence-encoder model and TensorFlow
with an AnglE model
== Simple String Metrics
@@ -618,8 +618,11 @@ Related words tend to cluster in similar positions within
that space.
Typically rule-based, statistical, or neural-based approaches are used to
perform the embedding
and distance measures like
https://en.wikipedia.org/wiki/Cosine_similarity[cosine similarity]
are used to find related words (or phrases).
-We won't go into further NLP theory in any great detail, but we'll give some
brief
-explanation as we go. We'll look at several models and split them into two
groups:
+
+Large Language Model (LLM) researchers call the kind of matching task we are
doing here
+a semantic textual similarity (STS) task. We won't go into further NLP theory
in any great detail,
+but we'll give some brief explanation as we go.
+We'll look at several models and split them into two groups:
* Context-independent approaches focus on embeddings that
are applicable in all contexts (very roughly).
@@ -640,7 +643,7 @@ sentence embeddings and is used in conjunction with
https://pytorch.org/[PyTorch
Google's
https://www.kaggle.com/models/google/universal-sentence-encoder[Universal
Sentence Encoder]
model is trained and optimized for greater-than-word length text, such as
sentences, phrases or short paragraphs, and is used in conjunction with
https://www.tensorflow.org/[TensorFlow].
-We'll use the https://djl.ai/[Deep Java Library] to load and use both of these
models.
+We'll use the https://djl.ai/[Deep Java Library] to load and use both of these
models on the JDK.
=== GloVe
@@ -675,6 +678,11 @@ GloVe similarity to cow: [bovine:0.67, cattle:0.62,
livestock:0.47, calf:0.44, h
Nearest words in vocab: [cows, mad, bovine, cattle]
----
+We have numerous options for visualising these kinds of results.
+We could use the bar charts we used previously, or something like a heat-map:
+
+image:img/AnimalSemanticSimilarity.png[animal semantic similarity,width=60%]
+
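The heat-map above displays a matrix of pairwise similarities. As a rough sketch of how such a matrix might be computed (using made-up stand-in vectors here rather than the real GloVe embeddings, and a hand-rolled cosine helper):

```groovy
// Hypothetical stand-in vectors; real values would come from the GloVe model
Map<String, double[]> vecs = [
    cow   : [0.9, 0.8, 0.1] as double[],
    bull  : [0.8, 0.9, 0.2] as double[],
    kitten: [0.1, 0.2, 0.9] as double[]
]

// cosine similarity: dot(a, b) / (|a| * |b|)
double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0
    for (i in 0..<a.length) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    dot / Math.sqrt(na * nb)
}

// print the pairwise similarity matrix that a heat-map would colour-code
var words = vecs.keySet().toList()
printf '%8s', ''
words.each { printf '%8s', it }
println()
words.each { w1 ->
    printf '%8s', w1
    words.each { w2 -> printf '%8.2f', cosine(vecs[w1], vecs[w2]) }
    println()
}
```

The diagonal is always 1.00; off-diagonal cells drive the heat-map's colours.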
=== FastText
We can swap to a https://fasttext.cc/[FastText] model, simply by switching to
that model:
@@ -694,6 +702,12 @@ FastText similarity to cow: [bovine:0.72, cattle:0.70,
calf:0.67, bull:0.67, liv
Nearest words in vocab: [cows, goat, pig, bovine]
----
+Again, we have numerous options to visualise the data returned from this model.
+Instead of looking for the nearest words or a similarity measure, we could return
+the actual word embeddings and then visualise them
+using principal component analysis (PCA), as shown here:
+
+image:img/AnimalSemanticMeaningPcaBubblePlot.png[principal component analysis
of animal-related word embeddings,width=75%]
+
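One rough sketch of the PCA projection behind such a plot (assuming Apache Commons Math is on the classpath, and using made-up stand-in embeddings rather than the model's real vectors) is to centre the vectors and project onto the top two right-singular vectors from an SVD:

```groovy
import org.apache.commons.math3.linear.Array2DRowRealMatrix
import org.apache.commons.math3.linear.SingularValueDecomposition

// hypothetical stand-in embeddings; real values would come from the model
Map<String, double[]> embeddings = [
    cow : [0.9, 0.8, 0.1] as double[],
    bull: [0.8, 0.9, 0.2] as double[],
    cat : [0.1, 0.2, 0.9] as double[],
    dog : [0.2, 0.3, 0.8] as double[]
]

var words = embeddings.keySet().toList()
var dims = embeddings[words[0]].length
var data = words.collect { embeddings[it] } as double[][]

// centre each column (subtract the column mean)
var means = (0..<dims).collect { c -> data.collect { it[c] }.sum() / data.length }
var centred = data.collect { row ->
    (0..<dims).collect { c -> row[c] - means[c] } as double[]
} as double[][]

// project onto the first two principal components via SVD
var m = new Array2DRowRealMatrix(centred)
var svd = new SingularValueDecomposition(m)
var projected = m.multiply(svd.V.getSubMatrix(0, dims - 1, 0, 1))

words.eachWithIndex { w, i ->
    printf '%-6s (%6.3f, %6.3f)%n', w, projected.getEntry(i, 0), projected.getEntry(i, 1)
}
```

The resulting 2D coordinates are what get plotted; related words should land near one another.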
=== ConceptNet
Similarly, we can switch to a ConceptNet model through a change of the model
name.
@@ -798,7 +812,58 @@ if you speak multiple languages or if you were learning a
new language.
=== AnglE
-Using DJL with PyTorch and the AnglE model:
+We'll use the https://djl.ai/[Deep Java Library]
+and its https://pytorch.org/[PyTorch] integration
+to load and use the https://github.com/SeanLee97/AnglE[AnglE]
https://huggingface.co/WhereIsAI/UAE-Large-V1[model].
+This model has won various state-of-the-art (SOTA) awards in the STS field.
+
+Unlike the previous models, AnglE supports contextual embeddings.
+Therefore, we can feed in phrases, not just words, and it will match up similar
phrases,
+taking into account contextual information about how each word is used in its
phrase.
+
+So, let's take a set of sample phrases and, for some sample queries,
+try to find the sample phrases closest in similarity to each query.
+The code looks like this:
+
+[source,groovy]
+----
+import ai.djl.repository.zoo.Criteria
+import ai.djl.training.util.ProgressBar
+import ai.djl.translate.DeferredTranslatorFactory
+import java.nio.file.Paths
+
+var samplePhrases = [
+ 'bull', 'bovine', 'kitten', 'hay', 'The sky is blue',
+ 'The sea is blue', 'The grass is green', 'One two three',
+ 'Bulls consume hay', 'Bovines convert grass to milk',
+ 'Dogs play in the grass', 'Bulls trample grass',
+ 'Dachshunds are delightful', 'I like cats and dogs']
+
+var queries = [
+ 'cow', 'cat', 'dog', 'grass', 'Cows eat grass',
+ 'Poodles are cute', 'The water is turquoise']
+
+var modelName = 'UAE-Large-V1.zip'
+var path = Paths.get(DjlPytorchAngle.classLoader.getResource(modelName).toURI())
+var criteria = Criteria.builder()
+ .setTypes(String, float[])
+ .optModelPath(path)
+ .optTranslatorFactory(new DeferredTranslatorFactory())
+ .optProgress(new ProgressBar())
+ .build()
+
+var model = criteria.loadModel()
+var predictor = model.newPredictor()
+var sampleEmbeddings = samplePhrases.collect(predictor::predict)
+
+queries.each { query ->
+ println "\n $query"
+ var queryEmbedding = predictor.predict(query)
+ sampleEmbeddings
+ .collect { cosineSimilarity(it, queryEmbedding) }
+ .withIndex()
+ .sort { -it.v1 }
+ .take(5)
+ .each { printf '%s (%4.2f)%n', samplePhrases[it.v2], it.v1 }
+}
+----
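The listing above calls a `cosineSimilarity` helper that isn't shown. A minimal version for the `float[]` embeddings returned by the predictor might look like this (an assumed implementation, not necessarily the one used in the blog's source):

```groovy
// cosine similarity between two embedding vectors:
// dot(a, b) / (|a| * |b|)
double cosineSimilarity(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0
    for (i in 0..<a.length) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

It returns values in [-1, 1], with 1 meaning the vectors point in the same direction.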
+
+For each query, we find the 5 closest phrases from our sample phrases.
+When run, the results look like this:
----
cow
@@ -810,17 +875,17 @@ Bulls consume hay (0.56)
cat
kitten (0.82)
+I like cats and dogs (0.70)
bull (0.63)
bovine (0.60)
One two three (0.59)
-hay (0.55)
dog
bull (0.69)
bovine (0.68)
+I like cats and dogs (0.60)
kitten (0.58)
Dogs play in the grass (0.58)
-Dachshunds are delightful (0.55)
grass
The grass is green (0.83)
@@ -839,9 +904,9 @@ bovine (0.62)
Poodles are cute
Dachshunds are delightful (0.63)
Dogs play in the grass (0.56)
+I like cats and dogs (0.55)
bovine (0.49)
The grass is green (0.44)
-kitten (0.44)
The water is turquoise
The sea is blue (0.72)
@@ -906,10 +971,6 @@ kitten (0.17)
One two three (0.17)
----
-image:img/AnimalSemanticSimilarity.png[]
-
-image:img/AnimalSemanticMeaningPcaBubblePlot.png[]
-
=== Comparing Algorithm Choices
@@ -1028,7 +1089,7 @@ Levenshtein Distance: 10, Insert: 0,
Delete: 3, Substitute: 7
Jaccard 0%
JaroWinkler PREFIX 0% / SUFFIX 0%
Phonetic Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning Angle 45% / Use 21% / ConceptNet 2% / GloVe -4%
/ FastText 19%
+Meaning AnglE 45% / Use 21% / ConceptNet 2% / GloVe -4%
/ FastText 19%
----
It looks like we really bombed out, but in fact this is good news. What did we
learn:
@@ -1056,7 +1117,7 @@ Levenshtein Distance: 6, Insert: 2,
Delete: 0, Substitute: 4
Jaccard 22%
JaroWinkler PREFIX 56% / SUFFIX 45%
Phonetic Metaphone=FRT 39% / Soundex=F630 0%
-Meaning Angle 64% / Use 41% / ConceptNet 37% / GloVe
31% / FastText 44%
+Meaning AnglE 64% / Use 41% / ConceptNet 37% / GloVe
31% / FastText 44%
----
What did we learn?
@@ -1084,7 +1145,7 @@ Levenshtein Distance: 1, Insert: 0,
Delete: 0, Substitute: 1
Jaccard 71%
JaroWinkler PREFIX 90% / SUFFIX 96%
Phonetic Metaphone=BTNK 79% / Soundex=B352 75%
-Meaning Angle 52% / Use 35% / ConceptNet 2% / GloVe 4%
/ FastText 25%
+Meaning AnglE 52% / Use 35% / ConceptNet 2% / GloVe 4%
/ FastText 25%
----
We have 6 letters right in a row and 5 of the 6 distinct letters.
@@ -1100,7 +1161,7 @@ Levenshtein Distance: 0, Insert: 0,
Delete: 0, Substitute: 0
Jaccard 100%
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=PTNK 100% / Soundex=P352 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
+Meaning AnglE 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1115,7 +1176,7 @@ Levenshtein Distance: 7, Insert: 4,
Delete: 0, Substitute: 3
Jaccard 22%
JaroWinkler PREFIX 42% / SUFFIX 46%
Phonetic Metaphone=BL 38% / Soundex=B400 25%
-Meaning Angle 46% / Use 40% / ConceptNet 0% / GloVe 0%
/ FastText 31%
+Meaning AnglE 46% / Use 40% / ConceptNet 0% / GloVe 0%
/ FastText 31%
----
* Since LCS is 1, [fuchsia]#the letters shared with the hidden word are in the
reverse order#.
* There were 4 inserts and 0 deletes which means [fuchsia]#the hidden word has
8 letters#.
@@ -1128,7 +1189,7 @@ Levenshtein Distance: 6, Insert: 5,
Delete: 0, Substitute: 1
Jaccard 25%
JaroWinkler PREFIX 47% / SUFFIX 0%
Phonetic Metaphone=LK 38% / Soundex=L200 0%
-Meaning Angle 50% / Use 18% / ConceptNet 11% / GloVe
13% / FastText 37%
+Meaning AnglE 50% / Use 18% / ConceptNet 11% / GloVe
13% / FastText 37%
----
* Jaccard of 2 / 8 tells us [fuchsia]#two of the letters in 'leg' appear in
the hidden word#.
* LCS of 2 tells us that [fuchsia]#they appear in the same order as in the
hidden word#.
@@ -1145,7 +1206,7 @@ Levenshtein Distance: 8, Insert: 0,
Delete: 0, Substitute: 8
Jaccard 15%
JaroWinkler PREFIX 50% / SUFFIX 50%
Phonetic Metaphone=LNKX 34% / Soundex=L522 0%
-Meaning Angle 46% / Use 12% / ConceptNet -11% / GloVe
-4% / FastText 25%
+Meaning AnglE 46% / Use 12% / ConceptNet -11% / GloVe
-4% / FastText 25%
----
* 8 substitutions means [fuchsia]#none of the letters are in the same spot as
'languish'#.
@@ -1158,7 +1219,7 @@ Levenshtein Distance: 4, Insert: 0,
Delete: 0, Substitute: 4
Jaccard 40%
JaroWinkler PREFIX 83% / SUFFIX 75%
Phonetic Metaphone=ELKXN 50% / Soundex=E423 75%
-Meaning Angle 47% / Use 13% / ConceptNet -5% / GloVe
-7% / FastText 26%
+Meaning AnglE 47% / Use 13% / ConceptNet -5% / GloVe
-7% / FastText 26%
----
* Jaccard tells us we have 4 distinct letters shared with the hidden word and
yet we have a LCS of 5. [fuchsia]#The duplicate 'E' must be correct and the
order of all correct letters must match the hidden word.#
* Only 4 substitutions means [fuchsia]#8-4=4 letters are in the correct
position#.
@@ -1174,7 +1235,7 @@ Levenshtein Distance: 0, Insert: 0,
Delete: 0, Substitute: 0
Jaccard 100%
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=ELFTR 100% / Soundex=E413 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
+Meaning AnglE 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
Congratulations, you guessed correctly!
----
@@ -1190,7 +1251,7 @@ Levenshtein Distance: 8, Insert: 0,
Delete: 4, Substitute: 4
Jaccard 50%
JaroWinkler PREFIX 61% / SUFFIX 49%
Phonetic Metaphone=AFTRXK 33% / Soundex=A136 25%
-Meaning Angle 44% / Use 11% / ConceptNet -7% / GloVe 1%
/ FastText 15%
+Meaning AnglE 44% / Use 11% / ConceptNet -7% / GloVe 1%
/ FastText 15%
----
What do we know?
@@ -1212,7 +1273,7 @@ Levenshtein Distance: 4, Insert: 0,
Delete: 0, Substitute: 4
Jaccard 57%
JaroWinkler PREFIX 67% / SUFFIX 67%
Phonetic Metaphone=KRS 74% / Soundex=C620 75%
-Meaning Angle 51% / Use 12% / ConceptNet 5% / GloVe 23%
/ FastText 26%
+Meaning AnglE 51% / Use 12% / ConceptNet 5% / GloVe 23%
/ FastText 26%
----
This tells us:
@@ -1233,7 +1294,7 @@ Levenshtein Distance: 6, Insert: 0,
Delete: 0, Substitute: 6
Jaccard 67%
JaroWinkler PREFIX 56% / SUFFIX 56%
Phonetic Metaphone=RSTS 61% / Soundex=R232 25%
-Meaning Angle 54% / Use 25% / ConceptNet 18% / GloVe
18% / FastText 31%
+Meaning AnglE 54% / Use 25% / ConceptNet 18% / GloVe
18% / FastText 31%
----
We learned:
@@ -1253,7 +1314,7 @@ Levenshtein Distance: 0, Insert: 0,
Delete: 0, Substitute: 0
Jaccard 100%
JaroWinkler PREFIX 100% / SUFFIX 100%
Phonetic Metaphone=KRT 100% / Soundex=C630 100%
-Meaning Angle 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
+Meaning AnglE 100% / Use 100% / ConceptNet 100% / GloVe
100% / FastText 100%
Congratulations, you guessed correctly!
----