This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new e9b0ef4  expand AnglE section
e9b0ef4 is described below

commit e9b0ef41527a1496556569ac164eb0f4e755b960
Author: Paul King <[email protected]>
AuthorDate: Tue Feb 18 17:01:25 2025 +1000

    expand AnglE section
---
 site/src/site/blog/groovy-text-similarity.adoc | 111 +++++++++++++++++++------
 1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/site/src/site/blog/groovy-text-similarity.adoc 
b/site/src/site/blog/groovy-text-similarity.adoc
index 21e8e2c..c5010cb 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -99,7 +99,7 @@ Then we'll look at some libraries for phonetic matching:
 Then we'll look at some deep learning options for semantic matching:
 
 * `org.deeplearning4j:deeplearning4j-nlp` for GloVe, ConceptNet, and FastText 
models
-* `ai.djl` with Pytorch for a universal-sentence-encoder model and Tensorflow with an Angle model
+* `ai.djl` with PyTorch for a universal-sentence-encoder model and TensorFlow with an AnglE model
 
 == Simple String Metrics
 
@@ -618,8 +618,11 @@ Related words tend to cluster in similar positions within 
that space.
 Typically rule-based, statistical, or neural-based approaches are used to 
perform the embedding
 and distance measures like 
https://en.wikipedia.org/wiki/Cosine_similarity[cosine similarity]
 are used to find related words (or phrases).
-We won't go into further NLP theory in any great detail, but we'll give some 
brief
-explanation as we go. We'll look at several models and split them into two 
groups:
+
+Large Language Model (LLM) researchers call the kind of matching tasks we are 
doing here
+semantic textual similarity (STS) tasks. We won't go into further NLP theory 
in any great detail,
+but we'll give some brief explanation as we go.
+We'll look at several models and split them into two groups:
 
 * Context-independent approaches focus on embeddings that
 are applicable in all contexts (very roughly).
@@ -640,7 +643,7 @@ sentence embeddings and is used in conjunction with 
https://pytorch.org/[PyTorch
 Google's 
https://www.kaggle.com/models/google/universal-sentence-encoder[Universal 
Sentence Encoder]
 model is trained and optimized for greater-than-word length text, such as 
sentences, phrases or short paragraphs, and is used in conjunction with
 https://www.tensorflow.org/[TensorFlow].
-We'll use the https://djl.ai/[Deep Java Library] to load and use both of these 
models.
+We'll use the https://djl.ai/[Deep Java Library] to load and use both of these 
models on the JDK.
 
 === GloVe
 
@@ -675,6 +678,11 @@ GloVe similarity to cow: [bovine:0.67, cattle:0.62, 
livestock:0.47, calf:0.44, h
 Nearest words in vocab: [cows, mad, bovine, cattle]
 ----
 
+We have numerous options available to us to visualize these kinds of results.
+We could use the bar-charts we used previously, or something like a heat-map:
+
+image:img/AnimalSemanticSimilarity.png[animal semantic similarity,width=60%]
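As a rough stand-in for such a heat-map, we could shade a matrix of pairwise similarity scores as text. In this sketch, the cow/bovine and cow/cattle numbers echo the GloVe output shown earlier; the remaining values are made up for illustration:

```groovy
// Pairwise similarity scores: cow/bovine (0.67) and cow/cattle (0.62)
// echo the GloVe output above; the other values are hypothetical.
var words = ['cow', 'bovine', 'cattle', 'kitten']
var sim = [[1.00, 0.67, 0.62, 0.21],
           [0.67, 1.00, 0.70, 0.18],
           [0.62, 0.70, 1.00, 0.15],
           [0.21, 0.18, 0.15, 1.00]]

var shades = ' ░▒▓█'   // darker means more similar
var rendered = [:]
words.eachWithIndex { word, i ->
    rendered[word] = sim[i].collect { s -> shades[(s * (shades.size() - 1)).toInteger()] }.join(' ')
}
rendered.each { word, row -> printf '%-8s %s%n', word, row }
```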
+
 === FastText
 
 We can swap to a https://fasttext.cc/[FastText] model, simply by switching to 
that model:
@@ -694,6 +702,12 @@ FastText similarity to cow: [bovine:0.72, cattle:0.70, 
calf:0.67, bull:0.67, liv
 Nearest words in vocab: [cows, goat, pig, bovine]
 ----
 
+Again, we have numerous options to visualise the data returned from this model.
+Instead of looking for nearest words or computing similarity measures, we could return
+the actual word embeddings and then visualise those using principal component analysis (PCA), as shown here:
+
+image:img/AnimalSemanticMeaningPcaBubblePlot.png[principal component analysis of animal-related word embeddings,width=75%]
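For readers curious what PCA is doing with the embeddings, here is a hypothetical, library-free sketch that projects vectors onto their first two principal components using power iteration. The toy vectors stand in for real word embeddings; an actual script would more likely use a maths library:

```groovy
// Center the vectors, then find principal components by power iteration.
double[][] centered(List rows) {
    int n = rows.size(), d = rows[0].size()
    var mean = (0..<d).collect { j -> rows.sum { it[j] } / n }
    rows.collect { r -> (0..<d).collect { j -> r[j] - mean[j] } as double[] } as double[][]
}

double[] principalComponent(double[][] x, int iters = 200) {
    int d = x[0].length
    var rnd = new Random(42)
    double[] v = (0..<d).collect { rnd.nextGaussian() } as double[]
    iters.times {
        double[] next = new double[d]           // next = Xᵀ X v
        x.each { row ->
            double proj = (0..<d).sum { row[it] * v[it] }
            (0..<d).each { next[it] += proj * row[it] }
        }
        double norm = Math.sqrt((0..<d).sum { next[it] * next[it] })
        v = (0..<d).collect { next[it] / norm } as double[]
    }
    v
}

// remove the first component's contribution so iteration finds the second
double[][] deflate(double[][] x, double[] v) {
    x.collect { row ->
        double proj = (0..<row.length).sum { row[it] * v[it] }
        (0..<row.length).collect { i -> row[i] - proj * v[i] } as double[]
    } as double[][]
}

var toyEmbeddings = [[2.0, 0.1, 1.0], [0.2, 1.0, 0.1], [3.0, 1.1, 1.2], [1.0, 2.0, 0.2]]
var x = centered(toyEmbeddings)
var pc1 = principalComponent(x)
var pc2 = principalComponent(deflate(x, pc1))
x.eachWithIndex { row, i ->
    double px = (0..<row.length).sum { row[it] * pc1[it] }
    double py = (0..<row.length).sum { row[it] * pc2[it] }
    printf 'point %d -> (%6.2f, %6.2f)%n', i, px, py
}
```

The projected (px, py) pairs are what would be fed to a 2D scatter or bubble plot.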
+
 === ConceptNet
 
 Similarly, we can switch to a ConceptNet model through a change of the model 
name.
@@ -798,7 +812,58 @@ if you speak multiple languages or if you were learning a 
new language.
 
 === AnglE
 
-Using DJL with PyTorch and the AnglE model:
+We'll use the https://djl.ai/[Deep Java Library]
+and its https://pytorch.org/[PyTorch] integration
+to load and use the https://github.com/SeanLee97/AnglE[AnglE] 
https://huggingface.co/WhereIsAI/UAE-Large-V1[model].
+This model has achieved various state-of-the-art (SOTA) results in the STS field.
+
+Unlike the previous models, AnglE supports contextual sentence embeddings.
+This means we can feed in whole phrases, not just words, and it will match up
+similar phrases, taking into account the context in which each word is used.
+
+So, let's define a set of sample phrases and, for some sample queries,
+try to find the closest of those phrases by similarity. The code looks like this:
+
+[source,groovy]
+----
+var samplePhrases = [
+    'bull', 'bovine', 'kitten', 'hay', 'The sky is blue',
+    'The sea is blue', 'The grass is green', 'One two three',
+    'Bulls consume hay', 'Bovines convert grass to milk',
+    'Dogs play in the grass', 'Bulls trample grass',
+    'Dachshunds are delightful', 'I like cats and dogs']
+
+var queries = [
+    'cow', 'cat', 'dog', 'grass', 'Cows eat grass',
+    'Poodles are cute', 'The water is turquoise']
+
+var modelName = 'UAE-Large-V1.zip'
+var path = Paths.get(DjlPytorchAngle.classLoader.getResource(modelName).toURI())
+var criteria = Criteria.builder()
+    .setTypes(String, float[])
+    .optModelPath(path)
+    .optTranslatorFactory(new DeferredTranslatorFactory())
+    .optProgress(new ProgressBar())
+    .build()
+
+var model = criteria.loadModel()
+var predictor = model.newPredictor()
+var sampleEmbeddings = samplePhrases.collect(predictor::predict)
+
+queries.each { query ->
+    println "\n    $query"
+    var queryEmbedding = predictor.predict(query)
+    sampleEmbeddings
+        .collect { cosineSimilarity(it, queryEmbedding) }
+        .withIndex()
+        .sort { -it.v1 }
+        .take(5)
+        .each { printf '%s (%4.2f)%n', samplePhrases[it.v2], it.v1 }
+}
+----
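The `cosineSimilarity` helper called above isn't part of this commit's diff. A minimal sketch, assuming `float[]` embeddings as returned by the predictor, might look like this (the dot product divided by the product of the vector magnitudes):

```groovy
// Cosine similarity of two embedding vectors: 1 means same direction,
// 0 means orthogonal (unrelated), -1 means opposite direction.
double cosineSimilarity(float[] a, float[] b) {
    assert a.length == b.length
    double dot = 0, normA = 0, normB = 0
    for (i in 0..<a.length) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

printf '%4.2f%n', cosineSimilarity([0.3f, 0.9f] as float[], [0.3f, 0.9f] as float[])
```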
+
+For each query, we find the 5 closest phrases from the sample phrases.
+When run, the results look like this:
 
 ----
     cow
@@ -810,17 +875,17 @@ Bulls consume hay (0.56)
 
     cat
 kitten (0.82)
+I like cats and dogs (0.70)
 bull (0.63)
 bovine (0.60)
 One two three (0.59)
-hay (0.55)
 
     dog
 bull (0.69)
 bovine (0.68)
+I like cats and dogs (0.60)
 kitten (0.58)
 Dogs play in the grass (0.58)
-Dachshunds are delightful (0.55)
 
     grass
 The grass is green (0.83)
@@ -839,9 +904,9 @@ bovine (0.62)
     Poodles are cute
 Dachshunds are delightful (0.63)
 Dogs play in the grass (0.56)
+I like cats and dogs (0.55)
 bovine (0.49)
 The grass is green (0.44)
-kitten (0.44)
 
     The water is turquoise
 The sea is blue (0.72)
@@ -906,10 +971,6 @@ kitten (0.17)
 One two three (0.17)
 ----
 
-image:img/AnimalSemanticSimilarity.png[]
-
-image:img/AnimalSemanticMeaningPcaBubblePlot.png[]
-
 
 === Comparing Algorithm Choices
 
@@ -1028,7 +1089,7 @@ Levenshtein                    Distance: 10, Insert: 0, 
Delete: 3, Substitute: 7
 Jaccard                        0%
 JaroWinkler                    PREFIX 0% / SUFFIX 0%
 Phonetic                       Metaphone=AFTRXK 47% / Soundex=A136 0%
-Meaning                        Angle 45% / Use 21% / ConceptNet 2% / GloVe -4% 
/ FastText 19%
+Meaning                        AnglE 45% / Use 21% / ConceptNet 2% / GloVe -4% 
/ FastText 19%
 ----
 
 It looks like we really bombed out, but in fact this is good news. What did we 
learn:
@@ -1056,7 +1117,7 @@ Levenshtein                    Distance: 6, Insert: 2, 
Delete: 0, Substitute: 4
 Jaccard                        22%
 JaroWinkler                    PREFIX 56% / SUFFIX 45%
 Phonetic                       Metaphone=FRT 39% / Soundex=F630 0%
-Meaning                        Angle 64% / Use 41% / ConceptNet 37% / GloVe 
31% / FastText 44%
+Meaning                        AnglE 64% / Use 41% / ConceptNet 37% / GloVe 
31% / FastText 44%
 ----
 
 What did we learn?
@@ -1084,7 +1145,7 @@ Levenshtein                    Distance: 1, Insert: 0, 
Delete: 0, Substitute: 1
 Jaccard                        71%
 JaroWinkler                    PREFIX 90% / SUFFIX 96%
 Phonetic                       Metaphone=BTNK 79% / Soundex=B352 75%
-Meaning                        Angle 52% / Use 35% / ConceptNet 2% / GloVe 4% 
/ FastText 25%
+Meaning                        AnglE 52% / Use 35% / ConceptNet 2% / GloVe 4% 
/ FastText 25%
 ----
 
 We have 6 letters right in a row and 5 of the 6 distinct letters.
@@ -1100,7 +1161,7 @@ Levenshtein                    Distance: 0, Insert: 0, 
Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=PTNK 100% / Soundex=P352 100%
-Meaning                        Angle 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
 
 Congratulations, you guessed correctly!
 ----
@@ -1115,7 +1176,7 @@ Levenshtein                    Distance: 7, Insert: 4, 
Delete: 0, Substitute: 3
 Jaccard                        22%
 JaroWinkler                    PREFIX 42% / SUFFIX 46%
 Phonetic                       Metaphone=BL 38% / Soundex=B400 25%
-Meaning                        Angle 46% / Use 40% / ConceptNet 0% / GloVe 0% 
/ FastText 31%
+Meaning                        AnglE 46% / Use 40% / ConceptNet 0% / GloVe 0% 
/ FastText 31%
 ----
 * Since LCS is 1, [fuchsia]#the letters shared with the hidden word are in the 
reverse order#.
 * There were 4 inserts and 0 deletes which means [fuchsia]#the hidden word has 
8 letters#.
@@ -1128,7 +1189,7 @@ Levenshtein                    Distance: 6, Insert: 5, 
Delete: 0, Substitute: 1
 Jaccard                        25%
 JaroWinkler                    PREFIX 47% / SUFFIX 0%
 Phonetic                       Metaphone=LK 38% / Soundex=L200 0%
-Meaning                        Angle 50% / Use 18% / ConceptNet 11% / GloVe 
13% / FastText 37%
+Meaning                        AnglE 50% / Use 18% / ConceptNet 11% / GloVe 
13% / FastText 37%
 ----
 * Jaccard of 2 / 8 tells us [fuchsia]#two of the letters in 'leg' appear in 
the hidden word#.
 * LCS of 2 tells us that [fuchsia]#they appear in the same order as in the 
hidden word#.
@@ -1145,7 +1206,7 @@ Levenshtein                    Distance: 8, Insert: 0, 
Delete: 0, Substitute: 8
 Jaccard                        15%
 JaroWinkler                    PREFIX 50% / SUFFIX 50%
 Phonetic                       Metaphone=LNKX 34% / Soundex=L522 0%
-Meaning                        Angle 46% / Use 12% / ConceptNet -11% / GloVe 
-4% / FastText 25%
+Meaning                        AnglE 46% / Use 12% / ConceptNet -11% / GloVe 
-4% / FastText 25%
 ----
 * 8 substitutions means [fuchsia]#none of the letters are in the same spot as 
'languish'#.
 
@@ -1158,7 +1219,7 @@ Levenshtein                    Distance: 4, Insert: 0, 
Delete: 0, Substitute: 4
 Jaccard                        40%
 JaroWinkler                    PREFIX 83% / SUFFIX 75%
 Phonetic                       Metaphone=ELKXN 50% / Soundex=E423 75%
-Meaning                        Angle 47% / Use 13% / ConceptNet -5% / GloVe 
-7% / FastText 26%
+Meaning                        AnglE 47% / Use 13% / ConceptNet -5% / GloVe 
-7% / FastText 26%
 ----
 * Jaccard tells us we have 4 distinct letters shared with the hidden word and 
yet we have a LCS of 5. [fuchsia]#The duplicate 'E' must be correct and the 
order of all correct letters must match the hidden word.#
 * Only 4 substitutions means [fuchsia]#8-4=4 letters are in the correct 
position#.
@@ -1174,7 +1235,7 @@ Levenshtein                    Distance: 0, Insert: 0, 
Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=ELFTR 100% / Soundex=E413 100%
-Meaning                        Angle 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
 
 Congratulations, you guessed correctly!
 ----
@@ -1190,7 +1251,7 @@ Levenshtein                    Distance: 8, Insert: 0, 
Delete: 4, Substitute: 4
 Jaccard                        50%
 JaroWinkler                    PREFIX 61% / SUFFIX 49%
 Phonetic                       Metaphone=AFTRXK 33% / Soundex=A136 25%
-Meaning                        Angle 44% / Use 11% / ConceptNet -7% / GloVe 1% 
/ FastText 15%
+Meaning                        AnglE 44% / Use 11% / ConceptNet -7% / GloVe 1% 
/ FastText 15%
 ----
 
 What do we know?
@@ -1212,7 +1273,7 @@ Levenshtein                    Distance: 4, Insert: 0, 
Delete: 0, Substitute: 4
 Jaccard                        57%
 JaroWinkler                    PREFIX 67% / SUFFIX 67%
 Phonetic                       Metaphone=KRS 74% / Soundex=C620 75%
-Meaning                        Angle 51% / Use 12% / ConceptNet 5% / GloVe 23% 
/ FastText 26%
+Meaning                        AnglE 51% / Use 12% / ConceptNet 5% / GloVe 23% 
/ FastText 26%
 ----
 
 This tells us:
@@ -1233,7 +1294,7 @@ Levenshtein                    Distance: 6, Insert: 0, 
Delete: 0, Substitute: 6
 Jaccard                        67%
 JaroWinkler                    PREFIX 56% / SUFFIX 56%
 Phonetic                       Metaphone=RSTS 61% / Soundex=R232 25%
-Meaning                        Angle 54% / Use 25% / ConceptNet 18% / GloVe 
18% / FastText 31%
+Meaning                        AnglE 54% / Use 25% / ConceptNet 18% / GloVe 
18% / FastText 31%
 ----
 
 We learned:
@@ -1253,7 +1314,7 @@ Levenshtein                    Distance: 0, Insert: 0, 
Delete: 0, Substitute: 0
 Jaccard                        100%
 JaroWinkler                    PREFIX 100% / SUFFIX 100%
 Phonetic                       Metaphone=KRT 100% / Soundex=C630 100%
-Meaning                        Angle 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
+Meaning                        AnglE 100% / Use 100% / ConceptNet 100% / GloVe 
100% / FastText 100%
 
 Congratulations, you guessed correctly!
 ----
