This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new c21ec50 Add details on source of word2vec models
c21ec50 is described below
commit c21ec50bf97c6bd296d9d723bf8353d08acb6e1d
Author: James King <[email protected]>
AuthorDate: Tue Feb 18 19:53:16 2025 +1000
Add details on source of word2vec models
---
site/src/site/blog/groovy-text-similarity.adoc | 50 +++++++++++++++-----------
1 file changed, 29 insertions(+), 21 deletions(-)
diff --git a/site/src/site/blog/groovy-text-similarity.adoc b/site/src/site/blog/groovy-text-similarity.adoc
index d262508..419ed5b 100644
--- a/site/src/site/blog/groovy-text-similarity.adoc
+++ b/site/src/site/blog/groovy-text-similarity.adoc
@@ -650,19 +650,27 @@ model is trained and optimized for greater-than-word length text, such as senten
https://www.tensorflow.org/[TensorFlow].
We'll use the https://djl.ai/[Deep Java Library] to load and use both of these
models on the JDK.
-=== GloVe
+=== FastText
+Many Word2Vec libraries can read and write models to a standard file format based on the original https://github.com/tmikolov/word2vec[Google implementation].
-The
-https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J]
-library makes it easy to use https://nlp.stanford.edu/projects/glove/[GloVe] models. The https://huggingface.co/fse/glove-wiki-gigaword-300[model we used] is pre-trained on
-2B tweets, 27B tokens, 1.2M vocab, uncased.
+For example, we can download some of the pre-trained models from the https://radimrehurek.com/gensim/models/word2vec.html[Gensim Python library], convert them to the Word2Vec format, and then open them with https://deeplearning4j.konduit.ai/deeplearning4j/reference/word2vec-glove-doc2vec[DeepLearning4J].
-We simply serialize the model into our word2vec representation,
+[source,python]
+----
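+# download the pre-trained vectors via gensim's downloader, then re-save
+# them in the binary word2vec format that DeepLearning4J can read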
+import gensim.downloader
+vectors = gensim.downloader.load('fasttext-wiki-news-subwords-300')
+vectors.save_word2vec_format("fasttext-wiki-news-subwords-300.bin", binary=True)
+----
+
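+As a quick aside, the word2vec binary format begins with an ASCII header line
+giving the vocabulary size and vector dimensionality, followed by the raw float
+vectors. A hypothetical sanity check of the converted file (not part of the
+original post) could peek at that header:
+
+[source,groovy]
+----
+// read the ASCII header line "<vocabSize> <vectorSize>" from the binary file
+new File('fasttext-wiki-news-subwords-300.bin').withInputStream { is ->
+    var header = new StringBuilder()
+    int b
+    while ((b = is.read()) != 10) header.append((char) b) // stop at '\n'
+    println "vocab size and dimensions: $header"
+}
+----
+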
+https://huggingface.co/fse/fasttext-wiki-news-subwords-300[This model] has
+1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
+
+We simply serialize the https://fasttext.cc/[FastText] model into our Word2Vec representation,
and can then call methods like `similarity` and `wordsNearest` as shown here:
[source,groovy]
----
-var modelName = 'glove-wiki-gigaword-300.bin'
+var modelName = 'fasttext-wiki-news-subwords-300.bin'
var path = Paths.get(ConceptNet.classLoader.getResource(modelName).toURI()).toFile()
Word2Vec model = WordVectorSerializer.readWord2VecModel(path)
String[] words = ['bull', 'calf', 'bovine', 'cattle', 'livestock', 'horse']
@@ -679,8 +687,8 @@ Nearest words in vocab: ${model.wordsNearest('cow', 4)}
Which gives this output:
----
-GloVe similarity to cow: [bovine:0.67, cattle:0.62, livestock:0.47, calf:0.44, horse:0.42, bull:0.38]
-Nearest words in vocab: [cows, mad, bovine, cattle]
+FastText similarity to cow: [bovine:0.72, cattle:0.70, calf:0.67, bull:0.67, livestock:0.61, horse:0.60]
+Nearest words in vocab: [cows, goat, pig, bovine]
----
We have numerous options available to us to visualize these kinds of results.
@@ -688,28 +696,28 @@ We could use the bar-charts we used previously, or something like a heat-map:
image:img/AnimalSemanticSimilarity.png[animal semantic similarity,width=60%]
-=== FastText
+Groupings of similar words can be seen as the larger orange and red regions. We can also quickly check
+that there is a stronger relationship between `cow` and `milk` vs `cow` and `water`.
+
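+Heat-maps like this one are driven by a matrix of pairwise similarity scores,
+and pointwise checks such as the `cow`/`milk` one use the same `similarity` call.
+As a minimal sketch (the word list here is illustrative, not necessarily the one
+behind the image), we could build such a matrix with the model loaded earlier:
+
+[source,groovy]
+----
+// build the pairwise similarity matrix over a small illustrative word list
+var animals = ['cow', 'bull', 'calf', 'bovine', 'cattle', 'milk', 'water']
+var matrix = animals.collect { w1 ->
+    animals.collect { w2 -> model.similarity(w1, w2) }
+}
+// print each row with two-decimal scores
+animals.eachWithIndex { w, i ->
+    printf '%-8s %s%n', w, matrix[i].collect { String.format('%.2f', it) }.join(' ')
+}
+----
+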
+=== GloVe
-We can swap to a https://fasttext.cc/[FastText] model, simply by switching to that model:
+We can swap to a https://nlp.stanford.edu/projects/glove/[GloVe] model simply by changing the file we read from:
[source,groovy]
----
-var modelName = 'fasttext-wiki-news-subwords-300.bin'
+var modelName = 'glove-wiki-gigaword-300.bin'
----
-We used https://huggingface.co/fse/fasttext-wiki-news-subwords-300[this model] which has
-1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
-
-When run with the FastText model, the script has this output:
+With this model, the script produces the following output:
----
-FastText similarity to cow: [bovine:0.72, cattle:0.70, calf:0.67, bull:0.67, livestock:0.61, horse:0.60]
-Nearest words in vocab: [cows, goat, pig, bovine]
+GloVe similarity to cow: [bovine:0.67, cattle:0.62, livestock:0.47, calf:0.44, horse:0.42, bull:0.38]
+Nearest words in vocab: [cows, mad, bovine, cattle]
----
-Again, we have numerous options to visualise the data returned from this model.
-Instead of look for nearest words or the similarity measure, we could return the actual word embeddings and then visualise
-those using principal component analysis (PCA), as shown here:
+Another way to visualise word embeddings is to display the word vectors as positions in space. However,
+the raw vectors contain far too many dimensions for us mere mortals to comprehend. We can use principal component
+analysis (PCA) to reduce the number of dimensions whilst capturing the most important information, as shown here:
image:img/AnimalSemanticMeaningPcaBubblePlot.png[principal component analysis of animal-related word embeddings,width=75%]
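+
+For the curious, a 2D projection like the one plotted above could be computed by
+hand. This is a hypothetical sketch, assuming Apache Commons Math on the classpath
+(not necessarily how the plot was produced) and the `model` loaded earlier:
+
+[source,groovy]
+----
+import org.apache.commons.math3.linear.Array2DRowRealMatrix
+import org.apache.commons.math3.linear.EigenDecomposition
+import org.apache.commons.math3.stat.correlation.Covariance
+
+var animals = ['cow', 'bull', 'calf', 'bovine', 'cattle', 'horse']
+var data = new Array2DRowRealMatrix(animals.collect { model.getWordVector(it) } as double[][])
+
+// centre each column, since PCA operates on mean-centred data
+(0..<data.columnDimension).each { c ->
+    var col = data.getColumn(c)
+    var mean = col.sum() / col.length
+    data.setColumn(c, col.collect { it - mean } as double[])
+}
+
+// commons-math orders the eigenvalues of a symmetric matrix in descending
+// order, so the first two eigenvectors span the best two-dimensional projection
+var eig = new EigenDecomposition(new Covariance(data.data).covarianceMatrix)
+var basis = new Array2DRowRealMatrix([eig.getEigenvector(0).toArray(),
+                                      eig.getEigenvector(1).toArray()] as double[][])
+var projected = data.multiply(basis.transpose())
+animals.eachWithIndex { w, i ->
+    printf '%-8s (%6.2f, %6.2f)%n', w, projected.getEntry(i, 0), projected.getEntry(i, 1)
+}
+----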