This is to announce the public availability of word embedding models calculated
for the large corpora that we have in Sketch Engine. At this moment, we have
processed corpora for the following languages:
English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean,
Portuguese, Russian, Spanish
See https://embeddings.sketchengine.co.uk/ where you can find an online
interface for executing word similarity queries (such as the infamous
king-man+woman) and download the datasets. They can be used freely for
non-commercial purposes; for commercial use, do not hesitate to get in
touch with me to work out a mutually suitable model of collaboration.
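For readers unfamiliar with the king-man+woman query, here is a minimal sketch of the arithmetic behind it. The vectors are three-dimensional toy values invented for illustration, not taken from the published datasets, and the helper names (cosine, analogy) are hypothetical. It is a simplified version of what similarity tools typically do and is not a description of how the online interface is implemented.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration; real
# embeddings have far more dimensions and are learned from a corpus.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative).

    The query words themselves are excluded from the candidates,
    which is the usual convention for this kind of query.
    """
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# king - man + woman
best = analogy(TOY_VECTORS, positive=["king", "woman"], negative=["man"])
print(best[0][0])  # prints "queen" for these toy vectors
```

With real embeddings the same idea applies, only the vocabulary and the dimensionality are much larger, so an indexed nearest-neighbour search would normally replace the linear scan used here.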
We continue building further models as our spare computing capacity allows,
and will continue publishing them. If you are interested in a particular
language that is missing at this moment, let me know and I can try to
prioritise it (no guarantees, though).
The embeddings were calculated using FastText with various parameters and
on various corpus attributes (word, lemma, lemma+PoS combination, lowercase
We have had an increasing number of requests to obtain corpora from Sketch
Engine for these purposes, so this release is our response, intended to
support research in this area.
CEO, Lexical Computing
Brno, CZ | Brighton, UK
Corpora mailing list