Dear Ayah,

I asked my colleagues and apparently yes, the tagger removes all diacritics.
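
In practice this means a query has to be normalized the same way before it is looked up. Below is a minimal sketch of such a normalization, not the tagger's actual code: the function name `normalize_query` and the exact character set are assumptions, namely that the short-vowel marks (U+064B to U+0652) and the dagger alef (U+0670) are stripped and that alef carrying hamza or madda is folded to bare alef.

```python
import re

# Assumption: these are the marks that get stripped (fathatan..sukun, dagger alef).
# The real tagger may cover a different set.
_HARAKAT = re.compile("[\u064B-\u0652\u0670]")

# Assumption: alef with hamza above/below and alef with madda fold to bare alef.
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda above -> alef
})


def normalize_query(word: str) -> str:
    """Strip short-vowel marks, then fold hamza-carrying alefs to bare alef."""
    return _HARAKAT.sub("", word).translate(_ALEF_VARIANTS)
```

Under these assumptions, normalize_query("أحمد") gives احمد, which is the form that returned results in your search.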


Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton, UK

On 2 February 2018 at 18:11, Ayah Zirikly wrote:

> Hi Milos,
> Thank you for providing the pretrained word vectors. I am specifically
> interested in the Arabic version.
> I have a question regarding hamza handling: when searching for
> أحمد [Ahmad, or >Hmd in Buckwalter], I noticed that the results were empty,
> as opposed to searching for احمد without hamza. Did you normalize all hamza
> forms to regular alef?
> Thank you,
>  Ayah
> On Fri, Feb 2, 2018 at 9:07 AM, Miloš Jakubíček wrote:
>> Dear all,
>> this is to announce the public availability of word embedding models
>> calculated for the large corpora that we have in Sketch Engine. At this moment,
>> we have processed corpora for the following languages:
>> English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean,
>> Portuguese, Russian, Spanish
>> See where you can find an online
>> interface for executing word similarity queries (such as the infamous
>> king-man+woman) and download the datasets. They can be used freely for
>> non-commercial purposes; for commercial ones, do not hesitate to get
>> back to me to work out a mutually suitable model of collaboration.
>> We continue building further models as our spare computing capacity
>> allows, and will continue publishing them. If you are interested in a
>> particular language that is missing at this moment, let me know and I can
>> try to prioritise (no guarantees though).
>> The embeddings were calculated using FastText with various parameters and
>> on various corpus attributes (word, lemma, lemma+PoS combination, lowercase
>> etc.)
>> We have had an increasing number of requests to obtain corpora from Sketch
>> Engine for these purposes, so this is our response, intended to support
>> research in this area.
>> Cheers,
>> Milos Jakubicek
>> CEO, Lexical Computing
>> Brno, CZ | Brighton, UK
>> _______________________________________________
>> UNSUBSCRIBE from this page:
>> Corpora mailing list