Dear Ayah,

I asked my colleagues and apparently yes, the tagger removes all diacritics.

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On 2 February 2018 at 18:11, Ayah Zirikly <aya.zeri...@gmail.com> wrote:

> Hi Milos,
>
> Thank you for providing the pretrained word vectors. I am specifically
> interested in the Arabic version.
> I have a question in regards to Hamza manipulation, I noticed when
> searching for أحمد [Ahmad or >Hmd in Buckwalter] the results were empty as
> opposed to using احمد without hamza. Did you normalize all the hamza to
> regular alef?
>
> Thank you,
> Ayah
>
> On Fri, Feb 2, 2018 at 9:07 AM, Miloš Jakubíček <milos.jakubi...@sketchengine.co.uk> wrote:
>
>> Dear all,
>>
>> this is to announce public availability of word embedding model
>> calculated for large corpora that we have in Sketch Engine. At this moment,
>> we have processed corpora for following languages:
>>
>> English, Arabic, Chinese, Czech, Danish, French, German, Italian, Korean,
>> Portuguese, Russian, Spanish
>>
>> See https://embeddings.sketchengine.co.uk/ where you can find an online
>> interface for executing word similarity queries (such as the infamous
>> king-man+woman) and download the datasets. They can be used freely for
>> non-commercial purposes, for the commercial ones do not hesitate to get
>> back to me to work out a mutually suitable model of collaboration.
>>
>> We continue building further models as our spare computing capacity
>> allows, and will continue publishing them. If you are interested in a
>> particular language that is missing at this moment, let me know and I can
>> try to prioritise (no guarantees though).
>>
>> The embeddings were calculated using FastText with various parameters and
>> on various corpus attributes (word, lemma, lemma+PoS combination, lowercase
>> etc.)
>>
>> We have had increasing amount of requests to obtain corpora from Sketch
>> Engine for these purposes, so this is our response to that to support
>> research in this area.
>>
>> Cheers,
>> Milos Jakubicek
>>
>> CEO, Lexical Computing
>> Brno, CZ | Brighton, UK
>> http://www.lexicalcomputing.com
>> http://www.sketchengine.co.uk
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora@uib.no
>> https://mailman.uib.no/listinfo/corpora
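For anyone querying the Arabic model, the normalization mentioned in the reply can be approximated on the query side. The sketch below is only an assumption about what "removes all diacritics" amounts to: it decomposes the text canonically and drops every Unicode combining mark, which also turns أ (alef with hamza above) into plain ا. The tagger's actual implementation is not described in this thread, and the function name `strip_marks` is hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop all combining marks after canonical decomposition.

    Assumption: this mimics the tagger's diacritic removal; the real
    tagger may behave differently.
    """
    # NFD splits precomposed letters, e.g. U+0623 -> U+0627 + U+0654.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the short vowels, shadda, sukun and the
    # combining hamza/madda marks produced by the decomposition.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    query = "\u0623\u062d\u0645\u062f"  # أحمد, as typed in the question
    # Normalize the query before looking it up in the embedding model.
    print(strip_marks(query))
```

Note that under this assumption ؤ and ئ would also lose their hamza (becoming و and ي), while the standalone hamza ء is left untouched because it is a letter, not a combining mark.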