On 2020-06-18 07:59, Hèctor Alòs i Font wrote:
Message from Francis Tyers <fty...@prompsit.com> on Thu., 18 June 2020 at 1:59:

On 2020-06-17 21:46, Hèctor Alòs i Font wrote:
Message from Hèctor Alòs i Font <hectora...@gmail.com> on Wed., 17 June 2020 at 23:36:

Message from Francis Tyers <fty...@prompsit.com> on Wed., 17 June 2020 at 21:12:

On 2020-06-15 17:38, Hèctor Alòs i Font wrote:


...snip...


I'd add that one of the problems with that is that these synonyms may
be polysemous. For instance, "bubota" seems to be quite widely used in
Balearic Catalan, but can mean both "scarecrow" and "ghost". Probably
only one of the two meanings could be selected as a synonym if "bubota"
is missing from a bilingual dictionary.

Yep, this is the kind of thing that people are working on at the moment
with neural machine translation. For example, when translating informal
texts, how do you make sure that you get translations for "today",
"2day", "tooday", "tday", etc., as in:


https://www.clsp.jhu.edu/workshops/19-workshop/improving-translation-of-informal-language/
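To make the variant-matching problem concrete, here is a deliberately naive, non-neural sketch using the standard library's fuzzy matching. The vocabulary and the cutoff value are invented for illustration; real systems use learned models rather than edit-distance heuristics like this:

```python
# Naive sketch: map informal spellings onto a list of known standard
# forms with stdlib fuzzy matching (difflib). Vocabulary is invented.
from difflib import get_close_matches

VOCAB = ["today", "tomorrow", "yesterday"]

def normalise(token, cutoff=0.6):
    """Return the closest standard spelling for an informal token, if any."""
    matches = get_close_matches(token.lower(), VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else token

for informal in ["2day", "tooday", "tday"]:
    print(informal, "->", normalise(informal))
```

All three variants land on "today" here, but the approach breaks down quickly ("2moro", "l8r"), which is why the workshop above turns to learned representations instead.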

This kind of problem is typical, and indeed very frequent, for languages
with no standard or a weak one. For instance, in Arpitan, alongside the
standard ending "ament" in many nouns and adverbs, I currently find
dozens of "ement" and even "èment". Similarly, I find "è" instead of
"ê", or the opposite, and "a" instead of "â", or the opposite. It's a
big mess when I get "real" texts from the net. But defining in the
monodix that every "ê" can be "è", and vice versa, and that every "â"
can be "a", and vice versa, would create a huge number of homonyms that
would make disambiguation almost impossible (so I won't do it).

Yes, I found the same in K'iche'. One of the things that can be done in
this case is to have a "spellrelax" transducer which is composed on top
of the other transducer.
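A toy sketch of the "spellrelax" idea (not the real HFST/Apertium machinery, and with invented lexicon entries and analyses): instead of duplicating entries in the lexicon, a relaxation relation maps variant graphemes onto the standard ones, and lookup goes through the composition of the two:

```python
# Sketch of composing a spell-relaxation step on top of a lexicon.
# Entries, tags, and the RELAX mapping are hypothetical.
from itertools import product

# Lexicon keyed by standard orthography.
LEXICON = {
    "fenêtra": "fenêtra<n><f><sg>",
    "pâla": "pâla<n><f><sg>",
}

# Each grapheme seen in "real" texts may relax to one or more standard ones.
RELAX = {"è": ["è", "ê"], "a": ["a", "â"]}

def relaxed_variants(form):
    """All spellings reachable by relaxing each character independently."""
    choices = [RELAX.get(ch, [ch]) for ch in form]
    return ("".join(combo) for combo in product(*choices))

def analyse(form):
    """Lookup composed with the relaxation: try every relaxed spelling."""
    return [LEXICON[v] for v in relaxed_variants(form) if v in LEXICON]

print(analyse("fenètra"))  # the non-standard "è" still reaches the entry
print(analyse("pala"))     # likewise "a" for "â"
```

Because the relaxation is applied only at lookup time, the lexicon itself stays free of the homonym-creating duplicates that would plague disambiguation if the variants were listed in the monodix directly.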

So, this kind of improvement may help translators of under-resourced
languages... if enormous corpora are not required to learn the "rules".

Well, enormous corpora are not a problem, so long as they are not required
in the under-resourced language. If a large French corpus can be used
to improve Arpitan, I don't see it as a problem.

Serge Sharoff is doing interesting things with embeddings and syncretism:

http://corpus.leeds.ac.uk/serge/publications/2019-jnle.pdf

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
