This looks interesting.
Note that generating target language morphology may not always be
possible, unless a "guessing" dictionary is created automatically from
both the source and target dictionaries. A "guessing" dictionary would
try to assign a morphological analysis to an unknown word by looking at
the morphology of known words in the dictionary...
This would be easy if one could, e.g., match suffixes to morphology in a
suffixing language.
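
As a rough illustration of the suffix-matching idea, here is a minimal
sketch in Python. It assumes the known words are available as plain
(surface form, tags) pairs; the function names, the tag strings and the
toy English entries are all hypothetical and are not taken from any
Apertium dictionary or tool.

```python
from collections import Counter, defaultdict


def build_suffix_guesser(known, max_suffix_len=4):
    """Map each word-final suffix to a count of the analyses seen with it."""
    table = defaultdict(Counter)
    for surface, tags in known:
        for n in range(1, min(max_suffix_len, len(surface)) + 1):
            table[surface[-n:]][tags] += 1
    return table


def guess(table, word, max_suffix_len=4):
    """Return the most frequent analysis for the longest known suffix, or None."""
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        candidates = table.get(word[-n:])
        if candidates:
            return candidates.most_common(1)[0][0]
    return None


# Toy "dictionary" (hypothetical entries and tag names).
KNOWN = [
    ("walked", "vblex.past"),
    ("jumped", "vblex.past"),
    ("talked", "vblex.past"),
    ("houses", "n.pl"),
    ("tables", "n.pl"),
]
GUESSER = build_suffix_guesser(KNOWN)
```

Here guess(GUESSER, "hanged") falls back from "nged" and "ged" to the
suffix "ed", which has only been seen with past-tense verbs. A real
guesser would need to weigh suffix length against frequency and deal
with ambiguity, which this sketch ignores.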
Mikel
On 21/3/20 at 15:37, Tanmai Khanna wrote:
Hey guys,
Dictionary trimming is the process of removing from the monolingual
language models (FSTs compiled from monodixes) those words and analyses
that have no entry in the bidix. It avoids a lot of untranslated lemmas
(marked with an @ when debugging) in the output, which cause problems
with comprehension and post-editing.
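
To make the idea concrete, here is a small set-based sketch in Python.
It is only an approximation: the actual trimming works on the compiled
FSTs, whereas this assumes the monodix is a plain mapping from surface
forms to (lemma, tags) analyses and the bidix is just a set of lemmas.
The function name, the data layout and the tags are hypothetical.

```python
def trim(monodix_analyses, bidix_lemmas):
    """Keep only the analyses whose lemma has an entry in the bidix."""
    trimmed = {}
    for surface, analyses in monodix_analyses.items():
        kept = [(lemma, tags) for lemma, tags in analyses if lemma in bidix_lemmas]
        if kept:  # a surface form with no surviving analysis becomes unknown
            trimmed[surface] = kept
    return trimmed


# Toy data (hypothetical tags): only "izara" has a bidix entry.
MONODIX = {
    "izarak": [("izara", "n.pl")],
    "izeki": [("izeki", "vblex")],
}
BIDIX_LEMMAS = {"izara"}
TRIMMED = trim(MONODIX, BIDIX_LEMMAS)
```

After trimming, "izeki" is no longer analysed at all, so the rest of
the pipeline sees it as an unknown word.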
There is a GSoC project
<http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Eliminate_trimming>
which aims to eliminate this trimming and propose a solution that
keeps its benefits. In this email I summarise the discussion that has
taken place so far.
By trimming the dictionary, you throw away valuable analyses of words
in the source language, which, if preserved, can be used as context
for lexical selection and analysis of the input. Also, several
transfer rules fail to match because the word is treated as unknown.
Several solutions are possible for avoiding trimming, some of which
have been discussed by Unhammer here
<http://wiki.apertium.org/wiki/Talk:Why_we_trim>. These involve
keeping both the surface form of the source word and its
lemma+analysis: use the analysis for as long as it is needed in the
pipeline, and then propagate the source form as an unknown word (as
trimming would).
Another interesting solution that was discussed was that instead of
just propagating the source surface form, we can output [source-word
lemma + target morphology], as is shown in this example by Mikel:
Translating from Basque to English:
"Andonik izarak izeki zuen" ('Andoni hung up the sheets') → 'Andoni
*izeki-ed the sheets'.
This might make the output more comprehensible and, to some extent,
easier to post-edit.
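
A minimal Python sketch of that fallback, assuming the target
morphology can be reduced to a lookup from tags to a suffix. The
function name, the tag strings and the suffix table are hypothetical;
a real implementation would have to go through the target-language
generator rather than glue strings together.

```python
# Hypothetical mapping from target-side tags to an English suffix.
SUFFIX_FOR_TAGS = {
    "vblex.past": "ed",
    "n.pl": "s",
}


def fallback_form(source_lemma, target_tags, suffix_for_tags=SUFFIX_FOR_TAGS):
    """Build '*lemma-suffix' for a source lemma that has no bidix entry."""
    suffix = suffix_for_tags.get(target_tags)
    if suffix is None:
        # No target morphology available: just mark the lemma as unknown.
        return "*" + source_lemma
    return "*" + source_lemma + "-" + suffix
```

With this table, fallback_form("izeki", "vblex.past") produces the
"*izeki-ed" form from the example above.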
If you have any significant pros, cons, or suggestions to add for this
project, please reply to this thread so that, if I work on this
project, I can do so fully informed.
Thanks and Regards,
Tanmai Khanna
--
*Khanna, Tanmai*
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
--
Mikel L. Forcada http://www.dlsi.ua.es/~mlf/
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03690 Sant Vicent del Raspeig
Spain
Office: +34 96 590 9776