Hey Hèctor,
You're right, creating an SL lemma + TL morph is not a trivial task, and
as we discussed on IRC recently, eliminating trimming is an essential
first step towards that goal.

So the task for this GSoC would probably be to first eliminate dictionary
trimming: use the source analysis, and output the source word's surface
form (as an unknown) instead of the source lemma. This would give us the
benefits of trimming without actually trimming. Then we can lay the
foundations for morph guessing, which can evolve over time - and yes,
initially it would be an optional module.
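
To make the first step concrete, here is a rough Python sketch of the idea, not the actual pipeline code. It assumes the analyser can be modelled as a plain mapping from surface forms to analyses and the bidix as a mapping from source to target lemmas; `transfer_word`, `monodix`, `bidix` and the tags are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def lemma_of(analysis: str) -> str:
    """Return the lemma part of an analysis such as 'izeki<vblex>'."""
    return analysis.split("<", 1)[0]


def transfer_word(
    surface: str,
    monodix: Dict[str, List[str]],
    bidix: Dict[str, str],
) -> Tuple[str, Optional[str]]:
    """Return (output form, source analysis kept for context).

    The analysis is kept even when the bidix has no entry, so earlier
    stages can still use it; only the output falls back to the surface
    form marked as unknown with '*'.
    """
    analyses = monodix.get(surface, [])
    if not analyses:
        return "*" + surface, None  # unknown to the analyser as well
    analysis = analyses[0]  # no disambiguation in this sketch
    lemma = lemma_of(analysis)
    if lemma in bidix:
        tags = analysis[len(lemma):]
        return bidix[lemma] + tags, analysis
    return "*" + surface, analysis  # analysed, but not in the bidix


# Toy data: hypothetical entries and tags.
monodix = {"izarak": ["izara<n><pl>"], "izeki": ["izeki<vblex>"]}
bidix = {"izara": "sheet"}
```

With this toy data, "izeki" still comes out as "*izeki", exactly as it would with trimming, but its analysis is returned alongside it and stays available to the rest of the pipe.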

Thanks for your comments.

Tanmai

On Sun, Mar 22, 2020 at 2:38 PM Hèctor Alòs i Font <hectora...@gmail.com>
wrote:

> I have some comments and questions from a simple Apertium user's point of
> view.
>
> In principle, I find the initial idea very useful: to go beyond trimming,
> maintaining its advantages, but keeping the source language information so
> that it is not lost for the transfer rules. Great!
>
> What I do not see so clearly is the (enormous) complication of inventing a
> word in the target language from more or less regular patterns in the
> target language (and maybe in the source too). If I understand correctly,
> it would work something like this: we have, for example, "desaladoras" in
> Spanish, and we don't have this word in the bilingual dictionary, so the
> new module would try to produce something like "*desaladora's", or even
> "*desalators", "*dissalators" or "*dissaltators". And the same would apply
> if the source and/or the target language were, say, Lingala, Tamil or
> Quechua. The more complex the morphology of the target language, the
> harder the task.
>
> This may be interesting, but the problem seems quite different from the
> first one and with a much higher degree of difficulty. I would separate the
> two issues. If we could ensure that we get a reliable tool for the first
> one, it would be very useful. If we also have a prototype for the second,
> which can be activated or not at the developer's discretion, it would be
> perfect.
>
> Hèctor
>
> Message from Mikel L. Forcada <m...@dlsi.ua.es> on Sun, 22 March 2020
> at 10:25:
>
>> For suffixing or prefixing languages, you could expand the morphological
>> dictionary and use an algorithm such as OSTIA (1) to learn morphological
>> analyses for word endings.
>>
>> Mikel
>>
>> (1) Oncina, J., García, P., Vidal, E., IEEE Trans. Pattern Anal. Mach.
>> Intell. 15:5 (1993) 448-458.
>>
>>
>> On 21 March 2020 at 21:12:16 CET, Tanmai Khanna <
>> khanna.tan...@gmail.com> wrote:
>>>
>>> Guessing the morphology would definitely require some creativity, but
>>> yes, a guessing dictionary could be created. As mentioned, it would assign
>>> morphs to morphological analyses in the TL. The easiest (and most naive)
>>> way to do this might be to take all the entries with a given analysis
>>> and find a common substring. It will be more complex for morphemes that
>>> aren't prefixes or suffixes, or are even process morphemes. However, a
>>> morph analyser that can assign morphs to analyses sounds like a good goal
>>> to work towards, and eliminating dictionary trimming is an essential step
>>> in that direction.
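
A minimal Python sketch of the naive approach described above, restricted to suffixes: group known surface forms by their tag string and take the longest suffix they all share as the guessed morph. The entries, the tags and the names `common_suffix` and `guess_morphs` are assumptions made for illustration, not existing Apertium code.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def common_suffix(words: List[str]) -> str:
    """Longest suffix shared by all words (empty string if none)."""
    reversed_words = [w[::-1] for w in words]
    # commonprefix compares character by character, so on reversed
    # strings it yields the reversed common suffix.
    return os.path.commonprefix(reversed_words)[::-1]


def guess_morphs(entries: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each tag string to the suffix shared by all its surface forms."""
    by_tags: Dict[str, List[str]] = defaultdict(list)
    for surface, tags in entries:
        by_tags[tags].append(surface)
    return {tags: common_suffix(forms) for tags, forms in by_tags.items()}


# Toy target-language entries: (surface form, hypothetical tags).
entries = [
    ("walked", "<vblex><past>"),
    ("jumped", "<vblex><past>"),
    ("hanged", "<vblex><past>"),
    ("sheets", "<n><pl>"),
    ("houses", "<n><pl>"),
]
morphs = guess_morphs(entries)
guessed = "izeki" + "-" + morphs["<vblex><past>"]
```

The sketch also shows how fragile the naive version is: a single irregular form such as "hung" under the same tags would reduce the shared suffix to the empty string, so a real guesser would need something more robust than a strict common substring.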
>>>
>>> Tanmai
>>>
>>> On Sat, Mar 21, 2020 at 9:48 PM Mikel L. Forcada <m...@dlsi.ua.es> wrote:
>>>
>>>> This looks interesting.
>>>>
>>>> Note that generating target language morphology may not always be
>>>> possible, unless a "guessing" dictionary is created automatically from both
>>>> the source and target dictionaries. A "guessing" dictionary would try to
>>>> assign a morphological analysis to an unknown word by looking at the
>>>> morphology of known words in the dictionary...
>>>>
>>>> This would be easy if one could, e.g. match suffixes to morphology in a
>>>> suffixing language.
>>>>
>>>> Mikel
>>>>
>>>>
>>>> On 21/3/20 at 15:37, Tanmai Khanna wrote:
>>>>
>>>> Hey guys,
>>>> Dictionary trimming is the process of removing those words and their
>>>> analyses from monolingual language models (FSTs compiled from
>>>> monodixes) which don't have an entry in the bidix, to avoid a lot of
>>>> untranslated lemmas (with an @ if debugging) in the output, which lead to
>>>> issues with comprehension and post-editing the output.
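
As a rough illustration of the definition above, trimming can be sketched in Python as a filter over analyses. This assumes the analyser is modelled as a plain mapping from surface forms to analyses and the bidix as a set of source lemmas; the real tools operate on compiled FSTs, and `trim`, `monodix`, `bidix_lemmas` and the tags are hypothetical.

```python
from typing import Dict, List, Set


def lemma_of(analysis: str) -> str:
    """Return the lemma part of an analysis such as 'izara<n><pl>'."""
    return analysis.split("<", 1)[0]


def trim(
    monodix: Dict[str, List[str]], bidix_lemmas: Set[str]
) -> Dict[str, List[str]]:
    """Keep only the analyses whose lemma has an entry in the bidix."""
    trimmed: Dict[str, List[str]] = {}
    for surface, analyses in monodix.items():
        kept = [a for a in analyses if lemma_of(a) in bidix_lemmas]
        if kept:  # a form with no surviving analysis becomes unknown
            trimmed[surface] = kept
    return trimmed


# Toy data: hypothetical entries and tags.
monodix = {"izarak": ["izara<n><pl>"], "izeki": ["izeki<vblex>"]}
bidix_lemmas = {"izara"}
trimmed = trim(monodix, bidix_lemmas)
```

After trimming, "izeki" is no longer in the analyser at all, so its analysis is lost to every later stage of the pipe.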
>>>>
>>>> There is a GSoC project
>>>> <http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Eliminate_trimming>
>>>> which aims to eliminate this trimming and propose a solution such that you
>>>> don't lose the benefits of dictionary trimming either. In this email I
>>>> will summarise the discussion that has taken place so far.
>>>>
>>>> By trimming the dictionary, you throw away valuable analyses of words
>>>> in the source language, which, if preserved, can be used as context for
>>>> lexical selection and analysis of the input. Also, several transfer
>>>> rules don't match as the word is shown as unknown.
>>>>
>>>> Several solutions are possible for avoiding trimming, some of which
>>>> have been discussed by Unhammer here
>>>> <http://wiki.apertium.org/wiki/Talk:Why_we_trim>. These involve
>>>> keeping the surface form of the source word as well as the
>>>> lemma+analysis: use the analysis for as long as you need it in the pipe,
>>>> and then propagate the source form as an unknown word (as would happen
>>>> with trimming).
>>>>
>>>> Another interesting solution that was discussed was that instead of
>>>> just propagating the source surface form, we can output [source-word
>>>> lemma + target morphology], as is shown in this example by Mikel:
>>>>
>>>> Translating from Basque to English:
>>>> "Andonik izarak izeki zuen" ('Andoni hung up the sheets') → "Andoni
>>>> *izeki-ed the sheets".
>>>>
>>>> This might help in comprehensibility of the output, and to some extent
>>>> even the post-editability.
>>>>
>>>> If you have any significant pros, cons, or suggestions to add for this
>>>> project, please reply to this thread so that, if I work on this project,
>>>> I can do so fully informed.
>>>>
>>>> Thanks and Regards,
>>>> Tanmai Khanna
>>>>
>>>> --
>>>> *Khanna, Tanmai*
>>>>
>>>>
>>>> _______________________________________________
>>>> Apertium-stuff mailing list
>>>> Apertium-stuff@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/apertium-stuff
>>>>
>>>> --
>>>> Mikel L. Forcada  http://www.dlsi.ua.es/~mlf/
>>>> Departament de Llenguatges i Sistemes Informàtics
>>>> Universitat d'Alacant
>>>> E-03690 Sant Vicent del Raspeig
>>>> Spain
>>>> Office: +34 96 590 9776
>>>>
>>>>
>>>
>>>
>>> --
>>> *Khanna, Tanmai*
>>>
>>
>> --
>> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>>
>


-- 
*Khanna, Tanmai*