On 2020-06-13 15:20, Tino Didriksen wrote:
I would like everyone to read and seriously consider this thread and
give your opinion. This meanders a bit, so please read it all.


Here is a non-exhaustive list of potential pitfalls of the "surface form is a tag" approach. As far as I understand, the objective is to be able to put the original surface form, rather than the lemma, in the output translation as an unknown token.

0) languages without spaces in the writing system:

   what is a surface form here? is it just the longest token matched?

1) compounds

i)  infrastruktuurontwikkelingsplan: does each part of the compound get
    the surface form tag? If so, what happens if one part of the compound
    is translated but the other parts aren't, e.g. would you get
*infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan?
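
To make the compound case concrete, here is a toy sketch in Python. It is not Apertium code: the segmentation of the compound, the one-entry dictionary, and the function name are all assumptions made up for illustration. It only shows how a per-part surface form tag could lead to the duplicated output asked about above.

```python
# Toy model only; nothing here is the Apertium API.
SURFACE = "infrastruktuurontwikkelingsplan"

# Assumed segmentation of the compound into lemmas (hypothetical analysis).
PARTS = ["infrastruktuur", "ontwikkeling", "plan"]

# Pretend that only the last part is in the bilingual dictionary.
BIDIX = {"plan": "plan"}


def translate_parts(parts, surface, bidix):
    """Translate each part; unknown parts fall back to the tagged surface form."""
    out = []
    for lemma in parts:
        if lemma in bidix:
            out.append(bidix[lemma])
        else:
            # Every part carries the full surface form, so every unknown
            # part prints the whole compound again.
            out.append("*" + surface)
    return out


result = " ".join(translate_parts(PARTS, SURFACE, BIDIX))
print(result)
```

Under these assumptions the whole compound is printed once per unknown part, followed by the translation of the one known part.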

2) contractions

i) chawe - if you attach the surface form to both parts and both are unknown, do you get the surface form twice in the output? If you only attach it to one, which one do you attach it to, and where is that decision made?

ii) dárselo - if you attach the surface form to the clitic pronouns in addition to the verb, what happens if the verb is not in the dictionary but the clitic pronouns are? do you get the surface form and the translations in the output?
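
The two contraction questions can be sketched the same way. Again this is a toy model, not Apertium code: the `policy` switch, the placeholder lemmas "part1" and "part2" for chawe, the analysis of dárselo, and the glosses "him" and "it" are all hypothetical, chosen only to show what each choice would print.

```python
# Toy model only; names, analyses and glosses are assumptions.
def render(units, surface, bidix, policy="all"):
    """Render the lemmas that came from one surface token.

    policy="all":   every unit carries the surface form.
    policy="first": only the first unit carries it; other unknown
                    units fall back to their lemma.
    """
    out = []
    for i, lemma in enumerate(units):
        if lemma in bidix:
            out.append(bidix[lemma])
        elif policy == "all" or i == 0:
            out.append("*" + surface)
        else:
            out.append("*" + lemma)
    return " ".join(out)


# chawe: both parts unknown, under each attachment policy.
both_unknown_all = render(["part1", "part2"], "chawe", {}, policy="all")
both_unknown_first = render(["part1", "part2"], "chawe", {}, policy="first")

# dárselo: verb unknown, clitic pronouns known.
verb_unknown = render(["dar", "le", "lo"], "dárselo", {"le": "him", "lo": "it"})

print(both_unknown_all)
print(both_unknown_first)
print(verb_unknown)
```

In this sketch the "all" policy repeats the surface form, the "first" policy needs a rule for which unit is first, and the dárselo case prints the full surface form next to the clitic translations that are already contained in it.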

I think that the appropriate way to deal with this is by coming up with a clear plan for the linguistic eventualities. I don't see that in the current proposal. I have been walking Tanmai through the creation of a new MT system, and we have been documenting these issues as they arise. I don't think it makes sense to start development before they have been resolved.

Fran


_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
