On 2020-06-13 15:20, Tino Didriksen wrote:
I would like everyone to read and seriously consider this thread and
give your opinion. This meanders a bit, so please read it all.
Here is a non-exhaustive list of potential pitfalls of the "surface
form is a tag" approach. As far as I understand, the objective is to be
able to put the original surface form, rather than the lemma, into the
output translation as an unknown token.
0) Languages without spaces in the writing system:
What is a surface form here? Is it just the longest token matched?
1) Compounds
i) infrastruktuurontwikkelingsplan - does each part of the compound get
the surface form tag? If so, what happens if one part of the compound
is translated but the other parts aren't? E.g. would you get
*infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan?
2) Contractions
i) chawe - if you attach the surface form to both parts and both are
unknown, do you get both in the output? If you only attach it to one,
which one do you attach it to, and where is that decision made?
ii) dárselo - if you attach the surface form to the clitic pronouns in
addition to the verb, what happens if the verb is not in the dictionary
but the clitic pronouns are? Do you get both the surface form and the
translations in the output?
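
To make cases 1) and 2.ii) concrete, here is a rough Python sketch. It
is only a toy model, not the Apertium pipeline and not the proposal's
actual design: the Unit class, attach_to_all(), transfer() and the two
small dictionaries are all hypothetical names invented for this
illustration, and the lemmas and target words are placeholders. It
assumes the naive policy "every part carries the whole surface form,
and every unknown part prints it".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    """One analysed part of a token (hypothetical toy structure)."""
    lemma: str
    surface: str  # whole original surface form, carried along like a tag


def attach_to_all(surface: str, lemmas: List[str]) -> List[Unit]:
    """Naive policy (an assumption): every part gets the full surface form."""
    return [Unit(lemma, surface) for lemma in lemmas]


def transfer(units: List[Unit], bidix: Dict[str, str]) -> str:
    """Translate known lemmas; emit '*' + surface form for unknown ones."""
    out = []
    for unit in units:
        target = bidix.get(unit.lemma)
        out.append(target if target is not None else "*" + unit.surface)
    return " ".join(out)


# Case 1: a compound where only the last part is in the toy dictionary.
compound = attach_to_all(
    "infrastruktuurontwikkelingsplan",
    ["infrastruktuur", "ontwikkeling", "plan"],
)
compound_out = transfer(compound, {"plan": "plan"})
print(compound_out)
# *infrastruktuurontwikkelingsplan *infrastruktuurontwikkelingsplan plan

# Case 2.ii: verb unknown, clitics known (placeholder lemmas and targets).
clitic = attach_to_all("dárselo", ["dar", "se", "lo"])
clitic_out = transfer(clitic, {"se": "to-him", "lo": "it"})
print(clitic_out)
# *dárselo to-him it
```

Under this policy the compound prints its surface form twice next to a
translated "plan", and the clitic case mixes the full surface form with
translations of parts that are already contained in it. Any other
policy (attach to the first part only, deduplicate at output, and so
on) needs to be decided somewhere, which is exactly the open question.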
I think the appropriate way to deal with this is to come up with a
clear plan for the linguistic eventualities. I don't see that in the
current proposal. I have been walking Tanmai through the creation of a
new MT system, and we have been documenting these issues as they arise.
I don't think it makes sense to start development before they have been
resolved.
Fran
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff