Before anything else, let me say that I like the proposal to enhance the
pipeline with more data (including, but not limited to, the surface forms),
so that we can properly do things that we're currently doing in very
hacky (to me) and definitely non-linguistic ways:

xavi@dell:~/src/apertium-spa$ echo "El mango" | apertium -d . spa-morph
^El/el<det><def><m><sg>$ ^mango/mango<n><m><sg>/mangar<vblex><pri><p1><sg>/
*mango_fruta<n><m><sg>*$^./.<sent>$


In this example, we "add" semantic information to the pipeline (and
disambiguate via CG3) by creating a "fake lemma" needed for SPA-CAT,
because "mango<n>" (a handle, e.g. of a pan) and "mango_fruta<n>" (the
fruit) are translated differently in Catalan. But this, in turn, forces
every other language pair using Spanish to know about "mango_fruta<n>",
even if its translation is the same as that of "mango<n>".
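
To make the cost of the fake lemma concrete, here is a minimal Python
sketch. It assumes the asterisks around the third analysis in the output
above are only emphasis, it ignores escaping in the stream format, and
FAKE_LEMMAS and collapse_fake_lemmas are hypothetical names that are not
part of any Apertium tool. It only illustrates the bookkeeping a pair that
does not care about the distinction still has to carry:

```python
import re

# The ambiguous lexical unit from the spa-morph output above.
UNIT = "^mango/mango<n><m><sg>/mangar<vblex><pri><p1><sg>/mango_fruta<n><m><sg>$"


def parse_lexical_unit(unit):
    """Split '^surface/analysis/...$' into (surface, [(lemma, [tags])]).

    Simplified on purpose: escaped characters are not handled.
    """
    body = unit.strip()
    if not (body.startswith("^") and body.endswith("$")):
        raise ValueError("not a lexical unit: %r" % unit)
    surface, *analyses = body[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed


# Hypothetical table a pair would need only because the fake lemma exists.
FAKE_LEMMAS = {"mango_fruta": "mango"}


def collapse_fake_lemmas(analyses):
    """Map fake lemmas back to the real lemma and drop duplicate readings."""
    collapsed = []
    for lemma, tags in analyses:
        reading = (FAKE_LEMMAS.get(lemma, lemma), tags)
        if reading not in collapsed:
            collapsed.append(reading)
    return collapsed


surface, analyses = parse_lexical_unit(UNIT)
collapsed = collapse_fake_lemmas(analyses)
```

Every pair other than SPA-CAT would need something equivalent to
FAKE_LEMMAS (or a duplicated bidix entry) just to undo a distinction it
never asked for.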

And yes, I know this example could also be solved with lex-tools, where
the translation would change based on the context. But "the rules" that
decide whether it's "mango<n>" or "mango_fruta<n>" do not depend on the
translation; they depend entirely on the source language. Ideally, I'd like
to have a module in apertium-spa that lets me add this semantic
information (it could perfectly well be an instance of lex-tools), and then
be able to use it (or not) in different language pairs.
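
A rough sketch of what I mean, in Python. Everything here is an assumption
made for illustration: the "sense" field is not part of the current stream
format, the context rule is a toy stand-in for real CG3 or lex-tools
rules, and "spa-xxx" and "TARGET_WORD" are placeholders for some other
pair and its translation:

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Reading:
    lemma: str
    tags: Tuple[str, ...]
    sense: Optional[str] = None  # hypothetical extra data, kept out of the lemma


# Toy source-side rule: these context lemmas suggest the fruit reading.
FRUIT_CONTEXT = {"comer", "maduro", "zumo"}


def annotate_sense(reading, context_lemmas):
    """Monolingual step: decide the sense from the Spanish context only."""
    if reading.lemma == "mango" and "n" in reading.tags:
        if FRUIT_CONTEXT & set(context_lemmas):
            return replace(reading, sense="fruta")
    return reading


def lookup(reading, bidix, use_sense):
    """Pair-side step: use the sense if the pair opts in, otherwise ignore it."""
    if use_sense and reading.sense is not None:
        key = (reading.lemma, reading.sense)
        if key in bidix:
            return bidix[key]
    return bidix[(reading.lemma, None)]


# spa-cat needs the distinction; the placeholder pair spa-xxx does not.
SPA_CAT = {("mango", None): "mànec", ("mango", "fruta"): "mango"}
SPA_XXX = {("mango", None): "TARGET_WORD"}

fruit = annotate_sense(Reading("mango", ("n", "m", "sg")), ["comer"])
plain = annotate_sense(Reading("mango", ("n", "m", "sg")), ["sartén"])
```

The point of the design is that the lemma stays "mango" everywhere: spa-cat
calls lookup(fruit, SPA_CAT, use_sense=True), while any other pair calls
lookup(fruit, SPA_XXX, use_sense=False) and never has to hear about the
sense at all.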

I was going to talk about the points Fran raises, which I think are
extremely valuable. But I think Tanmai's answer (which came in while I was
writing this) addresses them better than I would. With source identifiers,
we can keep the compound and contraction information as it was in the
source, and then decide what to do with it.

But I also don't think we can ask this implementation to solve all the
current and future problems the Apertium pipeline format has. As long as
backwards compatibility is ensured, I don't see why having "more data"
available would cause any problem. And if, for any reason, it turns out
that passing the surface form along can't be used in all cases, I still
think "being able to do it" (again, while ensuring backwards
compatibility) is worth it for the cases where it will be useful (and,
again, for underdeveloped pairs with extremely developed monolingual
dictionaries, being able to avoid trimming in order to pass morphological
information to transfer would be a HUGE win).

-- 
< Xavi Ivars >
< http://xavi.ivars.me >