Before anything else, let me say that I like the proposal to enhance the pipeline with more data (including, but not limited to, the surface forms), so that we can properly do things that we currently do in very hacky (to me) and definitely non-linguistic ways. For example:

xavi@dell:~/src/apertium-spa$ echo "El mango" | apertium -d . spa-morph
^El/el<det><def><m><sg>$ ^mango/mango<n><m><sg>/mangar<vblex><pri><p1><sg>/mango_fruta<n><m><sg>$^./.<sent>$

In this example, we "add" semantic information to the pipeline (and disambiguate via CG3) by creating a "fake lemma", mango_fruta<n><m><sg>. It is needed for SPA-CAT, because "mango<n>" (the handle of a pan) and "mango_fruta<n>" (the fruit) are translated differently in Catalan. But this, in turn, forces every other language pair using Spanish to know about "mango_fruta<n>", even if its translation is the same as that of "mango<n>".

And yes, I know this example could also be solved with lex-tools, where the translation would change based on the context. But "the rules" that decide whether it's "mango<n>" or "mango_fruta<n>" do not depend on the translation; they depend entirely on the source language. Ideally, I'd like to have a module in apertium-spa that lets me add this semantic information (it could perfectly well be an instance of lex-tools), and then be able to use it (or not) in different language pairs.

I was going to talk about the points Fran raises, which I think are extremely valuable. But I think Tanmai's answer (which arrived while I was writing this) addresses them better than I would. With source identifiers, we can keep the compound and contraction information as it was in the source, and then decide what to do with it.

But I also don't think we can ask this implementation to solve every current and future problem with the Apertium pipeline format. As long as backwards compatibility is ensured, I don't see how having "more data" available can cause any problem.

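To make the "use it (or not)" idea from the mango example above a bit more concrete, here is a rough Python sketch of what a pair that doesn't care about the distinction could do: fold the sense-marked lemma back into the plain one. This is only an illustration under my own assumptions, not an existing Apertium module: the function names and the sense_lemmas table are hypothetical, and the parser is simplified (it ignores escaped characters in the stream format).

```python
import re

# Matches one lexical unit in the Apertium stream format: ^surface/analysis/...$
# Simplified on purpose: escaped characters (\^, \/, \$) are not handled.
LU_RE = re.compile(r"\^([^$]*)\$")


def split_analysis(analysis):
    """Split 'mango_fruta<n><m><sg>' into ('mango_fruta', '<n><m><sg>')."""
    lemma, sep, tags = analysis.partition("<")
    return lemma, sep + tags


def drop_sense_marker(analysis, sense_lemmas):
    """Map a sense-marked lemma back to its plain lemma.

    sense_lemmas is a hypothetical table, e.g. {"mango_fruta": "mango"},
    that a language pair could apply when it does not need the distinction.
    """
    lemma, tags = split_analysis(analysis)
    return sense_lemmas.get(lemma, lemma) + tags


def collapse_senses(stream, sense_lemmas):
    """Rewrite every lexical unit, merging analyses that become identical."""
    def rewrite(match):
        surface, *analyses = match.group(1).split("/")
        seen = []
        for analysis in analyses:
            plain = drop_sense_marker(analysis, sense_lemmas)
            if plain not in seen:  # keep the first copy, preserve order
                seen.append(plain)
        return "^" + "/".join([surface] + seen) + "$"
    return LU_RE.sub(rewrite, stream)


if __name__ == "__main__":
    line = ("^El/el<det><def><m><sg>$ "
            "^mango/mango<n><m><sg>/mangar<vblex><pri><p1><sg>"
            "/mango_fruta<n><m><sg>$^./.<sent>$")
    print(collapse_senses(line, {"mango_fruta": "mango"}))
```

A pair like SPA-CAT would simply skip this step and keep "mango_fruta<n>", while other pairs would never need to know the fake lemma exists.
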
And if, for any reason, it turns out that passing the surface form along can't be used in all cases, I still think "being able to do it" (again, while ensuring backwards compatibility) is worth it for the cases where it is useful. And, again, for underdeveloped pairs with extremely developed monolingual dictionaries, being able to avoid trimming in order to pass morphological information to the transfer would be a HUGE win.

--
< Xavi Ivars >
< http://xavi.ivars.me >

_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff