Hi

It was so interesting to read the responses! I don't want to hog the list, so I'll just pick out 3 of the most engaging points in this post and then sidle back towards lurking mode :-)

Re Fran's trivial stemming being OK for a tagger, but not for an MT system,
---------------------------------------------------------------------------
this is indeed a valid point, so the suggestion may not be viable as far as MT goes. However, it is not entirely impractical. I can envisage something like the following, which assumes that the monodixes will have entries for surface and lemma. Taking the relatively rare word "conductress", the process might be as follows:

1. "conductress" is not in the surface column of the English monodix.
2. so, change -ress to -or+f (using a set of regex lookups appropriate to the language)
3. is "conductor" in the surface column of the English monodix?
4. yes, so find its equivalent noun in the other language in the bidix.
5. find that equivalent in the other language's monodix
6. is this equivalent marked f in the gender column?
7. no, so see if there are other noun items with the same lemma
8. are any of them marked f?
9. if so, choose that.
10. if not, use the original find

The lemma might hold the masculine singular form of nouns and adjectives, or the infinitive of verbs (or in the case of Swahili loan-words from Arabic, the Arabic 3-letter stem) - this is one of the things that might be decided per language or language-group.

In theory this should work, and the main benefit would be to enable guesses to be made about the meaning even if the word is not in the dictionary. For instance, the diminutive -ito/a/os/as in Spanish seems to be frequently used in Latin American Spanish, and since it is both regular and productive, it is nugatory to enter words with it into the dictionary (since in effect the number of words it could be used with is extremely large).
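As a rough illustration only, the ten-step lookup described earlier could be sketched in Python along the following lines. The dictionaries are toy stand-ins for the monodixes and bidix, and every name here (EN_MONODIX, BIDIX, ES_MONODIX, EN_SUFFIX_RULES, translate) is hypothetical; this is not how Apertium itself stores or queries its data.

```python
import re

# Toy English monodix (hypothetical): surface form -> lemma
EN_MONODIX = {"conductor": "conductor"}

# Toy bidix (hypothetical): English lemma -> surface form in the other language
BIDIX = {"conductor": "conductor"}

# Toy monodix for the other language (hypothetical): surface, lemma, tags
ES_MONODIX = [
    {"surface": "conductor", "lemma": "conductor", "tags": {"n", "m", "sg"}},
    {"surface": "conductora", "lemma": "conductor", "tags": {"n", "f", "sg"}},
]

# Step 2: regex lookups appropriate to the language (pattern, replacement, extra tags)
EN_SUFFIX_RULES = [(re.compile(r"ress$"), "or", {"f"})]


def analyse(word):
    """Steps 1-3: return (known surface form, extra tags), or None."""
    if word in EN_MONODIX:
        return word, set()
    for pattern, replacement, extra in EN_SUFFIX_RULES:
        candidate, count = pattern.subn(replacement, word)
        if count and candidate in EN_MONODIX:
            return candidate, set(extra)
    return None


def translate(word):
    """Steps 4-10: find the equivalent, preferring one carrying the extra tags."""
    found = analyse(word)
    if found is None:
        return None
    surface, extra = found
    target = BIDIX.get(EN_MONODIX[surface])                                 # step 4
    original = next((e for e in ES_MONODIX if e["surface"] == target), None)  # step 5
    if original is None:
        return None
    if extra <= original["tags"]:                                           # step 6
        return original["surface"]
    for entry in ES_MONODIX:                                                # steps 7-9
        if entry["lemma"] == original["lemma"] and extra <= entry["tags"]:
            return entry["surface"]
    return original["surface"]                                              # step 10


print(translate("conductress"))
```

A word already in the surface column goes straight through with no extra tags, and a word that no rule can rescue returns None, so the fallback costs something only for unknown words.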
Using the lookup process above would generate an English equivalent even if it were not in the dictionary, and if it were considered desirable to carry across the diminutive meaning (which in most cases is not really necessary), you could have another set of lookups as a post-processor on the other side. In English, perhaps something like "[small]" could be added for nouns, "[rather]" for adjectives, e.g. tiempito - [small] time, bajitos - [rather] low. I accept, though, that this might affect the speed of the translation, which may not be desirable, and that you may get some false positives.

Re morpheme boundaries and paradigms,
-------------------------------------
I was not really arguing that the orthographic segmentation should be *replaced* by a morphological segmentation. My point was rather that if morphological segmentation is, as Mikel rightly points out, often neither easy nor clearcut, then there is even less justification for an orthographical segmentation, and that in turn makes paradigms less compelling. For instance, the Welsh verb "canu" (to sing) has 1s present "canaf", 2s present "ceni", but these have to be represented as c/anaf, c/eni, which cuts across the morphemes. There is no point in having a "paradigm" like this, which is nothing to do with the morphology, but is just a coding construct. Neither is there any real logic in linking "deputy" to "baby", as opposed to entering them separately, or having a lookup rule that "Cy +n +sg" --> "Cies +n +pl".

It's not that it *can't* be done this way (plainly it can and has been), but that it adds another layer of complexity. And if paradigms don't *have* to be used (which I do know), Occam's Razor says they shouldn't be there at all (except perhaps as some sort of resource for a generator for those languages where they are useful).
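To make the lookup-rule alternative concrete, here is a minimal Python sketch of a rule like "Cy +n +sg" --> "Cies +n +pl". It assumes that "C" stands for any consonant and that tags are kept as a plain string; the rule table RULES and the function apply_rules are hypothetical names used only for illustration.

```python
import re

# Hypothetical rule table: (input tags, pattern, replacement, output tags).
# "C" in the rule is assumed to mean "any consonant".
RULES = [
    ("+n +sg", re.compile(r"([^aeiou])y$"), r"\1ies", "+n +pl"),
]


def apply_rules(surface, tags):
    """Return (new surface, new tags) for the first matching rule, else None."""
    for in_tags, pattern, replacement, out_tags in RULES:
        if tags != in_tags:
            continue
        new_surface, count = pattern.subn(replacement, surface)
        if count:
            return new_surface, out_tags
    return None


print(apply_rules("deputy", "+n +sg"))
print(apply_rules("baby", "+n +sg"))
```

With a rule of this kind, "deputy" and "baby" need not be linked to each other at all: each is entered separately and the same rule applies to both, while a word such as "day" (vowel before the y) simply does not match.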
I take Mikel's point that the paradigms are long-standing, and that you could simplify the paradigm reference to something like "liberación as acción", and also Jimmy's that (assuming the speling format is extensible) the dix generation and future updates can already be done from such a grid behind the scenes (though I do wonder in that case why a GUI has proven difficult to code). But I suppose what I mean is *why* do you need "liberación as acción"? Why not have just:

liberación +n +f +sg
acción +n +f +sg

In other words, each word has its own set of attributes in a grid. If, as Mikel says, it is faster to do it with paradigms, doesn't that have to be weighed against the corresponding issues it may raise in dictionary maintenance? I might add that single-entry attributes are also more extensible - for instance, you could add attributes relating to the semantic space (perhaps from Wordnet) which might help in lexical selection. With paradigms, you either have to create a new one, or qualify it in some way.

Re Mikel's point about enclitics in the dictionary,
---------------------------------------------------
Suffix-processing is indeed the way I have done it - though I'm sure others could do it more elegantly. :-) Regexes slice off the possible clitic patterns, and there is a separate entry for the accented verb that is left (so you have "máta +v +2s +imper +preclitic" as well as "mata +v +2s +imper").

FWIW, I agree with all of Fran's conclusions. :-)

--
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
