Hi

It was so interesting to read the responses!  I don't want to hog the list, so 
I'll just pick out 3 of the most engaging points in this post and then sidle 
back towards lurking mode :-)


Re Fran's trivial stemming being OK for a tagger, but not for an MT system,
-----------------------------------------------------------------------------------------------------------
this is indeed a valid point, so the suggestion may not be viable as far as MT 
goes.

However, it is not entirely impractical.  I can envisage something like the 
following, which assumes that the monodixes will have entries for surface and 
lemma. Taking the relatively rare word "conductress", the process might be as 
follows:
1. "conductress" is not in the surface column of the English monodix.
2. so, change -ress to -or+f (using a set of regex lookups appropriate to the 
language)
3. is "conductor" in the surface column of the English monodix?
4. yes, so find its equivalent noun in the other language in the bidix.
5. find that equivalent in the other language's monodix
6. is this equivalent marked f in the gender column?
7. no, so see if there are other noun items with the same lemma
8. are any of them marked f?
9. if so, choose that.
10. if not, use the original find
The lemma might hold the masculine singular form of nouns and adjectives, or 
the infinitive of verbs (or in the case of Swahili loan-words from Arabic, the 
Arabic 3-letter stem) - this is one of the things that might be decided per 
language or language-group.
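
A rough sketch of the ten steps above, in Python.  Everything in it is an
assumption made for illustration: the toy dictionaries, the Spanish entries
"cobrador/cobradora", and the names guess(), translate() and SUFFIX_RULES are
all hypothetical, and none of it is Apertium code or Apertium's dix format.

```python
import re

# Toy data, invented for illustration only.
EN_MONODIX = {"conductor": {"lemma": "conductor", "pos": "n"}}  # keyed by surface
BIDIX = {"conductor": "cobrador"}            # English lemma -> other-language lemma
ES_MONODIX = [                               # rows with surface, lemma, gender
    {"surface": "cobrador", "lemma": "cobrador", "gender": "m"},
    {"surface": "cobradora", "lemma": "cobrador", "gender": "f"},
]
# Hypothetical regex lookups: (pattern, replacement, gender the suffix implies).
SUFFIX_RULES = [(re.compile(r"ress$"), "or", "f")]


def translate(lemma, gender):
    """Steps 4-10: bidix lookup, then prefer a form with the wanted gender."""
    target = BIDIX.get(lemma)                                    # step 4
    original = next((r for r in ES_MONODIX if r["surface"] == target), None)
    if original is None:                                         # step 5 failed
        return None
    if gender is None or original["gender"] == gender:           # step 6
        return original["surface"]
    for row in ES_MONODIX:                                       # steps 7-9
        if row["lemma"] == original["lemma"] and row["gender"] == gender:
            return row["surface"]
    return original["surface"]                                   # step 10


def guess(word):
    """Steps 1-3: try the word itself, then each suffix rewrite."""
    if word in EN_MONODIX:
        return translate(EN_MONODIX[word]["lemma"], None)
    for pattern, repl, gender in SUFFIX_RULES:
        candidate, changed = pattern.subn(repl, word)
        if changed and candidate in EN_MONODIX:
            return translate(EN_MONODIX[candidate]["lemma"], gender)
    return None


print(guess("conductress"))
```

With this toy data, "conductress" is rewritten to "conductor" and the feminine
row sharing its lemma is chosen; a word that no rule rescues simply returns
None.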

In theory this should work, and the main benefit would be to enable guesses to 
be made about the meaning even if the word is not in the dictionary.  For 
instance, the diminutive -ito/a/os/as in Spanish seems to be frequently used 
in Latin American Spanish, and since it is both regular and productive, it is 
nugatory to enter words with it into the dictionary (since in effect the number 
of words it could be used with is extremely large).  Using the above process 
would generate an English equivalent even if it were not in the dictionary, 
and if it were considered desirable to carry across the diminutive meaning 
(which in most cases is not really necessary), you could have another set of 
lookups as a post-processor on the other side.  In English, perhaps something 
like "[small]" could be added for nouns, "[rather]" for adjectives, e.g. 
tiempito - [small] time, bajitos - [rather] low.
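
As a minimal sketch of that idea (Python again; the two-entry dictionary, the
MARKER table and the function name are hypothetical, and spelling changes such
as c/qu before the suffix are deliberately not handled):

```python
import re

# -ito/-ita/-itos/-itas: keep the stem and the gender/number ending.
DIMINUTIVE = re.compile(r"^(.+)it(o|a|os|as)$")

# Toy Spanish -> English dictionary keyed by surface form (invented data).
ES_EN = {"tiempo": ("time", "n"), "bajos": ("low", "adj")}

# Post-processing markers on the English side, as suggested above.
MARKER = {"n": "[small]", "adj": "[rather]"}


def guess_diminutive(word):
    """Strip the diminutive, look up the base form, add a marker."""
    match = DIMINUTIVE.match(word)
    if match is None:
        return None
    base = match.group(1) + match.group(2)   # tiemp + o, baj + os
    entry = ES_EN.get(base)
    if entry is None:                        # base not in the dictionary
        return None
    gloss, pos = entry
    return f"{MARKER.get(pos, '')} {gloss}".strip()


print(guess_diminutive("tiempito"))
print(guess_diminutive("bajitos"))
```

A word such as "mosquito" also matches the pattern, but its stripped form is
not in the dictionary, so nothing is guessed for it.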

I accept, though, that this might affect the speed of the translation, which 
may not be desirable, and that you may get some false positives.


Re morpheme boundaries and paradigms,
------------------------------------------------------------
I was not really arguing that the orthographic segmentation should be 
*replaced* by a morphological segmentation.  My point was rather that if 
morphological segmentation is, as Mikel rightly points out, often neither easy 
nor clearcut, then there is even less justification for an orthographical 
segmentation, and that in turn makes paradigms less compelling. 

For instance, the Welsh verb "canu" (to sing) has 1s present "canaf", 2s 
present "ceni", but these have to be represented as c/anaf, c/eni, which cuts 
across the morphemes.  There is no point in having a "paradigm" like this, 
which is nothing to do with the morphology, but is just a coding construct.   
Neither is there any real logic in linking "deputy" to "baby", as opposed to 
entering them separately, or having a lookup rule that "Cy +n +sg" --> "Cies 
+n +pl".  It's not that it *can't* be done this way (plainly it can and has 
been), but that it adds another layer of complexity.  And if paradigms don't 
*have* to be used (and I do know that they don't), Occam's Razor says they shouldn't be 
there at all (except perhaps as some sort of resource for a generator for 
those languages where they are useful).
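
The lookup rule "Cy +n +sg" --> "Cies +n +pl" could be sketched like this
(Python; the function name is hypothetical, and the plain "+s" fallback is my
own assumption, added only so that the function always returns something):

```python
import re

# "C" is read here as "any letter that is not a vowel".
CONSONANT_Y = re.compile(r"([^aeiou])y$")


def pluralise(noun):
    """Apply the Cy -> Cies rule, otherwise fall back to adding -s."""
    if CONSONANT_Y.search(noun):
        return CONSONANT_Y.sub(r"\1ies", noun)
    return noun + "s"          # assumed default, not part of the rule above


for word in ("deputy", "baby", "day"):
    print(word, "->", pluralise(word))
```

With a rule like this, "deputy" and "baby" need no link to each other in the
dictionary; each is just a noun that the rule happens to cover.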

I take Mikel's point that the paradigms are long-standing, and that you could 
simplify the paradigm reference to something like "liberación as acción", and 
also Jimmy's that (assuming the speling format is extensible) the dix 
generation and future updates can already be done from such a grid behind the 
scenes (though I do wonder in that case why a GUI has proven difficult to code).

But I suppose what I mean is *why* do you need "liberación as acción"?  Why 
not have just:
liberación +n +f +sg
acción +n +f +sg 
In other words, each word has its own set of attributes in a grid.  If, as 
Mikel says, it is faster to do it with paradigms, doesn't that have to be 
weighed against the corresponding issues it may raise in dictionary 
maintenance?  I might add that single-entry attributes are also more 
extensible - for instance, you could add attributes relating to the semantic 
space (perhaps from Wordnet) which might help in lexical selection.  With 
paradigms, you either have to create a new one, or qualify it in some way.
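
To make the "grid" idea concrete, here is a small sketch (Python; the dict
layout, the parse_entry() name and the "sem" column with its value are all
hypothetical, chosen only to show one row gaining an attribute without
touching any other row):

```python
def parse_entry(line):
    """Turn 'liberación +n +f +sg' into one row of the grid."""
    surface, *tags = line.split()
    return {"surface": surface, "tags": [tag.lstrip("+") for tag in tags]}


grid = [parse_entry(line) for line in (
    "liberación +n +f +sg",
    "acción +n +f +sg",
)]

# Extending a single entry: a made-up semantic label on one row only.
grid[0]["sem"] = "hypothetical-semantic-label"

for row in grid:
    print(row)
```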


Re Mikel's point about enclitics in the dictionary, 
--------------------------------------------------------------------
Suffix-processing is indeed the way I have done it - though I'm sure others 
could do it more elegantly. :-)  Regexes slice off the possible clitic 
patterns, and there is a separate entry for the accented verb that is left (so 
you have "máta +v +2s +imper +preclitic" as well as "mata +v +2s +imper").
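
A simplified sketch of that suffix-processing (Python; this is not my actual
code, the clitic list is only a partial one for Spanish, a single clitic is
handled, and the names are hypothetical):

```python
import re

# Shortest possible stem followed by exactly one clitic at the end of the word.
CLITIC = re.compile(r"^(.+?)(me|te|se|lo|la|le|nos|los|las|les)$")

# Toy dictionary with the two entries mentioned above.
VERBS = {
    "máta": ["v", "2s", "imper", "preclitic"],
    "mata": ["v", "2s", "imper"],
}


def analyse(word):
    """Return (stem, tags, clitic), or None if the word is not recognised."""
    if word in VERBS:                          # plain form, no clitic
        return word, VERBS[word], None
    match = CLITIC.match(word)
    if match is None:
        return None
    stem, clitic = match.groups()
    tags = VERBS.get(stem)
    if tags is None or "preclitic" not in tags:
        return None                            # stem must be a pre-clitic entry
    return stem, tags, clitic


print(analyse("mátalo"))
print(analyse("mata"))
```

Requiring the "preclitic" tag on the stem is what keeps the unaccented entry
from being paired with a clitic by mistake.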


FWIW, I agree with all of Fran's conclusions. :-)

-- 
Pob hwyl / Best wishes

Kevin Donnelly
kevindonnelly.org.uk
