Mikel Forcada <[email protected]> writes:

> Hi there.
>
> A quick quirk. I may go through the rest of the message later.
>
> On 11/15/2011 12:16 AM, Kevin Donnelly wrote:
>> Re Fran's trivial stemming being OK for a tagger, but not for an MT system,
>> -----------------------------------------------------------------------------------------------------------
>> this is indeed a valid point, so the suggestion may not be viable as far as
>> MT goes.
>>
>> However, it is not entirely impractical.  I can envisage something like the
>> following, which assumes that the monodixes will have entries for surface and
>> lemma. Taking the relatively rare word "conductress", the process might be as
>> follows:
>> 1. "conductress" is not in the surface column of the English monodix.
>> 2. so, change -ress to -or+f (using a set of regex lookups appropriate to the
>> language)
>> 3. is "conductor" in the surface column of the English monodix?
>> 4. yes, so find its equivalent noun in the other language in the bidix.
>> 5. find that equivalent in the other language's monodix
>> 6. is this equivalent marked f in the gender column?
>> 7. no, so see if there are other noun items with the same lemma
>> 8. are any of them marked f?
>> 9. if so, choose that.
>> 10. if not, fall back to the equivalent found in step 4.
>> The lemma might hold the masculine singular form of nouns and adjectives, or
>> the infinitive of verbs (or in the case of Swahili loan-words from Arabic, the
>> Arabic 3-letter stem) - this is one of the things that might be decided per
>> language or language-group.
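
To make this concrete, here's a rough Python sketch of steps 1-10; the
dictionaries and rewrite rules below are made-up stand-ins for illustration
(using Spanish as the "other language"), not real Apertium data structures:

    import re

    # Hypothetical lookup tables standing in for the monodixes and the bidix;
    # in reality these would be lttoolbox binaries queried via lt-proc.
    EN_MONODIX = {            # surface form -> (lemma, pos, gender)
        "conductor": ("conductor", "n", "m"),
    }
    BIDIX_EN_ES = {           # English lemma -> Spanish lemma
        "conductor": "conductor",
    }
    ES_MONODIX = {            # Spanish lemma -> list of (surface form, gender)
        "conductor": [("conductor", "m"), ("conductora", "f")],
    }

    # Step 2: language-specific regex rewrites for unknown surface forms.
    REWRITES = [
        (re.compile(r"ress$"), "or", {"gen": "f"}),  # conductress -> conductor + f
    ]

    def guess(surface):
        # Step 1: only fall back to guessing if the form is unknown.
        if surface in EN_MONODIX:
            return EN_MONODIX[surface]
        for rx, repl, feats in REWRITES:
            candidate = rx.sub(repl, surface)
            # Step 3: is the rewritten form in the source monodix?
            if candidate not in EN_MONODIX:
                continue
            lemma, pos, _ = EN_MONODIX[candidate]
            # Steps 4-5: follow it through the bidix into the target monodix.
            target = BIDIX_EN_ES.get(lemma)
            if target is None:
                continue
            entries = ES_MONODIX.get(target, [])
            # Steps 6-9: prefer a target entry matching the guessed gender.
            for form, gender in entries:
                if gender == feats.get("gen"):
                    return (form, pos, gender)
            # Step 10: otherwise fall back to the equivalent found first.
            if entries:
                return (entries[0][0], pos, entries[0][1])
        return None

    print(guess("conductress"))  # -> ('conductora', 'n', 'f')
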
>>
>> In theory this should work, and the main benefit would be to enable guesses to
>> be made about the meaning even if the word is not in the dictionary.  For
>> instance, the diminutive -ito/a/os/as seems to be frequently used in Latin
>> American Spanish, and since it is both regular and productive, it is nugatory
>> to enter words carrying it into the dictionary (since in effect the number of
>> words it could be used with is extremely large).  Using the above process
>> would generate an English equivalent even if it were not in the dictionary,
>> and if it were considered desirable to carry across the diminutive meaning
>> (which in most cases is not really necessary), you could have another set of
>> lookups as a post-processor on the other side.  In English, perhaps something
>> like "[small]" could be added for nouns, "[rather]" for adjectives, e.g.
>> tiempito - [small] time, bajitos - [rather] low.
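
The post-processing lookups on the English side could then be as simple as
something like this (the markers and part-of-speech labels are, again, just
made up for illustration):

    # Hypothetical target-side post-processing: if the source form carried a
    # diminutive, prepend a marker chosen by part of speech.
    DIM_MARKERS = {"n": "[small]", "adj": "[rather]"}

    def add_diminutive_marker(translation, pos, had_diminutive):
        if had_diminutive and pos in DIM_MARKERS:
            return DIM_MARKERS[pos] + " " + translation
        return translation

    print(add_diminutive_marker("time", "n", True))   # -> [small] time
    print(add_diminutive_marker("low", "adj", True))  # -> [rather] low
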
>>
>> I accept, though, that this might affect the speed of the translation, which
>> may not be desirable, and that you may get some false positives.
> This is basically how ispell and other spell checkers work (though, 
> granted, only for suffix morphology) and in fact there was a Portuguese 
> group that built something called jspell that did output morphological 
> information as part of spell checking. I think their GPL Portuguese 
> dictionary was used for Portuguese dictionaries in Apertium but I am not 
> sure.
>
> Two problems come to mind:
>
> (1) many of Kevin's transformation rules may match a given entry; this may be 
> computationally more intensive (compared to the finite-state transducers 
> used in Apertium). However, maybe they don't have to be run at runtime: 
> they could be applied en masse at compile time to generate a .dix. I 
> would have to think harder about this.

Should be fairly simple to run over a corpus and find all "-ito" words that
have corresponding non-"-ito" nouns, etc.; i.e. just run the process described
above "offline", as a method of increasing dictionary size. I'll bet Jimmy has
performed variations on this theme already :)
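
Something along these lines would do for a first pass (purely a sketch: the
file names are hypothetical, and it assumes a tokenised one-word-per-line
corpus plus a plain word list extracted from the monodix):

    import re
    from collections import Counter

    # Count surface forms in the corpus.
    counts = Counter(line.strip().lower()
                     for line in open("es.corpus.words.txt") if line.strip())

    # Words the monodix already knows about.
    known = set(line.strip() for line in open("es.monodix.words.txt"))

    # Print -ito/-itos forms whose non-diminutive base is already known;
    # these are candidates for generating new .dix entries offline.
    for word, freq in counts.most_common():
        m = re.match(r"(.+)itos?$", word)
        if not m:
            continue
        base = m.group(1) + "o"        # tiempito(s) -> tiempo
        if word not in known and base in known:
            print(word, base, freq, sep="\t")
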

> (2) you may get lexical forms which have no match in the bilingual 
> dictionary! The only way to avoid this would be to mark some of the 
> surface forms as lemmas and make sure there are none that are not in the 
> dictionary.

In the Northern Sámi→Norwegian Bokmål (sme→nob) translator (which is
definitely meant for gisting rather than post-editing), the analyser
does some of this, i.e. "online" / "productive" derivations. However, we
have the principle that the lemma is never changed, only tags are added.

The sme analyser uses HFST, but technically there isn't a problem with
representing the derivation 'tiempito' in lttoolbox in exactly the same
way as in HFST[1]. Instead of

    <pardef n="abismo__n">
      <e><p><l>s</l><r><s n="n"/><s n="m"/><s n="pl"/></r></p></e>
      […]
    <e lm="tiempo"><i>tiempo</i><par n="abismo__n"/>    </e>

you would have something like

    <pardef n="abism/o__n">
      <e><p><l>os</l><r><s n="n"/><s n="m"/><s n="pl"/></r></p></e>
      <e><p><l>itos</l><r><s n="n"/><s n="m"/><s n="pl"/><s n="pl"/><s 
n="dim"/></r></p></e>
      […]
    <e lm="tiempo"><i>tiemp</i><par n="abism/o__n"/>    </e>

Thus 'tiempito' would still have the lemma 'tiempo', and transfer would
have to deal with the <dim> (or whatever) tag. 
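
With the pardef above, e.g. 'tiempitos' would then come out of the analyser
as something like

    ^tiempitos/tiempo<n><m><pl><dim>$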


In this way you could add 'small' in transfer and get 'a small time',
but perhaps 'a short while' would be better? In the worst case, the
difference for a bilingual post-editor would be between:

1. read "… *tiempito …"
2. edit to "… a short while …"

and 

1. read "… a small time …" (and experience a feeling of dissonance)
2. check source text and find it was 'tiempito'
3. edit to "… a short while …"


Another problem is that transfer becomes more complex: do you insert
'small' before or after other adjectives (adverbs, preadverbs) in a
chunk? You now have to think about this for every possible noun chunking
rule.


Unfortunately, the original sme analyser, which we use as a basis in
sme-nob, is even more complex, since it is meant to cover pretty much all
productive derivations. It's very good for annotating and disambiguating
a corpus, but without modifications it is too complex for MT.

"All productive derivations" includes derivations that change the
part-of-speech, and compounds, and derivations of derivations … If you
can have a diminutive of a deverbal noun, you have to think about how to
add 'small' in all rules, including those that originally were meant for
verbs (a transfer rule pattern matching "v.*" will match
"v.derivation.n.*").

In sme→nob, we restrict the possible derivations in the analyser quite a
lot, and only stick to a small set of single derivations (no derivations
of derivations) for which it is easy to find a way of rewriting that
sounds alright and doesn't induce too much transfer complexity. Even so,
most of the time spent debugging transfer/bidix and the analyser stems
from derivations, and if I spoke Sámi I'm pretty sure my time would have
been better spent on adding words to bidix rather than on trying to
juggle all the possible ways in which derivations interact.

However, derivations can work if 
(1) the derivation is high enough frequency, and
(2) it is possible to deal with it in transfer (and the analyser) in a
    simple way, and 
(3) it is possible to make the translation sound good
    while preserving the meaning, and
(4) the translator is meant for gisting/assimilation, not
    post-editing/dissemination, and 
(5) you have a lot of time on your hands.

Elsewise, I'm not sure it's worth it.


[1] The main thing HFST adds is 'flag diacritics', which basically allow
    you to put restrictions on which tags can go together in one
    analysis. Thus you could put optional diminutives at the end of
    _all_ noun analyses, and if there's a certain noun that can't have a
    diminutive, you just put a special 'hidden tag' on that particular
    noun in its <section>; the diminutive line of the noun pardef then
    adds another 'hidden' tag that is incompatible with the first one,
    which rules out analyses that contain both tags. You could achieve
    the same effect in lttoolbox by duplicating all noun pardefs into
    with_diminutive- and without_diminutive-versions.


best regards,
Kevin Brubeck Unhammer

