Per Tunedal <[email protected]>
writes:

[...]

> The noun "kjempe" is advertised as possible to use in compounds, yet
> there is an entry for the adjective "kjempehøy" (= very high/tall). Why?

Assume you have dynamic[1] compounding turned on for the open classes
nouns, verbs, adjectives – these are all fairly common in compounding
(though nouns cover over 70 % in nn/nb), and you remove "kjempehøy" from
your dictionary.

Now, since nb.dix has these analyses of "kjempe" and "høy":

    
    kjempe<vblex><inf>/kjempe<n><m><sg><ind>/kjempe<n><f><sg><ind>/kjempe<n><m><sg><ind>/kjempe<n><f><sg><ind>
    høye<vblex><imp>/høy<n><nt><sg><ind>/høy<n><nt><pl><ind>/høy<adj><posi><mf><sg><ind>

your compound analysis will be ambiguous over at least:

    kjempe<n><f><sg><ind>+høy<n><nt><pl><ind>
    kjempe<n><f><sg><ind>+høy<n><nt><sg><ind>
    kjempe<n><f><sg><ind>+høye<vblex><imp>
    kjempe<n><f><sg><ind>+høy<adj><posi><mf><sg><ind>
    kjempe<n><m><sg><ind>+høy<n><nt><sg><ind>
    kjempe<n><m><sg><ind>+høy<n><nt><pl><ind>
    kjempe<n><m><sg><ind>+høye<vblex><imp>
    kjempe<n><m><sg><ind>+høy<adj><posi><mf><sg><ind>
    kjempe<vblex><inf>+høy<n><nt><pl><ind>
    kjempe<vblex><inf>+høy<n><nt><sg><ind>
    kjempe<vblex><inf>+høye<vblex><imp>
    kjempe<vblex><inf>+høy<adj><posi><mf><sg><ind>
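
To make the count concrete, here is a minimal Python sketch that pairs every analysis of the first part with every analysis of the second. It is only an illustration of the arithmetic, not how the analyser itself is implemented; the names KJEMPE, HOY and compound_readings are made up for this example, and the duplicate analyses of "kjempe" are listed once.

```python
from itertools import product

# Analyses copied from the message above (duplicates listed once).
KJEMPE = [
    "kjempe<vblex><inf>",
    "kjempe<n><m><sg><ind>",
    "kjempe<n><f><sg><ind>",
]
HOY = [
    "høye<vblex><imp>",
    "høy<n><nt><sg><ind>",
    "høy<n><nt><pl><ind>",
    "høy<adj><posi><mf><sg><ind>",
]


def compound_readings(first, second):
    """Pair every analysis of the first part with every analysis of the second."""
    return [f"{a}+{b}" for a, b in product(first, second)]


readings = compound_readings(KJEMPE, HOY)
print(len(readings))  # 3 * 4 = 12
```

Every extra analysis on either part multiplies the total, which is why the list grows so quickly.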

And it gets even worse if there's some possibility of segmenting at the
wrong place, e.g. Bokmål 'te+skje' (tea+spoon) could be mis-segmented
'te+s+kje' (tea+epenthetic+kid goat), similarly 'bilde+liste'
(image+list) vs 'bildel+iste' (car part+iced/car part+iced tea).

Compare this with the ambiguity-count of the analysis given when we do
have "kjempehøy" in the dictionary:

    kjempehøy<adj><posi><mf><sg><ind>

Only one analysis, and it's the correct one. 

So you avoid useless ambiguity by adding more compounds. Useless
ambiguity is harmful not only to the translation of that word, but of
the context (given the sequence "<adj> <vblex>/<n>", it's easy to see
that the second word is most likely a noun, not so with
"<adj>/<n>/<vblex> <vblex>/<n>").


In addition to all that, a decompounding analysis takes a lot longer per
word than a simple analysis (you have to check all the possible ways of
segmenting the word into two parts, then three parts, etc.). Adding full
compound words also helps when decompounding compounds of compounds:
it's safer and faster to segment 'bildeliste+generator' than
'bilde+liste+generator', where you might end up with
'bildel+iste+generator'.
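
The segmentation problem can be sketched with a toy enumerator in Python. Everything here is an assumption made for illustration: the tiny PARTS set, the treatment of the epenthetic "s" as an ordinary part, and the fewest-parts preference in fewest_parts. The real analyser works on compiled dictionaries and does not look like this.

```python
def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            for rest in segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results


def fewest_parts(splits):
    """Keep only the segmentations with the smallest number of parts."""
    if not splits:
        return []
    shortest = min(len(s) for s in splits)
    return [s for s in splits if len(s) == shortest]


# Toy part list; "s" stands in for the epenthetic linking element.
PARTS = {"te", "skje", "s", "kje", "bilde", "liste", "bildel", "iste", "generator"}

print(segmentations("teskje", PARTS))
# [['te', 's', 'kje'], ['te', 'skje']]

without_entry = fewest_parts(segmentations("bildelistegenerator", PARTS))
with_entry = fewest_parts(segmentations("bildelistegenerator", PARTS | {"bildeliste"}))
print(len(without_entry))  # 2: bilde+liste+generator and bildel+iste+generator
print(with_entry)          # [['bildeliste', 'generator']]
```

With "bildeliste" in the part list, the two-part split wins under the fewest-parts preference and the 'bildel+iste' reading never surfaces.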

And finally, sometimes the whole is greater than the sum of its parts,
e.g. Bokmål 'kjempemessig' might be better translated to 'ovstor' or
'diger' in Nynorsk, 'bedømmelseskommité'→'domsnemnd' etc.


In summary: Dynamic compounding leads to more ambiguity and slower
analysis, and is thus used only when there is no lexicalised analysis.
Adding lexicalised compounds improves not only analysis of those
compounds and their contexts, but also improves dynamic compounding of
longer compounds.
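
The "lexicalised first, dynamic as fallback" behaviour can be sketched as follows. This is a toy Python model under stated assumptions, not the actual analyser: LEXICON is a hypothetical dictionary from surface forms to analyses, analyse is a made-up helper, and only two-part splits are tried.

```python
# Hypothetical toy lexicon: surface form -> list of analyses.
LEXICON = {
    "kjempe": [
        "kjempe<vblex><inf>",
        "kjempe<n><m><sg><ind>",
        "kjempe<n><f><sg><ind>",
    ],
    "høy": [
        "høye<vblex><imp>",
        "høy<n><nt><sg><ind>",
        "høy<n><nt><pl><ind>",
        "høy<adj><posi><mf><sg><ind>",
    ],
}


def analyse(word, lexicon):
    """Use a lexicalised entry if there is one, else guess two-part compounds."""
    if word in lexicon:
        return list(lexicon[word])
    readings = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            readings += [f"{a}+{b}" for a in lexicon[left] for b in lexicon[right]]
    return readings


print(len(analyse("kjempehøy", LEXICON)))  # 12 dynamic readings

# Adding the lexicalised compound overrides the dynamic guess.
lexicalised = {**LEXICON, "kjempehøy": ["kjempehøy<adj><posi><mf><sg><ind>"]}
print(analyse("kjempehøy", lexicalised))
# ['kjempehøy<adj><posi><mf><sg><ind>']
```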


> BTW I've found only one similar Danish word: "kæmpestor" (very large). I
> don't know if there are any more.

If "kæmpe-" is not very productive in Danish, it might be better to
translate those words into something else (kjempelett→pærelet,
kjempegod→knippelgod?). Adding such pairs as lexicalised compounds in
the dictionaries will override dynamic compounding for those words.



[1] Dynamic compounding is when the analyser only contains the parts and
    guesses how they fit together; lexicalised compounds are those we
    spell out completely in the dictionary.


-- 
Kevin Brubeck Unhammer

GPG: 0x766AC60C


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
