Kevin Donnelly <[email protected]> writes:

> Hi
>
> ::::On Wednesday 16 November 2011 Kevin Brubeck Unhammer said::::
>> just run the process described "offline" as a method of increasing
>> dictionary size.
>
> True, though the output will then depend on what was in your dictionary to
> begin with, and therefore be non-dynamic when meeting new words of the same
> pattern.

You mean what was in the _corpus_? Whatever method you use online would
depend as much or as little on the dictionary as whatever method you use
offline.

> Re difficulties with trivial stemming, I need to clarify that adding
> "small/rather" for -ito would not be my preferred option - I would in fact
> rather leave the diminutive aspect untranslated, since nuances like this
> present problems, but I mentioned inserting "small/rather" as a possibility.

Yeah … in sme→nob we just turn <n><der_dimin><n> into <n> :)

> That being said, however ... the attached file contains -ito items from the
> Miami and Patagonia corpora, along with the transcribed text and English
> translations where the researchers have completed them.[1]

Nice :)

> Inspecting actual
> data shows that in fact "small/rather" will match the meaning pretty well,
> apart from "tiempito" and "ratito". (And my first thought here would be to
> handle these in some sort of "smoothing" post-processor, where "a small time"
> gets rewritten to "a short while".)

If you already have a good translation, just adding it to the dictionary
would be the simplest and safest method. Then it's just a matter of
ensuring the PoS disambiguator blocks/removes derivations if there are
other analyses.

> So I would be more sanguine than KBU about the possibilities for limited
> deployment of additional lexemes in a trivial stemming approach.
>
>> "All productive derivations" includes derivations that change the
>> part-of-speech, and compounds, and derivations of derivations … If you
>> can have a diminutive of a deverbal noun, you have to think about how to
>> add 'small' in all rules, including those that originally were meant for
>> verbs (a transfer rule pattern matching "v.*" will match
>> "v.derivation.n.*").
>
> I don't really follow this, I'm afraid, since if you have a deverbal noun its
> predominant tag should surely be "noun", not "verb"? Is there a linguistic
> basis in Sámi for still considering a deverbative noun a verb?

The tag order is just because of how transducers work: read some letters,
match a verb stem, write a verb tag, read some more letters, find out it's
turned into a noun, write a derivation tag and a noun tag. In the sme
analyser, the word is considered to be the PoS which doesn't have any
derivational tags after it, and in the original use they would
change/remove the preceding PoS tags in post-processing.

Unfortunately, for translation, we have to look up using both the lemma and
the _original_ PoS tag (unless you want to add all possible deverbal nouns
to the bidix with noun tags, putting you back to square one; or not specify
PoS in the bidix, making PoS disambiguation useless). Similarly for target
language generation: we can't assume that the target language has the same
possibility of producing a deverbal noun using that same verb stem, so we
can use the target language lemma as a verb lemma or not at all.

So in apertium-sme-nob, we keep the PoS tag sequence; it's just the lesser
of evils.

-KBU

_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
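
A rough illustration of the tag-order point in the reply above, as a minimal
Python sketch. It assumes a heavily simplified matcher (literal tags plus an
optional trailing "*"), not Apertium's actual pattern engine. The function
names, the PoS inventory and the "der_nomact" tag are made up for the
example; "der_dimin" is the tag mentioned in the message.

```python
# Assumed PoS inventory for this sketch only.
POS_TAGS = {"n", "v", "adj", "adv"}


def matches(pattern, tags):
    """Simplified matcher: literal tags, optionally ending in a '*' wildcard.

    A trailing '*' accepts any (possibly empty) remainder of the tag sequence.
    This is an assumption for illustration, not Apertium's real semantics.
    """
    parts = pattern.split(".")
    if parts[-1] == "*":
        prefix = parts[:-1]
        return tags[:len(prefix)] == prefix
    return tags == parts


def final_pos(tags):
    """Return the last PoS tag, i.e. the one with no derivation tag after it."""
    pos = [t for t in tags if t in POS_TAGS]
    return pos[-1] if pos else None


# Tag sequences written in the order a transducer would emit them.
plain_verb = ["v", "ind", "prs", "p3", "sg"]
# Hypothetical diminutive of a deverbal noun: verb stem first, then derivations.
dimin_deverbal = ["v", "der_nomact", "n", "der_dimin", "n", "sg", "nom"]

print(matches("v.*", plain_verb))       # True
print(matches("v.*", dimin_deverbal))   # True: a rule meant for verbs fires
print(matches("n.*", dimin_deverbal))   # False: a noun rule misses it
print(final_pos(dimin_deverbal))        # n
```

The derived form is a noun by its final PoS tag, yet only the pattern written
for verbs matches it, which is why rules for verbs also have to account for
derivations when the tag sequence is kept.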
