Kevin Donnelly <[email protected]> writes:

> Hi
>
> ::::On Wednesday 16 November 2011 Kevin Brubeck Unhammer said::::
>> just run the process described "offline" as a method of increasing 
>> dictionary size.
>
> True, though the output will then depend on what was in your dictionary to 
> begin with, and therefore be non-dynamic when meeting new words of the same 
> pattern.

You mean what was in the _corpus_? Whatever method you use online would
depend as much or as little on the dictionary as whatever method you use
offline.

> Re difficulties with trivial stemming, I need to clarify that adding 
> "small/rather"  for -ito would not be my preferred option - I would in fact 
> rather leave the diminutive aspect untranslated, since nuances like this 
> present problems, but I mentioned inserting "small/rather" as a possibility.

Yeah … in sme→nob we just turn <n><der_dimin><n> into <n> :)
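
To make that concrete, here is a minimal sketch of the tag collapse on a
single analysis string. It is only an illustration in Python, not how the
language pair implements it; the function name and the placeholder lemma
are made up, and only the tag sequence comes from the sentence above.

```python
def drop_diminutive(analysis: str) -> str:
    """Collapse <n><der_dimin><n> into a plain <n>.

    Hypothetical helper: the diminutive nuance is simply discarded, so
    the word is then treated like its underived base noun.
    """
    return analysis.replace("<n><der_dimin><n>", "<n>")


# "lemma" and the trailing <sg><nom> tags are placeholders.
collapsed = drop_diminutive("lemma<n><der_dimin><n><sg><nom>")
unchanged = drop_diminutive("lemma<n><sg><nom>")
```

Analyses without the diminutive sequence pass through untouched.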

> That being said, however ... the attached file contains -ito items from the 
> Miami and Patagonia corpora, along with the transcribed text and English 
> translations where the researchers have completed them.[1]  

Nice :) 

> Inspecting actual 
> data shows that in fact "small/rather" will match the meaning pretty well, 
> apart from "tiempito" and "ratito".  (And my first thought here would be to 
> handle these in some sort of "smoothing" post-processor, where "a small time" 
> gets rewritten to "a short while".)

If you already have a good translation, just adding it to the dictionary
would be the simplest and safest method. Then it's just a matter of
ensuring the PoS disambiguator blocks/removes derivations if there are
other analyses.
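
A rough sketch of that blocking step, in Python for illustration only (in
a real pipeline the disambiguator would do this). The "der_" prefix for
derivation tags and the Spanish analyses are assumptions made for the
example, not taken from any actual analyser.

```python
import re

# Assumption: derivation tags all look like <der_...>.
DERIVATION_TAG = re.compile(r"<der_[^>]*>")


def prefer_underived(analyses):
    """Drop derived analyses whenever an underived one exists.

    If every analysis is a derivation, keep them all, so words that are
    not in the dictionary can still be handled productively.
    """
    underived = [a for a in analyses if not DERIVATION_TAG.search(a)]
    return underived if underived else list(analyses)


# Hypothetical case: "ratito" has its own dictionary entry.
with_entry = prefer_underived(
    ["ratito<n><m><sg>", "rato<n><der_dimin><n><m><sg>"]
)
# Hypothetical case: only the productive derivation is available.
derivation_only = prefer_underived(["perro<n><der_dimin><n><m><sg>"])
```

With the dictionary entry present, only the lexicalised analysis
survives; without it, the derivation is kept.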

> So I would be more sanguine than KBU about the possibilities for limited 
> deployment of additional lexemes in a trivial stemming approach.
>
>> "All productive derivations" includes derivations that change the
>> part-of-speech, and compounds, and derivations of derivations … If you
>> can have a diminutive of a deverbal noun, you have to think about how to
>> add 'small' in all rules, including those that originally were meant for
>> verbs (a transfer rule pattern matching "v.*" will match
>> "v.derivation.n.*").
>
> I don't really follow this, I'm afraid, since if you have a deverbal noun its 
> predominant tag should surely be "noun", not "verb"?  Is there a linguistic 
> basis in Sámi for still considering a deverbative noun a verb?

The tag order is just because of how transducers work: read some
letters, match a verb stem, write a verb tag, read some more letters,
find out it's turned into a noun, write a derivation tag and a noun tag.
In the sme analyser, the word is considered to be the PoS which doesn't
have any derivational tags after it, and in the original use they would
change/remove the preceding PoS tags in post-processing. Unfortunately,
for translation, we have to look up using both the lemma and the
_original_ PoS tag (unless you want to add all possible deverbal nouns
in the bidix with noun tags, putting you back to square one; or not
specify PoS in bidix, making PoS disambiguation useless). Similarly for
target language generation: we can't assume that the target language has
the same possibility of producing a deverbal noun using that same verb
stem; we can use the target language lemma as a verb lemma or not at
all. So in apertium-sme-nob we keep the PoS tag sequence. It's just the
lesser of evils.
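
Here is a small Python sketch of the two readings of such a tag sequence,
and of why a rule pattern written for verbs also catches deverbal nouns.
The PoS tag set, the "derivation" tag name and the helper functions are
assumptions for illustration; a real transfer pattern is not implemented
this way.

```python
import re

# Assumption: a small illustrative set of part-of-speech tags.
POS_TAGS = {"n", "v", "adj", "adv"}


def tags(analysis):
    """Return the tag names of an analysis, in order."""
    return re.findall(r"<([^>]+)>", analysis)


def lookup_pos(analysis):
    """First PoS tag: the one that belongs with the lemma in the bidix."""
    return next(t for t in tags(analysis) if t in POS_TAGS)


def surface_pos(analysis):
    """Last PoS tag: the one with no derivational tags after it."""
    return [t for t in tags(analysis) if t in POS_TAGS][-1]


def matches_prefix(analysis, first_tag):
    """Naive stand-in for a 'v.*'-style pattern: only the first tag counts."""
    found = tags(analysis)
    return bool(found) and found[0] == first_tag


# Placeholder deverbal noun, mirroring "v.derivation.n.*" above.
deverbal = "lemma<v><derivation><n><sg><nom>"
first, last = lookup_pos(deverbal), surface_pos(deverbal)
caught_by_verb_rule = matches_prefix(deverbal, "v")
```

The lemma is looked up as a verb, the word behaves as a noun, and a rule
that only checks for a leading verb tag fires on it anyway, so such rules
have to take the derivation into account.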


-KBU


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
