Hi,

thanks for the feedback. I'm really only interested in getting convenient 
access to the dictionaries, so that I can use them for lemmatization. For this 
particular task, I'm not using any other functionality from LanguageTool, 
including grammatical rules.

So here is what I do:

- run a probabilistic POS tagger
- feed the tokens of my text to the languagetool tagger to get all dictionary 
entries
- find a match between the POS tag created by the probabilistic tagger and the 
returned dictionary entries
- if there is a match, use the respective lemma
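The steps above can be sketched roughly like this. This is a minimal, self-contained illustration, not the actual code: the `DictEntry` record and the plain map lookup stand in for the LanguageTool tagger calls, and `matches()` stands in for whichever matching strategy is used.

```java
import java.util.List;
import java.util.Map;

public class LemmaLookup {
    // Hypothetical dictionary entry: a surface form maps to (lemma, posTag) pairs.
    record DictEntry(String lemma, String posTag) {}

    // Stand-in for the matching step; here simply exact tag equality.
    static boolean matches(String taggerTag, String dictTag) {
        return taggerTag.equals(dictTag);
    }

    // Steps 2-4: look up all dictionary entries for the token and keep the
    // lemma whose dictionary tag matches the tag chosen by the probabilistic
    // tagger; fall back to the surface form if nothing matches.
    static String lemmatize(String token, String taggerTag,
                            Map<String, List<DictEntry>> dictionary) {
        for (DictEntry e : dictionary.getOrDefault(token, List.of())) {
            if (matches(taggerTag, e.posTag())) {
                return e.lemma();
            }
        }
        return token;
    }
}
```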

Matching is the most annoying part, because the tagset used by the 
probabilistic tagger may not be the same as the one used in the LanguageTool 
dictionary. So now I try three matching approaches:

- checking if the POS tag from the tagger and the one from the dictionary are 
exactly the same
- checking if the POS tag from the tagger is the same as the first element of 
the dictionary tag (splitting by ':')
- using mapping tables to map both the tag from the POS tagger and the tag 
from the dictionary to a coarse-grained scheme of word classes and see if they 
match there
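As a sketch, the three approaches boil down to three predicates like the following. The mapping tables in the third one are purely illustrative, not real tagsets:

```java
import java.util.Map;

public class TagMatch {
    // Approach 1: tagger tag and dictionary tag are identical.
    static boolean exact(String taggerTag, String dictTag) {
        return taggerTag.equals(dictTag);
    }

    // Approach 2: the tagger tag equals the first ':'-separated element of
    // the dictionary tag (e.g. "SUB" would match "SUB:NOM:SIN:MAS").
    static boolean firstElement(String taggerTag, String dictTag) {
        return taggerTag.equals(dictTag.split(":", 2)[0]);
    }

    // Approach 3: map both tags to a coarse word-class scheme via lookup
    // tables and compare there.
    static boolean coarse(String taggerTag, String dictTag,
                          Map<String, String> taggerToCoarse,
                          Map<String, String> dictToCoarse) {
        String a = taggerToCoarse.get(taggerTag);
        return a != null && a.equals(dictToCoarse.get(dictTag));
    }
}
```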

Seems to work quite ok.

Cheers,

-- Richard

On 27.01.2014, at 22:38, Marcin Miłkowski <list-addr...@wp.pl> wrote:

> Hello,
> 
> W dniu 2014-01-27 15:44, Richard Eckart de Castilho pisze:
>> Hello everybody,
>> 
>> I may be totally wrong, but I believe the lemmatizers in LanguageTool are 
>> implemented based on dictionaries. I suppose a dictionary entry would be 
>> made up of a form, a lemma, and a pos tag.
>> 
>> Assuming this is correct, is there a simple way to do a lookup in such a 
>> dictionary?
>> 
>> Also, is there a way to find out which tagsets are used by these 
>> dictionaries (or maybe there is even some standard in LanguageTool, e.g. 
>> verbs are always V and nouns are always N or something like that)?
>> 
>> I would like a method that accepts an inflected form and a pos tag and that 
>> returns a single lemma.
>> 
>> 
>> Currently, I am doing this, but it seems a bit awkward.
>> 
>> List<AnalyzedTokenReadings> rawTaggedTokens =
>>     lang.getTagger().tag(tokenText);
>> AnalyzedSentence as = new AnalyzedSentence(rawTaggedTokens.toArray(
>>     new AnalyzedTokenReadings[rawTaggedTokens.size()]));
>> as = lang.getDisambiguator().disambiguate(as);
>> String best = getMostFrequentLemma(as.getTokens()[i]);
>> 
>> In particular, I would like to use a different POS tagger. I have various 
>> statistical POS taggers at my disposal that produce a single POS per token - 
>> and that is what I want. The LanguageTool POS tagger produces multiple 
>> unranked POS tags per token.
> 
> Beware that statistical POS taggers will necessarily obfuscate 
> non-grammatical material, as they try to guess the correct tags. This 
> makes them quite useless for writing rules. We've been there, tried 
> that. I haven't yet found a decent English POS tagger, for example, that 
> would be useful.
> 
> Note however that if you have frequency info, you can add it to your 
> tagger dictionary. And we indeed can do so using typing frequency lists, 
> so you'd be able to assign the most frequent lemma if you need, I guess. 
> The procedure is described here:
> 
> http://wiki.languagetool.org/hunspell-support
> 
> See under "including frequency data".
> 
> Regards,
> Marcin


_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel