Hi, thanks for the feedback. I'm really only interested in getting convenient access to the dictionaries so that I can use them for lemmatization. For this particular task, I'm not using any other LanguageTool functionality, including the grammar rules.
So here is what I do:

- run a probabilistic POS tagger
- feed the tokens of my text to the LanguageTool tagger to get all dictionary entries
- find a match between the POS tag produced by the probabilistic tagger and the returned dictionary entries
- if there is a match, use the respective lemma

Matching is the most annoying part, because the tagset used by the probabilistic tagger may not be the same as the one used in the LanguageTool dictionary. So I now try three matching approaches:

- checking whether the POS tag from the tagger and the one from the dictionary are exactly the same
- checking whether the POS tag from the tagger is the same as the first element of the dictionary tag (splitting by ':')
- using mapping tables to map both the tag from the POS tagger and the tag from the dictionary to a coarse-grained scheme of word classes, and checking whether they match there

This seems to work quite OK. (A rough Java sketch of this matching cascade is appended below the quoted thread.)

Cheers,

-- Richard

On 27.01.2014, at 22:38, Marcin Miłkowski <list-addr...@wp.pl> wrote:

> Hello,
>
> On 2014-01-27 15:44, Richard Eckart de Castilho wrote:
>> Hello everybody,
>>
>> I may be totally wrong, but I believe the lemmatizers in LanguageTool are
>> implemented based on dictionaries. I suppose a dictionary entry would be
>> made up of a form, a lemma, and a POS tag.
>>
>> Assuming this is correct, is there a simple way to do a lookup in such a
>> dictionary?
>>
>> Also, is there a way to find out which tagsets are used by these
>> dictionaries (or maybe there is even some standard in LanguageTool, e.g.
>> verbs are always V and nouns are always N, or something like that)?
>>
>> I would like a method that accepts an inflected form and a POS tag and
>> returns a single lemma.
>>
>> Currently, I am doing this, but it seems a bit awkward:
>>
>> List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
>> AnalyzedSentence as = new AnalyzedSentence(
>>     rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
>> as = lang.getDisambiguator().disambiguate(as);
>> String best = getMostFrequentLemma(as.getTokens()[i]);
>>
>> In particular, I would like to use a different POS tagger. I have various
>> statistical POS taggers at my disposal that produce a single POS tag per
>> token - and that is what I want. The LanguageTool POS tagger produces
>> multiple unranked POS tags per token.
>
> Beware that statistical POS taggers will necessarily obfuscate
> non-grammatical material, as they try to guess the correct tags. This
> makes them quite useless for writing rules. We've been there, tried
> that. I haven't yet found a decent English POS tagger, for example,
> that would be useful.
>
> Note, however, that if you have frequency information, you can add it
> to your tagger dictionary. We can indeed do so using typing frequency
> lists, so you would be able to assign the most frequent lemma if you
> need to, I guess. The procedure is described here:
>
> http://wiki.languagetool.org/hunspell-support
>
> See under "including frequency data".
>
> Regards,
> Marcin
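For the archives, here is a minimal, self-contained sketch of the pipeline described above: look up the dictionary readings for a token via the LanguageTool Tagger, then run the three-step matching cascade against the tag from the statistical tagger. The class name DictionaryLemmatizer, the choice of AmericanEnglish, and the tiny mapping tables are placeholders I made up for illustration; the real tables depend on the two tagsets involved.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.Language;
import org.languagetool.language.AmericanEnglish;

public class DictionaryLemmatizer {

  // Placeholder coarse-grained mapping tables; fill these from the tagset
  // documentation of the statistical tagger and the LanguageTool dictionary.
  // Keys of DICT_TO_COARSE are full dictionary tags (which may contain ':').
  private static final Map<String, String> TAGGER_TO_COARSE = new HashMap<>();
  private static final Map<String, String> DICT_TO_COARSE = new HashMap<>();
  static {
    TAGGER_TO_COARSE.put("NN", "NOUN");
    TAGGER_TO_COARSE.put("VBD", "VERB");
    DICT_TO_COARSE.put("NN", "NOUN");
    DICT_TO_COARSE.put("VBD", "VERB");
  }

  private final Language lang = new AmericanEnglish();

  /**
   * Returns the lemma of the first dictionary reading whose POS tag
   * matches the tag from the statistical tagger, or null if none does.
   */
  public String lemmatize(String token, String taggerTag) throws IOException {
    AnalyzedTokenReadings readings =
        lang.getTagger().tag(Arrays.asList(token)).get(0);
    for (AnalyzedToken reading : readings) {
      String dictTag = reading.getPOSTag();
      // Readings for unknown words can carry null POS tags; skip those.
      if (dictTag != null && matches(taggerTag, dictTag)) {
        return reading.getLemma();
      }
    }
    return null;  // no match; the caller may fall back to the surface form
  }

  private boolean matches(String taggerTag, String dictTag) {
    // 1) exact match between the two tags
    if (taggerTag.equals(dictTag)) {
      return true;
    }
    // 2) tagger tag equals the first ':'-separated element of the dictionary tag
    if (taggerTag.equals(dictTag.split(":")[0])) {
      return true;
    }
    // 3) both tags map to the same coarse-grained word class
    String coarseFromTagger = TAGGER_TO_COARSE.get(taggerTag);
    String coarseFromDict = DICT_TO_COARSE.get(dictTag);
    return coarseFromTagger != null && coarseFromTagger.equals(coarseFromDict);
  }
}

Called as lemmatize("walked", "VBD"), this should return the dictionary lemma of the first reading that survives the cascade. Note the cascade is ordered from strictest to loosest, so an exact tag match always wins over a coarse-grained one.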