Hi Maciej,

Stempel uses a pretrained heuristic. You can find a longer description at [1] and [2]. The specific reason for the problems you mentioned may be the smaller training dictionary used for the version embedded in Lucene; I honestly don't know. If you need exact stemming/lemmatization, take a look at dictionary-based methods: Morfologik or the tools listed at [3].
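If switching to a dictionary method is not an option, one possible workaround for the "ąć" outputs is to wrap the stemmer in a guard that falls back to the original word whenever the returned stem looks unrelated to the input. The sketch below is only an illustration: `StemGuard`, `guard`, and the `minSharedPrefix` threshold are hypothetical names, not part of Lucene or Stempel, and the stemmer is replaced by a stub map with made-up mappings so the example runs without any dependencies.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical guard around a stemmer; not part of Lucene or Stempel. */
final class StemGuard {

  /** Returns the length of the longest common prefix of a and b. */
  static int commonPrefixLength(CharSequence a, CharSequence b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }

  /**
   * Applies the stemmer and keeps its result only if it shares at least
   * minSharedPrefix leading characters with the input; otherwise returns
   * the input unchanged. This is a heuristic: it will also reject
   * legitimate stems that change the beginning of the word.
   */
  static String guard(Function<String, ? extends CharSequence> stemmer,
                      String word, int minSharedPrefix) {
    CharSequence stem = stemmer.apply(word);
    if (stem == null || stem.length() == 0) {
      return word; // the stemmer produced nothing usable
    }
    if (commonPrefixLength(word, stem) < minSharedPrefix) {
      return word; // the stem looks unrelated to the input
    }
    return stem.toString();
  }

  public static void main(String[] args) {
    // Stub standing in for a real stemmer; these mappings are invented for the demo.
    Map<String, String> stub = Map.of("joyce", "ąć", "domami", "dom", "xyz", "xyz");
    Function<String, String> stemmer = stub::get;
    for (String w : new String[] {"joyce", "domami", "xyz", "unknown"}) {
      System.out.println(w + " -> " + guard(stemmer, w, 2));
    }
  }
}
```

With the real class, a method reference such as `stemmer::stem` could presumably be passed in place of the stub, assuming `stem` accepts the word and returns a character sequence or null. The threshold of 2 is arbitrary and would need tuning on real data.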
Dawid

[1] http://www.getopt.org/stempel/
[2] https://lucene.apache.org/core/8_2_0/analyzers-stempel/index.html
[3] http://zil.ipipan.waw.pl/

On Tue, Sep 10, 2019 at 9:31 PM Maciej Gawinecki <mgawine...@gmail.com> wrote:
>
> Hi,
>
> I have just checked out the latest version of Lucene from the Git master branch.
>
> I have tried to stem a few words using StempelStemmer for Polish.
> However, it looks like it cannot handle some words properly, e.g.
>
> joyce -> ąć
> wielce -> ąć
> piwko -> ąć
> royce -> ąć
> pip -> ąć
> xyz -> xyz
>
> 1. I am surprised it cannot handle Polish words like wielce, piwko and
> royce. Is this a limitation of the stemming algorithm, of the training of
> the algorithm, or something else? The latter would help improve the
> situation. How can I improve that behaviour?
> 2. I am surprised that for non-Polish words it returns "ąć". I would
> expect that for words it has not been trained for it will return their
> original forms, as happens, for instance, when stemming words like
> "xyz".
>
> With kind regards,
> Maciej Gawinecki
>
> Here's a minimal example to reproduce the issue:
>
> package org.apache.lucene.analysis;
>
> import java.io.InputStream;
> import org.apache.lucene.analysis.stempel.StempelStemmer;
>
> public class Try {
>
>   public static void main(String[] args) throws Exception {
>     // Load the Polish stemmer table bundled with the Stempel analyzer.
>     InputStream stemmerTable = ClassLoader.getSystemClassLoader()
>         .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
>     StempelStemmer stemmer = new StempelStemmer(stemmerTable);
>     String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
>     for (String word : words) {
>       // Stem each word in turn and print it next to its stem.
>       System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));
>     }
>   }
>
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>