Re: Limitations of StempelStemmer

Martin Grigorov Tue, 24 Sep 2019 14:00:39 -0700

Hi,

On Tue, Sep 10, 2019, 22:31 Maciej Gawinecki wrote:


> Hi,
>
> I have just checked out the latest version of Lucene from Git master
> branch.
>
> I have tried to stem a few words using StempelStemmer for Polish.
> However, it looks like it cannot handle some words properly, e.g.
>
> joyce -> ąć
> wielce -> ąć
> piwko -> ąć
> royce -> ąć
> pip -> ąć
> xyz -> xyz
>
> 1. I am surprised that it cannot handle Polish words like wielce, piwko
> and royce. Is this a limitation of the stemming algorithm, of its
> training, or something else? Knowing which would help improve the
> situation. How can I improve that behaviour?
> 2. I am surprised that for non-Polish words it returns "ąć". I would
> expect that for words it has not been trained on it will return their
> original forms, as happens, for instance, when stemming words like
> "xyz".
>
> With kind regards,
> Maciej Gawinecki
>
> Here's a minimal example to reproduce the issue:
>
> package org.apache.lucene.analysis;
>
> import java.io.InputStream;
> import org.apache.lucene.analysis.stempel.StempelStemmer;
>
> public class Try {
>
>   public static void main(String[] args) throws Exception {
>     InputStream stemmerTable = ClassLoader.getSystemClassLoader()
>         .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
>     StempelStemmer stemmer = new StempelStemmer(stemmerTable);
>     String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
>     for (String word : words) {
>       System.out.println(String.format("%s -> %s", word,
>           stemmer.stem("piwko")));
>

You always pass "piwko" to stemmer.stem() instead of the loop variable word.

>     }
>
>   }
>
> }
>
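
A minimal sketch of the corrected loop, for illustration only. To keep it
runnable without Lucene on the classpath, the stemmer is passed in as a plain
Function<String, String>; the class name StemLoopSketch, the helper stemAll and
the stand-in stemmer are hypothetical and not part of Lucene. With Lucene
available, the function would wrap the real call instead, for example
w -> String.valueOf(stemmer.stem(w)).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class StemLoopSketch {

  // Builds one "word -> stem" line per input word. The loop variable itself
  // is passed to the stemmer (the quoted snippet passed the literal "piwko").
  public static List<String> stemAll(Function<String, String> stemmer, String... words) {
    List<String> lines = new ArrayList<>();
    for (String word : words) {
      lines.add(String.format("%s -> %s", word, stemmer.apply(word)));
    }
    return lines;
  }

  public static void main(String[] args) {
    // Hypothetical stand-in, NOT real stemming: drops one trailing vowel.
    // Assumption: with Lucene this would be replaced by a lambda around
    // StempelStemmer.stem(...), as described in the text above.
    Function<String, String> standInStemmer = w -> w.replaceAll("[aeiouy]$", "");
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    stemAll(standInStemmer, words).forEach(System.out::println);
  }
}
```

Since each word now reaches the stemmer, the printed stems reflect the actual
inputs rather than six copies of the stem of "piwko".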
