[
https://issues.apache.org/jira/browse/LUCENE-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341014#comment-15341014
]
Andriy Rysin commented on LUCENE-7348:
--------------------------------------
[~mikemccand] Hey Michael,
I've analyzed the inflection rules we have in the dict_uk project
(https://github.com/arysin/dict_uk). It has ~4500 inflection rules (most are
simple matches, but some are regexps), and together they cover almost all
possible affixes. I could probably drop the rare and homonymous ones to get
below 4k, but then the question is where to go next:
1) Keeping all the rules would be nice, as it would give high accuracy and a
high level of compatibility with the dictionary-based lemmatizer created in
LUCENE-7287 (we could probably even make a hybrid solution).
2) A smaller/simpler rule set would benefit performance (but to simplify it
properly we would have to analyze the frequency/importance of each rule).
3) Is lemmatizing analysis good enough, or is stemming preferred? For real
stemming we would have to work more on the rules to find the (pseudo)roots for
each inflection rule.
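To make the options above concrete, here is a rough sketch of what a hybrid
of option 1 could look like: try the dictionary first, then fall back to the
inflection rules for unknown words. Everything in it is an assumption made
for illustration: the three rules, the one-entry dictionary, and the names
RULES, DICTIONARY, apply_rules and lemmatize are hypothetical and are not
taken from dict_uk or from the LUCENE-7287 code.

```python
import re

# Hypothetical rules for illustration only; NOT taken from dict_uk.
# Each rule is (suffix pattern, replacement, is_regex).
RULES = [
    ("ою", "а", False),            # simple suffix match: книгою -> книга
    ("ами", "а", False),           # simple suffix match: книгами -> книга
    (r"([гкх])и", r"\1а", True),   # regexp rule: книги -> книга
]

# Toy stand-in for the dictionary-based lemmatizer (an assumption).
DICTIONARY = {"люди": "людина"}


def apply_rules(word):
    """Return the lemma produced by the first matching rule, or None."""
    for pattern, replacement, is_regex in RULES:
        if is_regex:
            # Anchor the regexp at the end of the word so it acts on suffixes.
            result, count = re.subn(pattern + "$", replacement, word)
            if count:
                return result
        elif word.endswith(pattern):
            return word[: -len(pattern)] + replacement
    return None


def lemmatize(word):
    """Hybrid lookup: dictionary first, then rules, else the word unchanged."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    return apply_rules(word) or word
```

In this sketch the rules are tried in list order, so dropping or reordering
rules (option 2) changes only the RULES table, not the lookup code. A real
rule set of ~4500 entries would likely need an index by suffix rather than a
linear scan.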
I looked at the existing light stemmers, and many are very basic. It looks
like we would be going in reverse, so I am trying to understand whether,
already having a complex solution, we want to make it simpler (it looks like
the only benefit would be performance). I also tried to google how to do
stemming "right", but nothing serious jumped out at me, especially nothing
applicable to Slavic languages.
Thanks.
> Add dynamic stemmer for Ukrainian
> ---------------------------------
>
> Key: LUCENE-7348
> URL: https://issues.apache.org/jira/browse/LUCENE-7348
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Andriy Rysin
> Priority: Minor
> Labels: analysis, language
>
> We're adding a dictionary-based lemmatizing analyzer for Ukrainian in
> https://issues.apache.org/jira/browse/LUCENE-7287.
> It would be nice to have a dynamic stemmer that can handle words that are not
> in the dictionary.