[jira] [Created] (SOLR-11835) Adjust instructions for Ukrainian on LanguageAnalysis page
Andriy Rysin created SOLR-11835: --- Summary: Adjust instructions for Ukrainian on LanguageAnalysis page Key: SOLR-11835 URL: https://issues.apache.org/jira/browse/SOLR-11835 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Andriy Rysin Since Lucene 6.6 the dictionary for Ukrainian analyzer contains all proper names in lowercase, this seems to be much better way to have it for searching. Can we please move LowerCaseFilterFactory back before MorfologikFilterFactory at https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-Ukrainian? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
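A minimal sketch of why the filter order matters, written in plain Python rather than as Solr configuration. The toy dictionary, its entries, and the function names are assumptions for illustration; the only fact taken from the issue is that the dictionary keys are lowercase.

```python
# Toy stand-in for the Ukrainian dictionary (hypothetical entries).
# All keys are lowercase, as the issue says is the case since Lucene 6.6.
LOWERCASE_DICTIONARY = {
    "києва": "київ",   # an inflected form of the proper name
    "київ": "київ",
    "міста": "місто",
}


def lookup_lemma(token):
    """Return the dictionary lemma, or the token unchanged if it is unknown."""
    return LOWERCASE_DICTIONARY.get(token, token)


def analyze(tokens, lowercase_first):
    """Optionally lowercase the tokens, then run the dictionary lookup."""
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    return [lookup_lemma(t) for t in tokens]


if __name__ == "__main__":
    # Lowercasing first lets the capitalized proper name hit the dictionary.
    print(analyze(["Києва"], lowercase_first=True))
    # Without it the capitalized form is not found and passes through as-is.
    print(analyze(["Києва"], lowercase_first=False))
```

With a lowercase-only dictionary, a capitalized token is only lemmatized when the lowercase step runs before the lookup, which is the ordering the issue asks the guide to show.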
[jira] [Commented] (LUCENE-7973) Update dictionary version for Ukrainian analyzer to 3.9.0
[ https://issues.apache.org/jira/browse/LUCENE-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195876#comment-16195876 ] Andriy Rysin commented on LUCENE-7973: -- It looks like I need to remove my 3 pull requests from above now, right? > Update dictionary version for Ukrainian analyzer to 3.9.0 > - > > Key: LUCENE-7973 > URL: https://issues.apache.org/jira/browse/LUCENE-7973 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Andriy Rysin >Assignee: Dawid Weiss >Priority: Minor > Fix For: 7.1 > > > Update morfologik dictionary version to 3.9.0 for Ukrainian analyzer. > There's 60K of new lemmas there along with some other improvements and fixes, > particularly Ukrainian town names have been synchronized with official > standard.
[jira] [Created] (LUCENE-7973) Update dictionary version for Ukrainian analyzer
Andriy Rysin created LUCENE-7973: Summary: Update dictionary version for Ukrainian analyzer Key: LUCENE-7973 URL: https://issues.apache.org/jira/browse/LUCENE-7973 Project: Lucene - Core Issue Type: Improvement Reporter: Andriy Rysin Update the morfologik dictionary version to 3.9.0 for the Ukrainian analyzer. It adds 60K new lemmas along with some other improvements and fixes; in particular, Ukrainian town names have been synchronized with the official standard.
[jira] [Commented] (LUCENE-7841) Normalize ґ to г in Ukrainian analyzer
[ https://issues.apache.org/jira/browse/LUCENE-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021165#comment-16021165 ] Andriy Rysin commented on LUCENE-7841: -- Thanks Dawid, I've pushed the checksum and change file changes on all 3 tracks, `ant precommit` runs clean now. > Normalize ґ to г in Ukrainian analyzer > -- > > Key: LUCENE-7841 > URL: https://issues.apache.org/jira/browse/LUCENE-7841 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Andriy Rysin >Priority: Minor > > Letter ґ was re-introduced into Ukrainian alphabet in 1990 and many Ukrainian > texts don't use this letter consistently so the search will benefit if we > normalize it to г.
[jira] [Created] (LUCENE-7841) Normalize ґ to г in Ukrainian analyzer
Andriy Rysin created LUCENE-7841: Summary: Normalize ґ to г in Ukrainian analyzer Key: LUCENE-7841 URL: https://issues.apache.org/jira/browse/LUCENE-7841 Project: Lucene - Core Issue Type: Improvement Reporter: Andriy Rysin Priority: Minor The letter ґ was re-introduced into the Ukrainian alphabet in 1990, and many Ukrainian texts don't use it consistently, so search will benefit if we normalize it to г.
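The normalization proposed here can be sketched as a plain character mapping. This is a Python illustration, not the analyzer's code; the function name is hypothetical, and mapping the uppercase Ґ as well is an assumption that the issue text does not state.

```python
# Character table for the normalization described in the issue.
# The uppercase pair is an assumption added for completeness.
GHE_TABLE = str.maketrans({"ґ": "г", "Ґ": "Г"})


def normalize_ghe(text):
    """Replace ґ with г so both spellings of a word index to the same term."""
    return text.translate(GHE_TABLE)


if __name__ == "__main__":
    # Both spellings collapse to the same string after normalization.
    print(normalize_ghe("ґанок") == normalize_ghe("ганок"))
```

Applying the same mapping at index time and at query time is what makes the two spellings match each other.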
[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
[ https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974717#comment-15974717 ] Andriy Rysin commented on LUCENE-7785: -- Thanks Dawid! Thanks everybody for your help and feedback! > Move dictionary for Ukrainian analyzer to external dependency > - > > Key: LUCENE-7785 > URL: https://issues.apache.org/jira/browse/LUCENE-7785 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Andriy Rysin >Assignee: Dawid Weiss >Priority: Minor > Fix For: 6.x, master (7.0), 6.6 > > > Currently the dictionary for Ukrainian analyzer is a blob in the source tree. > We should move it out to external dependency, this allows: > * to have less binaries in the source > * easier to update the dictionary and track updates
[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
[ https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969418#comment-15969418 ] Andriy Rysin commented on LUCENE-7785: -- `ant precommit` is happy now
[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
[ https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969405#comment-15969405 ] Andriy Rysin commented on LUCENE-7785: -- Ahh, I see what you mean, I'll push the fix for the order once my `ant precommit` succeeds.
[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
[ https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969389#comment-15969389 ] Andriy Rysin commented on LUCENE-7785: -- Here's what I get:
check-lib-versions:
[echo] Lib versions check under: /home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/..
[libversions] :: loading settings :: file = /home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/top-level-ivy-settings.xml
[libversions] OUT-OF-ORDER coordinate key '/org.ccil.cowan.tagsoup/tagsoup' in ivy-versions.properties
[libversions] Checked that ivy-versions.properties and ivy-ignore-conflicts.properties have lexically sorted '/org/name' keys and no duplicates or orphans.
[libversions] Scanned 46 ivy.xml files for rev="${/org/name}" format.
[libversions] Found 0 indirect dependency version conflicts.
[libversions] Completed in 1.24s., 1 error(s).
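The OUT-OF-ORDER error in this output comes from a check that the '/org/name' keys are lexically sorted. A rough Python sketch of that kind of check follows; it is not the build's actual implementation. The function name, the sample file content, and the plain string comparison are all assumptions, and only the tagsoup key is taken from the log.

```python
def out_of_order_keys(properties_text):
    """Return every '/org/name' key that sorts before the key preceding it."""
    keys = []
    for line in properties_text.splitlines():
        line = line.strip()
        # Only lines that start with '/' are treated as coordinate keys here.
        if not line.startswith("/"):
            continue
        keys.append(line.split("=", 1)[0].strip())
    return [cur for prev, cur in zip(keys, keys[1:]) if cur < prev]


# Hypothetical file content: a new entry was added above the tagsoup key
# instead of in its sorted position.
SAMPLE = """\
/example.group/alpha = 1.0
/zzz.example/new-dictionary = 3.7.5
/org.ccil.cowan.tagsoup/tagsoup = 1.2.1
"""

if __name__ == "__main__":
    print(out_of_order_keys(SAMPLE))
```

In a situation like this the reported key is the one that follows a misplaced entry, so the fix is usually to move the newly added line rather than the key named in the message.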
[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
[ https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969367#comment-15969367 ] Andriy Rysin commented on LUCENE-7785: -- Ok, thanks for the suggestions, I was able to run `ant precommit` and I've added/adjusted the files to make it happier. It still fails for me, but due to some issue with `/org.ccil.cowan.tagsoup/tagsoup`; hopefully the files related to this issue are good now.
[jira] [Created] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency
Andriy Rysin created LUCENE-7785: Summary: Move dictionary for Ukrainian analyzer to external dependency Key: LUCENE-7785 URL: https://issues.apache.org/jira/browse/LUCENE-7785 Project: Lucene - Core Issue Type: Improvement Reporter: Andriy Rysin Currently the dictionary for the Ukrainian analyzer is a blob in the source tree. We should move it out to an external dependency. This would: * leave fewer binaries in the source tree * make it easier to update the dictionary and track updates
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628837#comment-15628837 ] Andriy Rysin commented on LUCENE-7287: -- Cassandra, it looks like 6.2 is out; could you please add a Ukrainian section to https://cwiki.apache.org/confluence/display/solr/Language+Analysis ? > New lemma-tizer plugin for ukrainian language. > -- > > Key: LUCENE-7287 > URL: https://issues.apache.org/jira/browse/LUCENE-7287 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Dmytro Hambal >Priority: Minor > Labels: analysis, language, plugin > Fix For: master (7.0), 6.2 > > Attachments: LUCENE-7287.patch, Screen Shot 2016-06-23 at 8.23.01 > PM.png, Screen Shot 2016-06-23 at 8.41.28 PM.png > > > Hi all, > I wonder whether you are interested in supporting a plugin which provides a > mapping between ukrainian word forms and their lemmas. Some tests and docs go > out-of-the-box =) . > https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer > It's really simple but still works and generates some value for its users. > More: https://github.com/elastic/elasticsearch/issues/18303
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365063#comment-15365063 ] Andriy Rysin commented on LUCENE-7287: -- Thanks Michael, much appreciated!
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358188#comment-15358188 ] Andriy Rysin commented on LUCENE-7287: -- Hey [~mikemccand], can we please merge the pull request above? That should wrap up the dictionary-based analyzer for Ukrainian. Thanks!
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348927#comment-15348927 ] Andriy Rysin commented on LUCENE-7287: -- Ok, I was able to run solr with the Ukrainian analyzer and I can confirm it generates unique lemmas. I've created a pull request https://github.com/apache/lucene-solr/pull/45 I've also added mapping_uk.txt so we can use the mapping filter in solr; once it's merged we can add this line: We could potentially change UkrainianMorfologikAnalyzer to use MappingCharFilterFactory to read from the same file (so we don't have the mapping both in the code and in the file), but I'm not sure how appropriate it is to use factories in lucene. Many thanks to Ahmet, who helped with the solr integration and found the duplicate tokens!
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348585#comment-15348585 ] Andriy Rysin commented on LUCENE-7287: -- I've created a dictionary that collapses token+lemma into one record (like the Polish dictionary does) and added tests to make sure we don't generate duplicate lemmas. I'll do a bit more testing and then create a pull request.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347153#comment-15347153 ] Andriy Rysin commented on LUCENE-7287: -- Ok, then I'll prepare the changes as part of this ticket. I've looked deeper into the morfologik dictionaries we have in LanguageTool: the Polish one has token+lemma normalized (with POS tags concatenated for each unique token+lemma), while the other dictionaries, including Ukrainian, have separate records, so token+lemma is not unique. I've sent an email to the morfologik guys and once I get an explanation I'll update the dictionary appropriately so we don't have duplicates.
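The difference between the two dictionary layouts can be sketched in Python as follows. This is an illustration only: the record tuples, the POS tag strings, the `+` separator, and the function name are assumptions, not the actual morfologik file format.

```python
def collapse_records(records, tag_separator="+"):
    """Merge records sharing (form, lemma) into one record with joined tags."""
    merged = {}
    for form, lemma, tag in records:
        tags = merged.setdefault((form, lemma), [])
        if tag not in tags:
            tags.append(tag)
    # Dicts keep insertion order, so records come out in first-seen order.
    return [(form, lemma, tag_separator.join(tags))
            for (form, lemma), tags in merged.items()]


# Hypothetical POS-dictionary records: the same form+lemma pair appears
# twice because the POS tags differ.
RECORDS = [
    ("книги", "книга", "noun:f:v_rod"),
    ("книги", "книга", "noun:p:v_naz"),
    ("книга", "книга", "noun:f:v_naz"),
]

if __name__ == "__main__":
    for record in collapse_records(RECORDS):
        print(record)
```

After collapsing, each form+lemma pair occurs once, so a lookup that returns one lemma per record no longer yields the same lemma twice.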
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346878#comment-15346878 ] Andriy Rysin commented on LUCENE-7287: -- Hmm, that does not look right. Yes, we can either use RemoveDuplicatesTokenFilterFactory (we'd have to add that to the UkrainianMorfologikAnalyzer too) or I can rebuild the dictionary to remove the duplicates (probably the preferred way). The problem is that the dictionary is currently a POS dictionary, so there may be duplicate lemma records as long as the POS tags are different. I'm thinking of filing a new jira issue for that and providing a pull request; does that make sense?
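The filter-side option mentioned here amounts to dropping a token whose text repeats at the same position. A simplified Python sketch of that idea is below; it models tokens as hypothetical (position, text) pairs and is not the Lucene filter itself.

```python
def remove_duplicate_tokens(tokens):
    """Drop tokens whose text was already emitted at the same position."""
    seen = set()
    unique = []
    for position, text in tokens:
        if (position, text) in seen:
            continue
        seen.add((position, text))
        unique.append((position, text))
    return unique


if __name__ == "__main__":
    # Two dictionary records with different POS tags gave the same lemma
    # at position 0; only one copy should remain.
    stream = [(0, "книга"), (0, "книга"), (1, "стіл")]
    print(remove_duplicate_tokens(stream))
```

Rebuilding the dictionary removes the duplicates at the source, whereas a filter like this cleans them up at analysis time.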
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346862#comment-15346862 ] Andriy Rysin commented on LUCENE-7287: -- Thanks Ahmet! Shall I create mappings_uk.txt so we can use it in solr? As for the multiple tokens, MorfologikFilter produces lemmas, so (as I understand it) it may output multiple tokens for a single input token.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346413#comment-15346413 ] Andriy Rysin commented on LUCENE-7287: -- Sure, I can add a comment, but I guess I need to test the solution first, and as I am not familiar with solr it may take me a few days. Unless [~iorixxx] has already verified this solution, in which case we can just post it.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344396#comment-15344396 ] Andriy Rysin commented on LUCENE-7287: -- I've logged into cwiki but I don't seem to have rights to edit the page.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344252#comment-15344252 ] Andriy Rysin commented on LUCENE-7287: -- Thanks Ahmet, that looks good! Would you add/push those changes or shall I work on this?
[jira] [Comment Edited] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343258#comment-15343258 ] Andriy Rysin edited comment on LUCENE-7287 at 6/22/16 3:07 AM: --- I don't know much about solr, but I think MorfologikFilterFactory uses dictionary= parameter instead of dictionary-resource= https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html Also would that mean that we don't get the stop words filter and apostrophe/stress character normalization? was (Author: arysin): I don't know much about solr, but I think MorfologikFilterFactory uses dictionary= parameter instead of dictionary-resource= https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343258#comment-15343258 ] Andriy Rysin commented on LUCENE-7287: -- I don't know much about solr, but I think MorfologikFilterFactory uses dictionary= parameter instead of dictionary-resource= https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html
[jira] [Commented] (LUCENE-7348) Add dynamic stemmer for Ukrainian
[ https://issues.apache.org/jira/browse/LUCENE-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341014#comment-15341014 ] Andriy Rysin commented on LUCENE-7348: -- [~mikemccand] Hey Michael, I've analyzed the inflection rules we have in the dict_uk project (https://github.com/arysin/dict_uk) and it has ~4500 inflection rules (most of those are simple matches but some are regexps). Those rules cover almost all possible affixes. I can probably drop the rare and homonymous ones to bring it below 4k, but then the question is where to go next: 1) having all the rules would be nice, as it would provide high accuracy and a high level of compatibility with the dictionary-based lemmatizer created in LUCENE-7287 (we could probably even make a hybrid solution) 2) a smaller/simpler rule set would benefit performance (but to simplify it properly we would have to analyze the frequency/importance of each rule) 3) is lemmatizing analysis good enough, or is stemming preferred? For real stemming we would have to work more on the rules to find the (pseudo)roots for each inflection rule. I tried to look at the existing light stemmers and many are very basic. It looks like we're going in reverse, and I am trying to understand whether, already having a complex solution, we want to make it simpler (it looks like the only benefit would be performance). I also tried to google how to do the stemming "right" but found nothing serious, especially nothing applicable to Slavic languages. Thanks. > Add dynamic stemmer for Ukrainian > - > > Key: LUCENE-7348 > URL: https://issues.apache.org/jira/browse/LUCENE-7348 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/analysis >Reporter: Andriy Rysin >Priority: Minor > Labels: analysis, language > > We're adding a dictionary based lemmatizing analyzer for Ukrainian in > https://issues.apache.org/jira/browse/LUCENE-7287. > It would be nice to have a dynamic stemmer that can handle words that are not > in the dictionary.
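The hybrid idea in option 1 (dictionary first, inflection rules for words the dictionary does not know) could look roughly like the Python sketch below. The three suffix rules and the one-entry dictionary are toy assumptions and are not taken from dict_uk; a real rule set would be far larger and would need to handle ambiguity.

```python
# Toy dictionary of known forms (hypothetical content).
HYBRID_DICTIONARY = {"книгами": "книга"}

# Toy suffix rules as (suffix, replacement), longest suffix first.
# These are illustrative only and over-generate on real text.
HYBRID_SUFFIX_RULES = [("ами", "а"), ("ою", "а"), ("и", "а")]


def hybrid_lemmatize(token):
    """Use the dictionary when possible, else the first matching suffix rule."""
    token = token.lower()
    if token in HYBRID_DICTIONARY:
        return HYBRID_DICTIONARY[token]
    for suffix, replacement in HYBRID_SUFFIX_RULES:
        # Require at least two characters of stem before the suffix.
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[:-len(suffix)] + replacement
    return token


if __name__ == "__main__":
    print(hybrid_lemmatize("книгами"))  # found in the dictionary
    print(hybrid_lemmatize("хмарою"))   # handled by a suffix rule
    print(hybrid_lemmatize("кіт"))      # no rule applies, returned unchanged
```

The trade-off raised in the comment shows up directly here: every extra rule adds coverage for out-of-dictionary words but also adds work per token and more chances of a wrong match.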
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
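The "hybrid solution" mentioned in the comment on LUCENE-7348 could look roughly like the sketch below: try the dictionary first, and fall back to regexp-based inflection rules only for words the dictionary does not know. The dictionary entries, the three rules, and the function name are illustrative assumptions; they are not taken from dict_uk or from Lucene.

```python
import re

# Hypothetical dictionary of known word forms -> lemma
# (a tiny stand-in for the real morfologik dictionary).
DICTIONARY = {
    "книги": "книга",
    "книзі": "книга",
}

# Hypothetical inflection rules: (pattern on the word form, lemma ending).
# The first matching rule wins, so longer endings are listed first.
RULES = [
    (re.compile(r"ами$"), "а"),  # instrumental plural:   "мапами" -> "мапа"
    (re.compile(r"ою$"), "а"),   # instrumental singular: "мапою"  -> "мапа"
    (re.compile(r"у$"), "а"),    # accusative singular:   "мапу"   -> "мапа"
]


def lemmatize(word: str) -> str:
    """Return a lemma: dictionary lookup first, rule-based fallback second."""
    form = word.lower()
    if form in DICTIONARY:
        return DICTIONARY[form]
    for pattern, lemma_ending in RULES:
        if pattern.search(form):
            return pattern.sub(lemma_ending, form)
    # No dictionary entry and no matching rule: leave the token unchanged.
    return form
```

With ~4500 real rules, rule ordering and rule frequency would matter far more than in this toy list, which is the trade-off raised in points 1) and 2).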
[jira] [Created] (LUCENE-7348) Add dynamic stemmer for Ukrainian
Andriy Rysin created LUCENE-7348:

Summary: Add dynamic stemmer for Ukrainian
Key: LUCENE-7348
URL: https://issues.apache.org/jira/browse/LUCENE-7348
Project: Lucene - Core
Issue Type: New Feature
Components: modules/analysis
Reporter: Andriy Rysin
Priority: Minor

We're adding a dictionary based lemmatizing analyzer for Ukrainian in https://issues.apache.org/jira/browse/LUCENE-7287.
It would be nice to have a dynamic stemmer that can handle words that are not in the dictionary.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340996#comment-15340996 ]

Andriy Rysin commented on LUCENE-7287:

Looks cool, thanks a lot Michael! I wonder if we should add a little javadoc to this analyzer saying that it's dictionary-based, so that if we add a light-stemming analyzer users can easily tell the difference.

Also, since I created the project I've updated the dictionary once (https://github.com/arysin/lucene_uk/commit/7cc8bea59c402e9b9729afd63d0a53cb34045e750); I'm not sure if you're using the latest update.

I'll open another issue for the "light" stemmer for Ukrainian.

> New lemma-tizer plugin for ukrainian language.
>
> Key: LUCENE-7287
> URL: https://issues.apache.org/jira/browse/LUCENE-7287
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Dmytro Hambal
> Priority: Minor
> Labels: analysis, language, plugin
> Attachments: LUCENE-7287.patch
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a
> mapping between ukrainian word forms and their lemmas. Some tests and docs go
> out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337982#comment-15337982 ]

Andriy Rysin commented on LUCENE-7287:

I guess it does not fit under analysis/common as it depends on Morfologik, so analysis/ukrainian is probably a good place. Or we could put it under analysis/morfologik (as a .uk subpackage); it's your call. If we do that, will the stopwords go with the stemmer, or should they live under common/ (as they are not morfologik-specific and may be used for other Ukrainian implementations)?

I am also wondering whether we could build a generic stemmer for Ukrainian based on the affix rules we have in the dict_uk project (they are hunspell-like but fully based on regular expressions, which makes them way more compact).
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333888#comment-15333888 ]

Andriy Rysin commented on LUCENE-7287:

[~mikemccand], [~iorixxx] does this implementation look good enough for inclusion? Is there anything else that needs to be done? Thanks.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326753#comment-15326753 ]

Andriy Rysin commented on LUCENE-7287:

Thanks for the hint, I've changed the code to use MappingCharFilter. It's slightly slower but architecturally more correct.
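As a language-neutral illustration of what a character-level mapping step does before tokenization, here is a small Python sketch. It is not the Lucene MappingCharFilter API; the exact set of characters (U+2019, U+02BC and the combining acute accent U+0301 used as a stress mark) is an assumption based on the "apostrophes and stress symbol" wording in this thread.

```python
# Assumed apostrophe look-alikes to fold into the ASCII apostrophe.
APOSTROPHE_VARIANTS = "\u2019\u02bc"
# Combining acute accent, often used to mark stress in Ukrainian text.
STRESS_MARK = "\u0301"

# Translation table: apostrophe variants -> "'", stress mark -> removed.
NORMALIZATION_TABLE = str.maketrans(
    {**{ch: "'" for ch in APOSTROPHE_VARIANTS}, STRESS_MARK: None}
)


def normalize(text: str) -> str:
    """Normalize characters before the text reaches the tokenizer."""
    return text.translate(NORMALIZATION_TABLE)
```

Doing this on the character stream, rather than in a token filter, means the tokenizer and the dictionary lookup both see a single apostrophe form, which is presumably why the char-filter approach was called architecturally more correct.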
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326624#comment-15326624 ]

Andriy Rysin commented on LUCENE-7287:

I've added a token filter for Unicode apostrophes and the stress symbol.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326066#comment-15326066 ]

Andriy Rysin commented on LUCENE-7287:

Ok, guys, I've created a small project with a Ukrainian analyzer for Lucene using MorfologikAnalyzer: https://github.com/arysin/lucene_uk

The test (https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/apache/lucene/analysis/uk/TestUkrainianAnalyzer.java) runs successfully inside Lucene, but I can't run it in my project (I'm getting an NPE at RunListenerPrintReproduceInfo.java:131). I can run a simple standalone test app with no problem, though: https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/lucene_uk/test/LuceneTest.java

For simplicity, for now I just included the Ukrainian binary morfologik dictionary in the project itself. The only currently published artifact with the Ukrainian dictionary is http://mvnrepository.com/artifact/org.languagetool/language-uk, but it requires languagetool-core, and dragging that into Lucene probably does not make sense. If the PoC is good enough I can take a shot at creating a separate artifact with just the dictionary (this may take some time), or we can just live with the blob in Lucene.

I would appreciate it if you could take a look and let me know how it looks. If it's acceptable, I would need to work on including some of the goodies from Dmytro's project: handling different apostrophes and ignoring the accent character.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323800#comment-15323800 ]

Andriy Rysin commented on LUCENE-7287:

Ok, I've imported lucene-solr and the Ukrainian analyzer project from [~mr_gambal] into Eclipse and looked through the code. Unfortunately we can't use the whole morfologik package as is: it's very specific to Polish. We could still probably use part of morfologik for a compact dictionary representation. The whole Ukrainian dictionary in this format with POS tags is ~1.6MB, compared to 98MB in CSV, and we could probably make it smaller if we strip the tags.

There are several things I'd like to note:
1) this dictionary is for inflections (not related words), so this stemming will be producing lemmas, not quite root words (this is probably ok and in some cases even better?)
2) as this is dictionary-based stemming, it won't stem unknown words (but the dictionary contains ~200K lemmas, so it should give good output)
3) as Ukrainian has a high level of inflection (nouns produce up to 7 forms, direct verbs up to 20, reverse verbs up to 30 forms) with many rules and exceptions, developing quality rule-based stemming will not be trivial
4) I was planning to work on the Ukrainian analyzer in a separate project, but if it's better for the review process I can fork lucene-solr and work inside the fork
5) I am thinking of creating org.apache.lucene.analysis.uk classes based on [~mr_gambal]'s work and the CSV file we have, and once that's working, trying a more compact representation

The question: once we have it working, shall we include the dictionary in the Lucene project or make it an external dependency (like with morfologik-polish.jar)? The first is simpler, but the second will allow easy updates of the dictionary (which I can see being actively developed for another year or two) and will also keep the binary blob out of the project. I am leaning towards the second but am open to discussion.
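Points 1) and 2) above describe how a purely dictionary-based lemmatizer behaves. The sketch below illustrates that behaviour with a tiny in-memory CSV; the two-column "form,lemma" layout and the sample rows are assumptions for illustration and do not reflect the real dictionary file format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-column CSV: word form, lemma.
SAMPLE_CSV = """\
книги,книга
книзі,книга
три,три
три,терти
"""


def load_dictionary(csv_text: str) -> dict:
    """Build a form -> list-of-lemmas map; one form may have several lemmas."""
    forms = defaultdict(list)
    for form, lemma in csv.reader(io.StringIO(csv_text)):
        if lemma not in forms[form]:
            forms[form].append(lemma)
    return dict(forms)


def lemmas(word: str, dictionary: dict) -> list:
    """Look the word up; unknown words are returned unchanged."""
    return dictionary.get(word.lower(), [word])
```

Note that the output is a lemma (a dictionary headword), not a root shared by related words, and that a word missing from the dictionary simply passes through, which is the gap a rule-based stemmer would have to cover.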
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317807#comment-15317807 ]

Andriy Rysin commented on LUCENE-7287:

I just realized that Lucene includes a morfologik analyzer (https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/morfologik/MorfologikAnalyzer.java). We already use the Ukrainian dictionary in morfologik format for LanguageTool (https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/resource/uk/ukrainian.dict). The file is about 1.6MB and should be quite fast and memory-efficient.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306947#comment-15306947 ]

Andriy Rysin commented on LUCENE-7287:

From my point of view we can use dict_uk as a source for Lucene (and we can provide an acceptable license). The question is whether we need hunspell data with affixes that are based on lemmas (a bit more work) or we can get away with a flat file as suggested by [~iorixxx] (this we can do pretty quickly).
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304009#comment-15304009 ]

Andriy Rysin commented on LUCENE-7287:

BTW, how does hunspell stemming work for "exceptions"? There are a bunch of words in Ukrainian whose inflections are hard to express as hunspell affix rules.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304000#comment-15304000 ]

Andriy Rysin commented on LUCENE-7287:

So do we need to build a hunspell dictionary (this may take me some time, probably a week or two), or is using StemmerOverrideFilter with the existing dictionary, as suggested by Ahmet, good enough?

BTW, the older Ukrainian hunspell dictionary used in http://github.com/elastic/hunspell is not very suitable as it's "too compact": it often combines multiple lemmas together (most frequently direct and reverse verbs, adjectives and adverbs, etc.).
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298680#comment-15298680 ]

Andriy Rysin commented on LUCENE-7287:

There's no alternative open dictionary for Ukrainian with acceptable quality (I know since I've been working on it for the last 10 years :)). But I can relicense https://github.com/arysin/dict_uk or its derivatives under MIT if it helps.
[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.
[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298470#comment-15298470 ]

Andriy Rysin commented on LUCENE-7287:

A quick check via jvisualvm shows ~400MB used by the dictionary map.

The dictionary originally comes from https://github.com/arysin/dict_uk. That project developed from the Ukrainian hunspell dictionary (which was very compact: on average each hunspell flag was producing 12 words) but has diverged a bit, and now the system of affixes in dict_uk is not compatible with the one in hunspell. I have it on my TODO list to add a converter to produce a hunspell dictionary from the dict_uk sources. If that helps here (I'm not familiar with the hunspell token filter in Lucene) I could move it a bit higher on my priority list.
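To make the size trade-off concrete, the following sketch shows how a compact "lemma + affix flag" dictionary expands into a much larger flat form -> lemma map. The affix class "A", its (strip, add) pairs, and the two sample lemmas are hypothetical; this is neither the hunspell file format nor the dict_uk rule syntax.

```python
# Hypothetical affix class: list of (suffix to strip from the lemma, suffix to add).
AFFIX_CLASSES = {
    "A": [("", ""), ("а", "и"), ("а", "у"), ("а", "ою"), ("а", "ами"), ("а", "ах")],
}

# Compact source dictionary: one line per lemma, plus its affix class.
ENTRIES = [("мапа", "A"), ("липа", "A")]


def expand(lemma: str, flag: str) -> list:
    """Generate the word forms produced by one affix class for one lemma."""
    forms = []
    for strip, add in AFFIX_CLASSES[flag]:
        if lemma.endswith(strip):
            stem = lemma[: len(lemma) - len(strip)]
            forms.append(stem + add)
    return forms


def build_form_map(entries: list) -> dict:
    """Invert the expansion into the flat form -> lemma map used for lookup."""
    form_map = {}
    for lemma, flag in entries:
        for form in expand(lemma, flag):
            form_map.setdefault(form, lemma)
    return form_map
```

Two source entries become twelve map entries here; at roughly 12 forms per flag and ~200K lemmas, a naive in-memory map of strings grows quickly, which is consistent with the memory figure reported above and with the appeal of a compact automaton-based format.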