[jira] [Created] (SOLR-11835) Adjust instructions for Ukrainian on LanguageAnalysis page

2018-01-09 Thread Andriy Rysin (JIRA)
Andriy Rysin created SOLR-11835:
---

 Summary: Adjust instructions for Ukrainian on LanguageAnalysis page
 Key: SOLR-11835
 URL: https://issues.apache.org/jira/browse/SOLR-11835
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Andriy Rysin


Since Lucene 6.6, the dictionary for the Ukrainian analyzer contains all proper 
names in lowercase, which seems to be a much better form for searching.
Can we please move LowerCaseFilterFactory back before MorfologikFilterFactory 
at 
https://lucene.apache.org/solr/guide/6_6/language-analysis.html#LanguageAnalysis-Ukrainian?
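
To illustrate why the filter order matters, here is a minimal Python sketch (not Lucene code). The tiny dictionary and the `lemmatize` and `analyze` helpers are hypothetical; the only assumption taken from the report is that all dictionary keys, including proper names, are lowercase, so a capitalized token is missed unless it is lowercased first.

```python
# Toy stand-in for the Morfologik dictionary: every key is lowercase,
# including proper names (assumption taken from the report above).
TOY_DICTIONARY = {
    "києва": "київ",   # genitive form of the city name
    "книги": "книга",
}

def lemmatize(token):
    """Return the lemma if the token is in the dictionary, else the token itself."""
    return TOY_DICTIONARY.get(token, token)

def analyze(tokens, lowercase_first):
    """Apply lowercasing and dictionary lookup in the requested order."""
    if lowercase_first:
        return [lemmatize(t.lower()) for t in tokens]
    return [lemmatize(t).lower() for t in tokens]

tokens = ["Києва", "Книги"]
lowercase_then_lemmatize = analyze(tokens, lowercase_first=True)
lemmatize_then_lowercase = analyze(tokens, lowercase_first=False)
print(lowercase_then_lemmatize)
print(lemmatize_then_lowercase)
```

With lowercasing first, both tokens hit the dictionary; with the lookup first, the capitalized forms miss and are only lowercased.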



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7973) Update dictionary version for Ukrainian analyzer to 3.9.0

2017-10-07 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195876#comment-16195876
 ] 

Andriy Rysin commented on LUCENE-7973:
--

It looks like I need to remove my 3 pull requests from above now, right?

> Update dictionary version for Ukrainian analyzer to 3.9.0
> -
>
> Key: LUCENE-7973
> URL: https://issues.apache.org/jira/browse/LUCENE-7973
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Andriy Rysin
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 7.1
>
>
> Update morfologik dictionary version to 3.9.0 for Ukrainian analyzer.
> There are 60K new lemmas along with some other improvements and fixes; 
> in particular, Ukrainian town names have been synchronized with the official 
> standard.






[jira] [Created] (LUCENE-7973) Update dictionary version for Ukrainian analyzer

2017-09-20 Thread Andriy Rysin (JIRA)
Andriy Rysin created LUCENE-7973:


 Summary: Update dictionary version for Ukrainian analyzer
 Key: LUCENE-7973
 URL: https://issues.apache.org/jira/browse/LUCENE-7973
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Andriy Rysin


Update morfologik dictionary version to 3.9.0 for Ukrainian analyzer.
There are 60K new lemmas along with some other improvements and fixes; 
in particular, Ukrainian town names have been synchronized with the official standard.






[jira] [Commented] (LUCENE-7841) Normalize ґ to г in Ukrainian analyzer

2017-05-23 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021165#comment-16021165
 ] 

Andriy Rysin commented on LUCENE-7841:
--

Thanks Dawid, I've pushed the checksum and change file changes on all 3 tracks, 
`ant precommit` runs clean now.

> Normalize ґ to г in Ukrainian analyzer
> --
>
> Key: LUCENE-7841
> URL: https://issues.apache.org/jira/browse/LUCENE-7841
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Andriy Rysin
>Priority: Minor
>
> The letter ґ was reintroduced into the Ukrainian alphabet in 1990, and many 
> Ukrainian texts don't use it consistently, so search will benefit if we 
> normalize it to г. 






[jira] [Created] (LUCENE-7841) Normalize ґ to г in Ukrainian analyzer

2017-05-19 Thread Andriy Rysin (JIRA)
Andriy Rysin created LUCENE-7841:


 Summary: Normalize ґ to г in Ukrainian analyzer
 Key: LUCENE-7841
 URL: https://issues.apache.org/jira/browse/LUCENE-7841
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Andriy Rysin
Priority: Minor


The letter ґ was reintroduced into the Ukrainian alphabet in 1990, and many 
Ukrainian texts don't use it consistently, so search will benefit if we 
normalize it to г. 
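
A minimal Python sketch of the proposed normalization, outside of Lucene. The `normalize_ghe` helper name is hypothetical, and mapping the uppercase letter as well is an assumption made here for completeness.

```python
# Map both cases of ґ (U+0491, U+0490) to г (U+0433, U+0413).
GHE_NORMALIZATION = str.maketrans({"ґ": "г", "Ґ": "Г"})

def normalize_ghe(text):
    """Replace every ґ/Ґ with г/Г and leave all other characters alone."""
    return text.translate(GHE_NORMALIZATION)

normalized = normalize_ghe("ґанок і Ґете")
print(normalized)
```

After normalization, a query spelled with г and a text spelled with ґ (or the other way around) produce the same characters, so they can match.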






[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-19 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974717#comment-15974717
 ] 

Andriy Rysin commented on LUCENE-7785:
--

Thanks Dawid! Thanks everybody for your help and feedback!

> Move dictionary for Ukrainian analyzer to external dependency
> -
>
> Key: LUCENE-7785
> URL: https://issues.apache.org/jira/browse/LUCENE-7785
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Andriy Rysin
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 6.x, master (7.0), 6.6
>
>
> Currently the dictionary for the Ukrainian analyzer is a blob in the source tree. 
> We should move it out to an external dependency. This allows us:
> * to have fewer binaries in the source
> * to update the dictionary and track updates more easily






[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-14 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969418#comment-15969418
 ] 

Andriy Rysin commented on LUCENE-7785:
--

`ant precommit` is happy now




[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-14 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969405#comment-15969405
 ] 

Andriy Rysin commented on LUCENE-7785:
--

Ahh, I see what you mean, I'll push the fix for the order once my `ant 
precommit` succeeds.




[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-14 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969389#comment-15969389
 ] 

Andriy Rysin commented on LUCENE-7785:
--

Here's what I get:
check-lib-versions:
 [echo] Lib versions check under: 
/home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/..
[libversions] :: loading settings :: file = 
/home/master/work/ukr/spelling/lucene-workspace/lucene-solr/lucene/top-level-ivy-settings.xml
[libversions] OUT-OF-ORDER coordinate key '/org.ccil.cowan.tagsoup/tagsoup' in 
ivy-versions.properties
[libversions] Checked that ivy-versions.properties and 
ivy-ignore-conflicts.properties have lexically sorted '/org/name' keys and no 
duplicates or orphans.
[libversions] Scanned 46 ivy.xml files for rev="${/org/name}" format.
[libversions] Found 0 indirect dependency version conflicts.
[libversions] Completed in 1.24s., 1 error(s).
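
The log above says the check expects lexically sorted '/org/name' keys. As a rough Python sketch of that kind of check (not the actual build task), the function below reports the first key that sorts before its predecessor. The `first_out_of_order_key` name, the plain string comparison, and the sample entries with placeholder versions are all assumptions; only the tagsoup key comes from the log.

```python
def first_out_of_order_key(lines):
    """Return the first key that sorts before its predecessor, or None."""
    previous = None
    for line in lines:
        stripped = line.strip()
        # Skip blanks and comments; only '/org/name = version' lines carry keys.
        if not stripped or stripped.startswith("#"):
            continue
        key = stripped.split("=", 1)[0].strip()
        if previous is not None and key < previous:
            return key
        previous = key
    return None

# Hypothetical file content: a new entry was added above the tagsoup line
# although its key sorts after it.
sample = [
    "/org.apache.example/a = 1.0",
    "/ua.example/dictionary = 1.0",
    "/org.ccil.cowan.tagsoup/tagsoup = 1.0",
]
offender = first_out_of_order_key(sample)
print(offender)
```

This shows one way such a message could arise: the key that gets reported is not necessarily the one that was just added, but the one that follows a misplaced entry.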





[jira] [Commented] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-14 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969367#comment-15969367
 ] 

Andriy Rysin commented on LUCENE-7785:
--

Ok, thanks for the suggestions, I was able to run `ant precommit` and I've 
added/adjusted the files to make it happier.
It still fails for me, but due to some issue with 
`/org.ccil.cowan.tagsoup/tagsoup`; hopefully the files related to this issue are 
good now.




[jira] [Created] (LUCENE-7785) Move dictionary for Ukrainian analyzer to external dependency

2017-04-13 Thread Andriy Rysin (JIRA)
Andriy Rysin created LUCENE-7785:


 Summary: Move dictionary for Ukrainian analyzer to external 
dependency
 Key: LUCENE-7785
 URL: https://issues.apache.org/jira/browse/LUCENE-7785
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Andriy Rysin


Currently the dictionary for the Ukrainian analyzer is a blob in the source tree. 
We should move it out to an external dependency. This allows us:
* to have fewer binaries in the source
* to update the dictionary and track updates more easily






[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-11-02 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628837#comment-15628837
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Cassandra, it looks like 6.2 is out. Could you please add a Ukrainian section to 
https://cwiki.apache.org/confluence/display/solr/Language+Analysis ?

> New lemma-tizer plugin for ukrainian language.
> --
>
> Key: LUCENE-7287
> URL: https://issues.apache.org/jira/browse/LUCENE-7287
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Dmytro Hambal
>Priority: Minor
>  Labels: analysis, language, plugin
> Fix For: master (7.0), 6.2
>
> Attachments: LUCENE-7287.patch, Screen Shot 2016-06-23 at 8.23.01 
> PM.png, Screen Shot 2016-06-23 at 8.41.28 PM.png
>
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a 
> mapping between ukrainian word forms and their lemmas. Some tests and docs go 
> out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303






[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-07-06 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365063#comment-15365063
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Thanks Michael, much appreciated!




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-30 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358188#comment-15358188
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Hey [~mikemccand], can we please merge the pull request above? That should wrap 
up the dictionary-based analyzer for Ukrainian. Thanks!




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-24 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348927#comment-15348927
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Ok, I was able to run solr with Ukrainian analyzer and I can confirm it 
generates unique lemmas.
I've created a pull request https://github.com/apache/lucene-solr/pull/45

I've also added mapping_uk.txt so we can use mapping filter in solr, once it's 
merged we can add this line:


We could potentially change UkrainianMorfologikAnalyzer to use 
MappingCharFilterFactory to read from the same file (so we don't have the 
mapping both in the code and in the file), but I'm not sure how appropriate it is 
to use factories in Lucene.

Many thanks to Ahmet who helped with solr integration and found duplicate 
tokens!




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-24 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348585#comment-15348585
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I've created a dictionary that collapses token+lemma into one record (like 
the Polish dictionary does) and added tests to make sure we don't generate 
duplicate lemmas.
I'll do a bit more testing and then create a pull request.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-23 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15347153#comment-15347153
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Ok, then I'll prepare the changes as part of this ticket.

I've looked deeper into the morfologik dictionaries we have in LanguageTool: 
the Polish one has token+lemma normalized (with POS tags concatenated for each 
unique token+lemma), while other dictionaries, including Ukrainian, have separate 
records, so token+lemma is not unique. I've sent an email to the morfologik 
guys, and once I get an explanation I'll update the dictionary appropriately so 
we don't have duplicates.
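
The normalization described above (one record per unique token+lemma, with the POS tags concatenated) could be sketched in Python roughly as follows. The `collapse_records` helper, the "+" separator, and the tag strings are illustrative assumptions, not the actual dictionary format or build script.

```python
def collapse_records(records, separator="+"):
    """Merge (token, lemma, pos) records so each (token, lemma) pair appears once."""
    merged = {}
    for token, lemma, pos in records:
        tags = merged.setdefault((token, lemma), [])
        if pos not in tags:
            tags.append(pos)
    return [(token, lemma, separator.join(tags))
            for (token, lemma), tags in merged.items()]

# Made-up records: one word form with the same lemma under three tags.
records = [
    ("книги", "книга", "noun:gen:sg"),
    ("книги", "книга", "noun:nom:pl"),
    ("книги", "книга", "noun:acc:pl"),
]
collapsed = collapse_records(records)
print(collapsed)
```

With one record per token+lemma pair, a lookup of the word form yields the lemma once instead of once per POS tag.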




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-23 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346878#comment-15346878
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Hmm, that does not look right. Yes, we can either use 
RemoveDuplicatesTokenFilterFactory (we'll have to add that to the 
UkrainianMorfologikAnalyzer too) or I can rebuild the dictionary to remove 
the duplicates (probably the preferred way).
The problem is that the dictionary is currently a POS dictionary, so there may 
be duplicate lemma records as long as the POS tags are different.
I am thinking of filing a new JIRA issue for that and providing a pull request; 
does that make sense?
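
For the first option, the effect of removing duplicate tokens can be sketched in Python as below. This is only a model of the idea (drop a term already seen at the same position), not the Lucene filter itself; the `remove_duplicates` helper and the sample token stream are hypothetical.

```python
def remove_duplicates(tokens):
    """Drop a (position, term) pair if the same term was already seen at that position."""
    seen = set()
    unique = []
    for position, term in tokens:
        if (position, term) not in seen:
            seen.add((position, term))
            unique.append((position, term))
    return unique

# Hypothetical lemmatizer output: the same lemma is emitted once per POS record.
lemmatizer_output = [(0, "книга"), (0, "книга"), (1, "мати"), (1, "мати"), (1, "мати")]
deduplicated = remove_duplicates(lemmatizer_output)
print(deduplicated)
```

Rebuilding the dictionary avoids producing the duplicates in the first place, which is why it is described above as probably the preferred way.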




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-23 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346862#comment-15346862
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Thanks Ahmet!
Shall I create mappings_uk.txt so we can use it in solr?
As for the multiple tokens, MorfologikFilter produces lemmas, so (as I 
understand it) it may output multiple tokens for a single input 
token.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-23 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346413#comment-15346413
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Sure, I can add a comment, but I guess I need to test the solution first, and as 
I am not familiar with Solr it may take me a few days. If [~iorixxx] has 
already verified this solution, we can just post it.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-22 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344396#comment-15344396
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I've logged into cwiki but I don't seem to have rights to edit the page.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-22 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15344252#comment-15344252
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Thanks Ahmet, that looks good! Would you add/push those changes or shall I work 
on this?




[jira] [Comment Edited] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-21 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343258#comment-15343258
 ] 

Andriy Rysin edited comment on LUCENE-7287 at 6/22/16 3:07 AM:
---

I don't know much about solr, but I think MorfologikFilterFactory uses 
dictionary= parameter instead of dictionary-resource=
https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html

Also would that mean that we don't get the stop words filter and 
apostrophe/stress character normalization?


was (Author: arysin):
I don't know much about solr, but I think MorfologikFilterFactory uses 
dictionary= parameter instead of dictionary-resource=
https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-21 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15343258#comment-15343258
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I don't know much about solr, but I think MorfologikFilterFactory uses 
dictionary= parameter instead of dictionary-resource=
https://lucene.apache.org/core/6_1_0/analyzers-morfologik/org/apache/lucene/analysis/morfologik/MorfologikFilterFactory.html




[jira] [Commented] (LUCENE-7348) Add dynamic stemmer for Ukrainian

2016-06-20 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341014#comment-15341014
 ] 

Andriy Rysin commented on LUCENE-7348:
--

[~mikemccand] Hey Michael,
I've analyzed the inflection rules in the dict_uk project 
(https://github.com/arysin/dict_uk): there are ~4500 of them (most are simple 
matches, but some are regexps). Those rules cover almost all possible affixes. 
I can probably drop the rare and homonymic ones to get below 4k, but then the 
question is where to go next:
1) keeping all the rules would be nice, as it would provide high accuracy and a 
high level of compatibility with the dictionary-based lemmatizer created in 
LUCENE-7287 (we could probably even make a hybrid solution)
2) a smaller/simpler rule set would benefit performance (but to simplify it 
properly we would have to analyze the frequency/importance of each rule)
3) is lemmatizing analysis good enough, or is stemming preferred? For real 
stemming we would have to work more on the rules to find the (pseudo)roots for 
each inflection rule
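
As a rough illustration of what regex-driven inflection rules could look like 
in code, here is a minimal Java sketch. The class name ToySuffixRuleLemmatizer 
and the two sample rules are hypothetical and are not taken from dict_uk; the 
sketch only shows first-match suffix rewriting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical toy class: rewrites a word ending using the first matching regex rule. */
final class ToySuffixRuleLemmatizer {
    /** One inflection rule: a suffix pattern and the lemma ending that replaces it. */
    private static final class Rule {
        final Pattern suffix;
        final String lemmaEnding;

        Rule(String suffixRegex, String lemmaEnding) {
            this.suffix = Pattern.compile(suffixRegex);
            this.lemmaEnding = lemmaEnding;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String suffixRegex, String lemmaEnding) {
        rules.add(new Rule(suffixRegex, lemmaEnding));
    }

    /** Applies the first rule whose pattern matches; unmatched words are returned as-is. */
    String lemmatize(String word) {
        for (Rule rule : rules) {
            Matcher m = rule.suffix.matcher(word);
            if (m.find()) {
                return m.replaceFirst(Matcher.quoteReplacement(rule.lemmaEnding));
            }
        }
        return word;
    }

    /** Two made-up rules for feminine nouns ending in -а (illustrative only). */
    static ToySuffixRuleLemmatizer sample() {
        ToySuffixRuleLemmatizer lemmatizer = new ToySuffixRuleLemmatizer();
        lemmatizer.addRule("ами$", "а"); // instrumental plural: мовами -> мова
        lemmatizer.addRule("ою$", "а");  // instrumental singular: мовою -> мова
        return lemmatizer;
    }

    public static void main(String[] args) {
        ToySuffixRuleLemmatizer lemmatizer = sample();
        System.out.println(lemmatizer.lemmatize("мовами"));
        System.out.println(lemmatizer.lemmatize("мовою"));
        System.out.println(lemmatizer.lemmatize("кіт"));
    }
}
```

Rule order matters in this sketch because the first matching rule wins, which 
is one reason pruning the rule set would need the frequency analysis mentioned 
in point 2.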

I looked at the existing light stemmers and many are very basic. It looks 
like we're going in reverse, and I am trying to understand whether, already 
having a complex solution, we want to make it simpler (it looks like the only 
benefit would be performance). I also tried to google how to do stemming 
"right", but nothing serious turned up, especially nothing applicable to 
Slavic languages.

Thanks.


> Add dynamic stemmer for Ukrainian
> -
>
> Key: LUCENE-7348
> URL: https://issues.apache.org/jira/browse/LUCENE-7348
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/analysis
>Reporter: Andriy Rysin
>Priority: Minor
>  Labels: analysis, language
>
> We're adding a dictionary based lemmatizing analyzer for Ukrainian in 
> https://issues.apache.org/jira/browse/LUCENE-7287.
> It would be nice to have a dynamic stemmer that can handle words that are not 
> in the dictionary.






[jira] [Created] (LUCENE-7348) Add dynamic stemmer for Ukrainian

2016-06-20 Thread Andriy Rysin (JIRA)
Andriy Rysin created LUCENE-7348:


 Summary: Add dynamic stemmer for Ukrainian
 Key: LUCENE-7348
 URL: https://issues.apache.org/jira/browse/LUCENE-7348
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/analysis
Reporter: Andriy Rysin
Priority: Minor


We're adding a dictionary based lemmatizing analyzer for Ukrainian in 
https://issues.apache.org/jira/browse/LUCENE-7287.
It would be nice to have a dynamic stemmer that can handle words that are not 
in the dictionary.






[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-20 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340996#comment-15340996
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Looks cool, thanks a lot Michael!

I wonder if we should add a little javadoc for this analyzer saying that it's 
dictionary-based, so that if we add a light-stemming analyzer users can easily 
tell the difference.
Also, since I created the project I've updated the dictionary once 
(https://github.com/arysin/lucene_uk/commit/7cc8bea59c402e9b9729afd63d0a53cb34045e750); 
I'm not sure if you're using the latest update.

I'll open another issue for the "light" stemmer for Ukrainian.





[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-18 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337982#comment-15337982
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I guess it does not fit under analysis/common as it depends on Morfologik, so 
analysis/ukrainian is probably a good place. Or we could put it under 
analysis/morfologik (as a .uk subpackage) - it's your call. If we do that, will 
the stopwords go with the stemmer or should they live under common/ (as they 
are not morfologik-specific and may be used by other Ukrainian 
implementations)?
I am also wondering whether we could build a generic stemmer for Ukrainian 
based on the affix rules we have in the dict_uk project (they are hunspell-like 
but fully based on regular expressions, which makes them far more compact).




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-16 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333888#comment-15333888
 ] 

Andriy Rysin commented on LUCENE-7287:
--

[~mikemccand], [~iorixxx] does this implementation look good enough for 
inclusion? Is there anything else that needs to be done? Thanks.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-12 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326753#comment-15326753
 ] 

Andriy Rysin commented on LUCENE-7287:
--

Thanks for the hint, I've changed the code to use MappingCharFilter.
It's slightly slower but architecturally more correct.
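
To make the idea of a character-mapping step concrete, here is a minimal 
plain-Java sketch of the kind of normalization involved. It is a stand-in, not 
Lucene's MappingCharFilter API: the class UkrainianCharNormalizer is 
hypothetical, and the specific code points (U+2019, U+02BC, U+0301) are 
assumptions about which apostrophe variants and stress mark would be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy class: normalizes apostrophe variants and drops the stress mark. */
final class UkrainianCharNormalizer {
    // Assumed mapping table, applied before tokenization.
    private static final Map<String, String> MAPPINGS = new LinkedHashMap<>();

    static {
        MAPPINGS.put("\u2019", "'"); // U+2019 RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
        MAPPINGS.put("\u02BC", "'"); // U+02BC MODIFIER LETTER APOSTROPHE -> ASCII apostrophe
        MAPPINGS.put("\u0301", "");  // U+0301 COMBINING ACUTE ACCENT (stress) -> removed
    }

    /** Replaces every mapped sequence in the input; other characters are untouched. */
    static String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> mapping : MAPPINGS.entrySet()) {
            result = result.replace(mapping.getKey(), mapping.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("п\u2019ять"));
        System.out.println(normalize("ма\u0301ма"));
    }
}
```

Doing this at the character level, before tokenization, means the tokenizer 
and the dictionary lookup only ever see one apostrophe form.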




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-12 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326624#comment-15326624
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I've added a token filter for Unicode apostrophes and the stress mark.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-11 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326066#comment-15326066
 ] 

Andriy Rysin commented on LUCENE-7287:
--

OK guys, I've created a little project with a Ukrainian analyzer for Lucene 
using MorfologikAnalyzer: https://github.com/arysin/lucene_uk
The test 
(https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/apache/lucene/analysis/uk/TestUkrainianAnalyzer.java) 
runs successfully inside Lucene, but I can't run it in my project (I get an NPE 
at RunListenerPrintReproduceInfo.java:131).
I can run a simple standalone test app with no problem, though: 
https://github.com/arysin/lucene_uk/blob/master/src/test/java/org/lucene_uk/test/LuceneTest.java
For simplicity, for now I've just included the Ukrainian binary morfologik 
dictionary in the project itself. The only currently published artifact with 
the Ukrainian dictionary is 
http://mvnrepository.com/artifact/org.languagetool/language-uk, but it requires 
languagetool-core, and dragging that into Lucene probably does not make sense. 
If the PoC is good enough I can take a shot at creating a separate artifact 
with just the dictionary (this may take some time), or we can just live with 
the blob in Lucene.

I would appreciate it if you could take a look and let me know how it looks. If 
it's acceptable, I would need to work on including some of the goodies from 
Dmytro's project: handling different apostrophes and ignoring the accent 
character.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-09 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15323800#comment-15323800
 ] 

Andriy Rysin commented on LUCENE-7287:
--

OK, I've imported lucene-solr and the Ukrainian analyzer project from 
[~mr_gambal] into Eclipse and looked through the code.
Unfortunately we can't use the whole morfologik package as is - it's very 
specific to Polish. We could probably still use part of morfologik for a 
compact dictionary representation. The whole Ukrainian dictionary in this 
format with POS tags is ~1.6MB, compared to 98MB in CSV, and we could probably 
make it smaller if we strip the tags.
There are several things I'd like to note:
1) this dictionary covers inflections (not related words), so this stemming 
will produce lemmas rather than root words (this is probably OK and in some 
cases even better?)
2) as this is dictionary-based stemming, it won't stem unknown words (but the 
dictionary contains ~200K lemmas, so it should give good output)
3) as Ukrainian is highly inflected (nouns produce up to 7 forms, direct verbs 
up to 20, reverse verbs up to 30), with many rules and exceptions, developing 
quality rule-based stemming will not be trivial
4) I was planning to work on the Ukrainian analyzer in a separate project, but 
if it's better for the review process I can fork lucene-solr and work inside 
the fork
5) I am thinking of creating org.apache.lucene.analysis.uk classes based on 
[~mr_gambal]'s work and the CSV file we have and, once that's working, trying a 
more compact representation
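
Points 1 and 2 above can be sketched in a few lines of plain Java. This is 
only an illustration of dictionary-based lookup with pass-through for unknown 
words; the class ToyDictionaryLemmatizer and its four sample entries are 
hypothetical and are not taken from the real dictionary or from the plugin's 
code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy class: maps known inflected forms to their lemma. */
final class ToyDictionaryLemmatizer {
    private final Map<String, String> formToLemma = new HashMap<>();

    void add(String form, String lemma) {
        formToLemma.put(form, lemma);
    }

    /** Returns the lemma of a known form; a word missing from the dictionary is returned unchanged. */
    String lemmatize(String token) {
        return formToLemma.getOrDefault(token, token);
    }

    /** A handful of made-up entries standing in for a real form-to-lemma dictionary. */
    static ToyDictionaryLemmatizer sample() {
        ToyDictionaryLemmatizer lemmatizer = new ToyDictionaryLemmatizer();
        lemmatizer.add("мови", "мова");
        lemmatizer.add("мовою", "мова");
        lemmatizer.add("читає", "читати");
        lemmatizer.add("читали", "читати");
        return lemmatizer;
    }

    public static void main(String[] args) {
        ToyDictionaryLemmatizer lemmatizer = sample();
        System.out.println(lemmatizer.lemmatize("мовою"));   // known form
        System.out.println(lemmatizer.lemmatize("гуглити")); // not in the toy dictionary
    }
}
```

The output of such a lookup is a lemma (a dictionary headword), not a shared 
root, and anything outside the dictionary is left untouched.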

The question: once we have it working, shall we include the dictionary in the 
Lucene project or make it an external dependency (as with 
morfologik-polish.jar)? The first is simpler, but the second will allow easy 
updates of the dictionary (which I can see being actively developed for another 
year or two) and will also keep the binary blob out of the project. I am 
leaning towards the second but am open to discussion.






[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-06-06 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317807#comment-15317807
 ] 

Andriy Rysin commented on LUCENE-7287:
--

I just realized that Lucene includes a morfologik analyzer 
(https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/morfologik/MorfologikAnalyzer.java).
We already use the Ukrainian dictionary in morfologik format for LanguageTool 
(https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/resource/uk/ukrainian.dict).
It's about 1.6MB on disk and should be quite fast and memory efficient.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-05-30 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306947#comment-15306947
 ] 

Andriy Rysin commented on LUCENE-7287:
--

From my point of view we can use dict_uk as a source for Lucene (and we can 
provide an acceptable license). The question is whether we need hunspell data 
with affixes that are based on lemmas (a bit more work) or we can get away 
with a flat file as suggested by [~iorixxx] (this we can do pretty quickly).




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-05-27 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304009#comment-15304009
 ] 

Andriy Rysin commented on LUCENE-7287:
--

BTW, how does hunspell stemming work for "exceptions"? There are a bunch of 
words in Ukrainian whose inflections are hard to express in hunspell affix 
rules.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-05-27 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304000#comment-15304000
 ] 

Andriy Rysin commented on LUCENE-7287:
--

So do we need to build a hunspell dictionary (this may take me some time, 
probably a week or two), or is using StemmerOverrideFilter with the existing 
dictionary, as suggested by Ahmet, good enough?
BTW, the older Ukrainian hunspell dictionary used in 
http://github.com/elastic/hunspell is not very suitable as it's "too compact" - 
it often combines multiple lemmas together (most frequently direct and reverse 
verbs, adjectives and adverbs, etc.).




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-05-24 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298680#comment-15298680
 ] 

Andriy Rysin commented on LUCENE-7287:
--

There's no alternative open dictionary for Ukrainian of acceptable quality (I 
know, since I've been working on it for the last 10 years :)).
But I can relicense https://github.com/arysin/dict_uk or its derivatives 
under MIT if that helps.




[jira] [Commented] (LUCENE-7287) New lemma-tizer plugin for ukrainian language.

2016-05-24 Thread Andriy Rysin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15298470#comment-15298470
 ] 

Andriy Rysin commented on LUCENE-7287:
--

A quick check via jvisualvm shows ~400MB used by the dictionary map. The 
dictionary originally comes from https://github.com/arysin/dict_uk; that 
project developed from the Ukrainian hunspell dictionary (which was very 
compact: on average each hunspell flag produced 12 words) but has diverged a 
bit, and now the system of affixes in dict_uk is not compatible with that in 
hunspell.
I have on my TODO list to add a converter to produce a hunspell dictionary from 
the dict_uk sources. If that helps here (I'm not familiar with the hunspell 
token filter in Lucene) I could move it a bit higher in my priorities.
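
To illustrate why an affix-based dictionary is so much more compact than a 
flat form-to-lemma map, here is a small Java sketch that expands one lemma 
into its inflected forms and records each form in a map. The class 
ToyAffixExpander, the strip/endings representation, and the sample endings are 
assumptions made for illustration; they do not reproduce hunspell's file 
format or dict_uk's actual affix system.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy class: expands a lemma plus a list of endings into a form-to-lemma map. */
final class ToyAffixExpander {
    /**
     * Removes {@code strip} from the end of the lemma, then appends each ending to the
     * remaining stem. Every generated form is mapped back to the lemma.
     */
    static Map<String, String> expand(String lemma, String strip, List<String> endings) {
        if (!lemma.endsWith(strip)) {
            throw new IllegalArgumentException("lemma does not end with: " + strip);
        }
        String stem = lemma.substring(0, lemma.length() - strip.length());
        Map<String, String> formToLemma = new HashMap<>();
        for (String ending : endings) {
            formToLemma.put(stem + ending, lemma);
        }
        return formToLemma;
    }

    /** Illustrative endings for the noun "мова"; the empty ending yields the bare stem "мов". */
    static Map<String, String> sample() {
        List<String> endings = Arrays.asList("а", "и", "і", "у", "ою", "о", "ам", "ами", "ах", "");
        return expand("мова", "а", endings);
    }

    public static void main(String[] args) {
        Map<String, String> forms = sample();
        System.out.println(forms.size() + " forms generated from one lemma");
        System.out.println(forms.get("мовою"));
    }
}
```

One lemma plus a short list of endings becomes ten map entries in this sketch; 
repeated over a large lemma list, that kind of expansion is how a flat 
in-memory map can end up far larger than the affix-based source.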
