[
https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588901#comment-16588901
]
Robert Muir commented on LUCENE-8462:
-------------------------------------
This change looks great; however, I'm wondering if we can produce an alternative
vocabulary list for test purposes?
1. This list is *huge*. I think it may be autogenerated (morph generation or
something). It causes our test data to jump from 2MB to 28MB. In general we
just want a simple list to catch us if we introduce some bug. All the other
languages combined are 2MB compressed; the Arabic one alone is 26MB compressed.
2. Unlike all the other snowball test data, this Arabic vocabulary list is
explicitly labeled as GPL. I'm personally not comfortable committing GPL stuff
(even test data .txt files) without asking for more guidance first. But maybe
we can avoid this problem since I don't think we want such an enormous list.
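One way to get a smaller list would be to subsample the existing one. The
following is only a rough sketch, not something from the patch: it assumes the
test data is a pair of line-aligned lists (input words and their expected
stems), and the function name `sample_vocab`, the sample size, and the seed are
all made up for illustration. Note that subsampling a GPL-labeled list would not
by itself settle the licensing question above.

```python
import random


def sample_vocab(words, stems, k, seed=0):
    """Pick k aligned (word, stem) pairs, keeping their original order.

    `words` and `stems` are assumed to be line-aligned lists: stems[i] is the
    expected stem of words[i]. The fixed seed makes the selection reproducible.
    """
    if len(words) != len(stems):
        raise ValueError("word and stem lists must be line-aligned")
    k = min(k, len(words))
    rng = random.Random(seed)
    # Sample indices rather than items so both lists stay aligned.
    keep = sorted(rng.sample(range(len(words)), k))
    return [words[i] for i in keep], [stems[i] for i in keep]
```

A fixed seed means the reduced list is the same on every run, so a test
failure would point at a stemmer change rather than at a different sample.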
> New Arabic snowball stemmer
> ---------------------------
>
> Key: LUCENE-8462
> URL: https://issues.apache.org/jira/browse/LUCENE-8462
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ryadh Dahimene
> Priority: Trivial
> Labels: Arabic, snowball, stemmer
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Added a new Arabic snowball stemmer based on
> [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl]
> As well as an Arabic test dataset in `TestSnowballVocabData.zip` from the
> snowball-data available here
> [https://github.com/snowballstem/snowball-data/tree/master/arabic]
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/439]
>
>