[ 
https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588901#comment-16588901
 ] 

Robert Muir commented on LUCENE-8462:
-------------------------------------

This change looks great, but I'm wondering whether we can produce an 
alternative vocabulary list for test purposes?
1. This list is *huge*. I think it may be autogenerated (morph generation or 
something). It causes our test data to jump from 2MB to 28MB. In general we 
just want a simple list that catches us if we introduce a bug. All the other 
languages combined are 2MB compressed, while the Arabic one alone is 26MB 
compressed.
2. Unlike all the other Snowball test data, this Arabic vocabulary list is 
explicitly labeled as GPL. I'm personally not comfortable committing GPL 
material (even test data .txt files) without asking for more guidance first. 
But maybe we can avoid this problem entirely, since I don't think we want such 
an enormous list anyway.
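
As a rough illustration of point 1 only, a much smaller list could be derived 
by sampling word/stem pairs reproducibly. This is a minimal Python sketch, not 
part of the patch: the function name `sample_vocab` is hypothetical, and it 
assumes the data is available as two parallel lists (input words and expected 
stems). A subset of a GPL-labeled list would not by itself settle the licensing 
question in point 2.

```python
import random


def sample_vocab(words, stems, k, seed=0):
    """Return at most k (word, stem) pairs, chosen reproducibly.

    `words` and `stems` are assumed to be parallel lists: stems[i] is the
    expected stemmer output for words[i].
    """
    if len(words) != len(stems):
        raise ValueError("word and stem lists must have the same length")
    pairs = list(zip(words, stems))
    if k >= len(pairs):
        # Nothing to shrink: keep the full list.
        return pairs
    rng = random.Random(seed)  # fixed seed so the subset is stable across runs
    # Sample distinct indices, then sort them to preserve the original order.
    chosen = sorted(rng.sample(range(len(pairs)), k))
    return [pairs[i] for i in chosen]
```

With a fixed seed the same subset is produced every time, so the reduced test 
data stays stable between runs.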

> New Arabic snowball stemmer
> ---------------------------
>
>                 Key: LUCENE-8462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8462
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ryadh Dahimene
>            Priority: Trivial
>              Labels: Arabic, snowball, stemmer
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Added a new Arabic Snowball stemmer based on 
> [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl],
> as well as an Arabic test dataset in `TestSnowballVocabData.zip`, taken from 
> the snowball-data repository available here: 
> [https://github.com/snowballstem/snowball-data/tree/master/arabic]
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/439]