[ 
https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646314#comment-16646314
 ] 

Ryadh Dahimene commented on LUCENE-8462:
----------------------------------------

Hi Lucene team,
Here is a quick summary of the state of this change. In this version, the 
contributed Snowball Arabic stemmer has been generated using the 
`ant patch-snowball` task. To achieve that, the ant task has been updated: it 
is now compatible with the latest version of Snowball (revision 
1964ce688cbeca505263c8f77e16ed923296ce7a) and remains backward compatible with 
the revision of the Snowball repository currently used by Lucene. In my 
opinion, this change is now ready and will allow users to use the new Arabic 
Snowball stemmer.

In the longer term, I believe it would be better to sync all the Lucene 
Snowball stemmers with the latest version of the Snowball stemmers 
(https://github.com/snowballstem/snowball). This would allow a smoother 
integration of newly added languages as well as updated ones, and would 
reduce the complexity of the `ant patch-snowball` task. The version currently 
used is based on revision 502 of the Tartarus Snowball repository 
(https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b),
which is now more than 10 years old.

That would be a wider change, and its impact has yet to be assessed, but if 
the team believes it is relevant and sees value in it, I'll be happy to 
invest some time in this task.

> New Arabic snowball stemmer
> ---------------------------
>
>                 Key: LUCENE-8462
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8462
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ryadh Dahimene
>            Priority: Trivial
>              Labels: Arabic, snowball, stemmer
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Added a new Arabic Snowball stemmer based on 
> [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl],
> as well as an Arabic test dataset in `TestSnowballVocabData.zip`, generated 
> from the input file available here: 
> [https://github.com/ibnmalik/golden-corpus-arabic/blob/develop/core/words.txt]
>  
> It also updates the {{ant patch-snowball}} target to be compatible with
> the Java classes generated by the latest Snowball version (tree:
> 1964ce688cbeca505263c8f77e16ed923296ce7a). The {{ant patch-snowball}} target
> remains backward compatible with the version of the Snowball stemmers used in
> Lucene 7.x and ignores already patched classes.
>  
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/449]
>  Edited: updated the corpus link, PR link and description
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
