[
https://issues.apache.org/jira/browse/LUCENE-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646314#comment-16646314
]
Ryadh Dahimene commented on LUCENE-8462:
----------------------------------------
Hi Lucene team,
Just a quick summary of the state of this change. In this version, the
contributed Snowball Arabic stemmer has been generated using the `ant
patch-snowball` task. To achieve that, the ant task has been updated: it is
now compatible with the latest version of Snowball (revision
1964ce688cbeca505263c8f77e16ed923296ce7a) and also backward compatible with
the revision of the Snowball repository currently used by Lucene. In my
opinion, this change is now ready and will allow users to use the new Arabic
Snowball stemmer.
In the longer term, I believe it would be better if all the Lucene
Snowball stemmers were synced with the latest version of the Snowball stemmers
(https://github.com/snowballstem/snowball). This would allow smoother
integration of newly added languages as well as updated ones, and would
reduce the complexity of the `ant patch-snowball` task. The version currently
used is based on revision 502 of the Tartarus Snowball repository
(https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b)
and is now more than 10 years old.
This is a wider change whose impact has yet to be assessed, but if the team
believes it is relevant and sees value in it, I'll be happy to invest some
time in this task.
> New Arabic snowball stemmer
> ---------------------------
>
> Key: LUCENE-8462
> URL: https://issues.apache.org/jira/browse/LUCENE-8462
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ryadh Dahimene
> Priority: Trivial
> Labels: Arabic, snowball, stemmer
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Added a new Arabic snowball stemmer based on
> [https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl]
> As well as an Arabic test dataset in `TestSnowballVocabData.zip` from the
> -snowball-data- generated from the input file available here
> -[https://github.com/snowballstem/snowball-data/tree/master/arabic]-
> [https://github.com/ibnmalik/golden-corpus-arabic/blob/develop/core/words.txt]
>
> It also updates the {{ant patch-snowball}} target to be compatible with
> the Java classes generated by the latest Snowball version (tree:
> 1964ce688cbeca505263c8f77e16ed923296ce7a). The {{ant patch-snowball}} target
> is backward compatible with the version of the Snowball stemmers used in
> Lucene 7.x and ignores already patched classes.
>
> Link to the corresponding Github PR:
> [https://github.com/apache/lucene-solr/pull/449]
> Edited: updated the corpus link, PR link and description
>