[ https://issues.apache.org/jira/browse/LUCENE-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157842#comment-14157842 ]
Robert Muir commented on LUCENE-5986:
-------------------------------------
By the way, here is the paper:
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf
It's referenced in the source code: this algorithm just implements the paper.
It's not about opinions of what is right and what is wrong, or what is good and
what is bad.
> Incorrect character folding in Arabic
> -------------------------------------
>
> Key: LUCENE-5986
> URL: https://issues.apache.org/jira/browse/LUCENE-5986
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Jorge Cruanes
> Labels: easyfix
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> The method {{normalize(char s[], int len)}}, in the class
> {{org.apache.lucene.analysis.ar.ArabicNormalizer}}, performs an incorrect
> character folding in Arabic. The incorrect folding affects the letters Teh
> Marbuta (U+0629) and Heh (U+0647) at the end of a word (according to the
> study of El-Sherbiny et al., 2010, page 5).
> To fix this bug, the solution is to insert an if clause so that the folding is
> performed only if the Teh Marbuta is not at the end of the word. The suggested
> new case code follows:
> {quote}
> case TEH_MARBUTA:
>   if (i < (len - 1))
>     s[i] = HEH;
>   break;
> {quote}
> References:
> El-Sherbiny, A., Farah, M., Oueichek, I., Al-Zoman, A. (2010) Linguistic
> Guidelines for the Use of the Arabic Language in Internet Domains. Internet
> Society Requests For Comment (RFCs) (5564). pp 1-11. Available at:
> http://tools.ietf.org/html/rfc5564.
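
For readers comparing the two behaviours under discussion, here is a minimal standalone sketch. It is not Lucene's ArabicNormalizer: the class name and both method names are hypothetical, only the Teh Marbuta case is modelled, and it assumes the char array holds exactly one word, so that index len - 1 is the word-final position.

```java
// Hypothetical illustration only; not part of Lucene.
final class TehMarbutaFoldingSketch {
    static final char TEH_MARBUTA = '\u0629';
    static final char HEH = '\u0647';

    // Folds every Teh Marbuta to Heh, regardless of its position in the word.
    static int foldEverywhere(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            if (s[i] == TEH_MARBUTA) {
                s[i] = HEH;
            }
        }
        return len;
    }

    // Variant proposed in the issue: fold only when the Teh Marbuta
    // is not the last character of the word.
    static int foldExceptAtEnd(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            if (s[i] == TEH_MARBUTA && i < len - 1) {
                s[i] = HEH;
            }
        }
        return len;
    }
}
```

Given a word that ends in Teh Marbuta, foldEverywhere rewrites the final character to Heh, while foldExceptAtEnd leaves it unchanged. Both methods modify the array in place and return the length unchanged, since the folding never adds or removes characters.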