[ 
https://issues.apache.org/jira/browse/LUCENE-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157842#comment-14157842
 ] 

Robert Muir commented on LUCENE-5986:
-------------------------------------

By the way, here is the paper: 
http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

It's referenced in the source code: this algorithm just implements the paper. 
It's not about opinions of what is right or wrong, or what is good or bad.

> Incorrect character folding in Arabic
> -------------------------------------
>
>                 Key: LUCENE-5986
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5986
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jorge Cruanes
>              Labels: easyfix
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> The function {{normalize(char s[], int len)}}, in the class 
> {{org.apache.lucene.analysis.ar.ArabicNormalizer}}, performs an incorrect 
> character folding in Arabic. The incorrect folding affects the letters Teh 
> Marbuta (U+0629) and Heh (U+0647) at the end of a word (according to the 
> study of El-Sherbiny et al., 2010, page 5).
> To fix this bug, add an if clause so that the folding is applied only when 
> the Teh Marbuta is not at the end of the word. The suggested code for the 
> new case is as follows:
> {quote}
> case TEH_MARBUTA:
>   if (i < (len-1))
>     s[i] = HEH;
>   break;
> {quote}
> References:
> El-Sherbiny, A., Farah, M., Oueichek, I., Al-Zoman, A. (2010) Linguistic 
> Guidelines for the Use of the Arabic Language in Internet Domains. Internet 
> Society Requests For Comment (RFCs) (5564). pp 1-11. Available at: 
> http://tools.ietf.org/html/rfc5564.
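
To make the two behaviors under discussion concrete, here is a minimal standalone sketch. It is not the Lucene source: the class name {{TehMarbutaFoldingSketch}}, the method {{foldTehMarbuta}} and its {{skipFinal}} flag are hypothetical names introduced only for illustration. With {{skipFinal}} set to false, every Teh Marbuta is folded to Heh; with it set to true, the condition proposed in the report ({{i < len - 1}}) leaves a word-final Teh Marbuta unchanged.

```java
// Hypothetical illustration only; this is not the Lucene ArabicNormalizer source.
final class TehMarbutaFoldingSketch {
    static final char TEH_MARBUTA = '\u0629';
    static final char HEH = '\u0647';

    /**
     * Folds Teh Marbuta to Heh in place within the first len chars of s.
     * When skipFinal is true, a Teh Marbuta at index len - 1 is left
     * unchanged, which mirrors the condition proposed in the report.
     */
    static void foldTehMarbuta(char[] s, int len, boolean skipFinal) {
        for (int i = 0; i < len; i++) {
            boolean isFinal = (i == len - 1);
            if (s[i] == TEH_MARBUTA && !(skipFinal && isFinal)) {
                s[i] = HEH;
            }
        }
    }

    public static void main(String[] args) {
        // Sample word (Meem, Dal, Reh, Seen, Teh Marbuta) ending in Teh Marbuta.
        char[] word = {'\u0645', '\u062F', '\u0631', '\u0633', '\u0629'};

        char[] alwaysFolded = word.clone();
        foldTehMarbuta(alwaysFolded, alwaysFolded.length, false);

        char[] finalKept = word.clone();
        foldTehMarbuta(finalKept, finalKept.length, true);

        // Print the last code point of each result in hex to avoid console encoding issues.
        System.out.println("fold everywhere: U+0" + Integer.toHexString(alwaysFolded[4]).toUpperCase());
        System.out.println("skip word-final: U+0" + Integer.toHexString(finalKept[4]).toUpperCase());
    }
}
```

The sketch isolates the single case under dispute; any other folding a real normalizer performs is deliberately left out.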



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
