Jorge Cruanes created LUCENE-5986:
-------------------------------------

             Summary: Incorrect character folding in Arabic
                 Key: LUCENE-5986
                 URL: https://issues.apache.org/jira/browse/LUCENE-5986
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Jorge Cruanes


The method {{normalize(char s[], int len)}} in the class 
{{org.apache.lucene.analysis.ar.ArabicNormalizer}} folds characters 
incorrectly in Arabic. The incorrect folding affects the letters Teh 
Marbuta (U+0629) and Heh (U+0647) at the end of a word (according to 
El-Sherbiny et al., 2010, page 5).

To fix this bug, add an if clause so that the folding is applied only if 
the Teh Marbuta is not at the end of the word. The suggested new case code 
is as follows:
{quote}
case TEH_MARBUTA:
  if (i < (len-1))
    s [ i ] = HEH;
  break;
{quote}
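
For illustration, the guarded case can be exercised in isolation with a minimal standalone sketch. This is not the actual {{ArabicNormalizer}} source: the class name {{TehMarbutaFoldingSketch}} and the method {{foldTehMarbuta}} are hypothetical, and the sketch assumes the buffer holds exactly one token, so that index {{len - 1}} is the end of the word.

```java
public final class TehMarbutaFoldingSketch {
    // Code points named in the report: Teh Marbuta is U+0629, Heh is U+0647.
    public static final char TEH_MARBUTA = '\u0629';
    public static final char HEH = '\u0647';

    private TehMarbutaFoldingSketch() {}

    /**
     * Hypothetical helper: folds Teh Marbuta to Heh everywhere except in the
     * last position of the buffer. Assumes s[0..len) is a single word.
     * Returns the (unchanged) length, mirroring the normalize(char[], int) shape.
     */
    public static int foldTehMarbuta(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            switch (s[i]) {
                case TEH_MARBUTA:
                    // The guard proposed in the report: skip the word-final position.
                    if (i < (len - 1)) {
                        s[i] = HEH;
                    }
                    break;
                default:
                    break;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        // Synthetic buffer with Teh Marbuta in a non-final and a final position.
        char[] word = {'\u0645', TEH_MARBUTA, '\u0631', TEH_MARBUTA};
        int len = foldTehMarbuta(word, word.length);
        for (int i = 0; i < len; i++) {
            System.out.printf("U+%04X%n", (int) word[i]);
        }
    }
}
```

With this guard, a Teh Marbuta in the last position is left unchanged, while any earlier occurrence is still folded to Heh.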

References:
El-Sherbiny, A., Farah, M., Oueichek, I., Al-Zoman, A. (2010). Linguistic 
Guidelines for the Use of the Arabic Language in Internet Domains. Internet 
Society Request for Comments (RFC) 5564, pp. 1-11. Available at: 
http://tools.ietf.org/html/rfc5564



