Jorge Cruanes created LUCENE-5986:
-------------------------------------
Summary: Incorrect character folding in Arabic
Key: LUCENE-5986
URL: https://issues.apache.org/jira/browse/LUCENE-5986
Project: Lucene - Core
Issue Type: Bug
Reporter: Jorge Cruanes
The method {{normalize(char s[], int len)}}, in the class
{{org.apache.lucene.analysis.ar.ArabicNormalizer}}, folds Arabic characters
incorrectly. The incorrect folding affects the letters Teh Marbuta (U+0629)
and Heh (U+0647) at the end of a word (see El-Sherbiny et al., 2010,
page 5).
To fix this bug, add an if clause so that the folding is applied only when
the Teh Marbuta is not at the end of the word. The suggested code for the
new case is as follows:
{quote}
case TEH_MARBUTA:
  if (i < (len-1))
    s[i] = HEH;
  break;
{quote}
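For illustration, the suggested case can be placed in a small standalone loop to show its effect. This is only a sketch and not the Lucene source: the class name {{TehMarbutaFoldSketch}} is hypothetical, only the Teh Marbuta case is handled, and it assumes that the buffer {{s[0..len)}} holds exactly one word, so that "end of the word" means index {{len-1}}.

```java
// Illustrative sketch only; not the Lucene source.
// The class name is hypothetical and only the Teh Marbuta case is shown.
final class TehMarbutaFoldSketch {
    static final char TEH_MARBUTA = '\u0629';
    static final char HEH = '\u0647';

    /**
     * Folds Teh Marbuta to Heh everywhere except at the last position.
     * Assumption: s[0..len) holds exactly one word.
     * Returns the length of the buffer, which this sketch never changes.
     */
    static int normalize(char[] s, int len) {
        for (int i = 0; i < len; i++) {
            switch (s[i]) {
                case TEH_MARBUTA:
                    // Suggested change: leave a word-final Teh Marbuta untouched.
                    if (i < len - 1) {
                        s[i] = HEH;
                    }
                    break;
                default:
                    break;
            }
        }
        return len;
    }
}
```

With this sketch, a buffer ending in Teh Marbuta keeps its last character, while a Teh Marbuta at any earlier index is still replaced by Heh.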
References:
El-Sherbiny, A., Farah, M., Oueichek, I., Al-Zoman, A. (2010) Linguistic
Guidelines for the Use of the Arabic Language in Internet Domains. Internet
Society Request for Comments (RFC) 5564. pp. 1-11. Available at:
http://tools.ietf.org/html/rfc5564.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)