[ 
https://issues.apache.org/jira/browse/OPENNLP-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866694#comment-16866694
 ] 

ASF GitHub Bot commented on OPENNLP-1268:
-----------------------------------------

tballison commented on issue #356: OPENNLP-1268 -- fix StringUtil.toLowerCase() 
to work on codepoints, not chars
URL: https://github.com/apache/opennlp/pull/356#issuecomment-503182316
 
 
   I just confirmed that none of the current languages would be affected (not 
even those in the small set that I proposed adding on OPENNLP-1270), and, 
frankly, languages whose scripts are encoded beyond the BMP are rare (see, 
e.g., the [Supplementary Multilingual 
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane)
 and the [Supplementary Ideographic 
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Ideographic_Plane)).
   
   However, I noticed no difference in processing time between lowercasing by 
code point and lowercasing by char, and I feel that we should correctly handle 
lowercasing for languages that users might want to build their own models for.
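
   A minimal sketch of the difference between the two approaches follows. The 
class and method names are hypothetical and chosen for illustration only; this 
is not the code from the pull request. It uses U+10400 (DESERET CAPITAL LETTER 
LONG I), which lies beyond the BMP and is therefore stored as a surrogate pair 
of two chars.

```java
// Hypothetical illustration class; not part of OpenNLP.
final class CodePointLowerCase {

    // Lowercases char by char. Each half of a surrogate pair is passed to
    // Character.toLowerCase(char) on its own and comes back unchanged, so
    // characters beyond the BMP are never lowercased.
    static String toLowerCaseByChar(CharSequence text) {
        char[] out = new char[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = Character.toLowerCase(text.charAt(i));
        }
        return new String(out);
    }

    // Lowercases code point by code point, so a surrogate pair is read as
    // one code point and mapped with Character.toLowerCase(int).
    static String toLowerCaseByCodePoint(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "A" + new String(Character.toChars(0x10400));
        String byChar = toLowerCaseByChar(upper);
        String byCodePoint = toLowerCaseByCodePoint(upper);
        // Print the code points of each result in hex for comparison.
        byChar.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
        byCodePoint.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

   For text that stays within the BMP, both methods return the same result, 
which is consistent with no current language being affected.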
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> StringUtil.toLowerCase() should lowercase codepoints, not chars
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1268
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1268
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> {{StringUtil#toLowerCase()}} should run Character.toLowerCase() on code 
> points.  It currently fails to lowercase characters beyond the BMP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
