[jira] [Commented] (OPENNLP-1268) StringUtil.toLowerCase() should lowercase codepoints, not chars

ASF GitHub Bot (JIRA) Tue, 18 Jun 2019 08:17:17 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866693#comment-16866693
 ]


ASF GitHub Bot commented on OPENNLP-1268:
-----------------------------------------

tballison commented on issue #356: OPENNLP-1268 -- fix StringUtil.toLowerCase() 
to work on codepoints, not chars
URL: https://github.com/apache/opennlp/pull/356#issuecomment-503182316
 
 
   I just confirmed that no current languages would be affected (not even those 
in the small set that I proposed adding in OPENNLP-1270), and, frankly, the 
languages that are covered by code points beyond the BMP are rare (see, e.g., the 
[Supplementary Multilingual 
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane)
 and the [Supplementary Ideographic 
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Ideographic_Plane)).
   
   However, I noticed zero difference in processing times for codepoints vs 
characters, and I feel that we should correctly handle lowercasing for 
languages that users might want to build their own models for.
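
   For readers who want to see the difference concretely, here is a minimal 
sketch of the two approaches. It is not the code from the pull request: the 
class name `CodePointLowerCase` and both helper methods are hypothetical and 
only use standard `java.lang.Character` calls. The sample character is U+10400 
(DESERET CAPITAL LETTER LONG I), which lives in the Supplementary Multilingual 
Plane and is stored as a surrogate pair in a Java `String`.

```java
// Illustrative sketch only; these helpers are hypothetical and are not
// the actual OpenNLP StringUtil implementation.
class CodePointLowerCase {

    // Lowercases one char at a time. Each half of a surrogate pair is
    // passed to Character.toLowerCase(char) on its own and comes back
    // unchanged, so characters beyond the BMP are never lowercased.
    static String toLowerCaseByChar(CharSequence text) {
        char[] out = new char[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = Character.toLowerCase(text.charAt(i));
        }
        return new String(out);
    }

    // Lowercases one code point at a time. A surrogate pair is decoded
    // into a single int code point before Character.toLowerCase(int)
    // is applied, then re-encoded by appendCodePoint.
    static String toLowerCaseByCodePoint(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I (outside the BMP)
        String upper = new String(Character.toChars(0x10400));
        // U+10428 DESERET SMALL LETTER LONG I
        String lower = new String(Character.toChars(0x10428));

        System.out.println("by char lowercased:       "
                + toLowerCaseByChar(upper).equals(lower));
        System.out.println("by code point lowercased: "
                + toLowerCaseByCodePoint(upper).equals(lower));
    }
}
```

   With this input the char-based helper returns the string unchanged, while 
the code point helper returns the lowercase letter. For text that stays inside 
the BMP, both helpers give the same result.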
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> StringUtil.toLowerCase() should lowercase codepoints, not chars
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1268
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1268
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> {{StringUtil#toLowerCase()}} should run Character.toLowerCase() on code 
> points.  It is currently failing to lowercase characters beyond the BMP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
