[
https://issues.apache.org/jira/browse/OPENNLP-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866693#comment-16866693
]
ASF GitHub Bot commented on OPENNLP-1268:
-----------------------------------------
tballison commented on issue #356: OPENNLP-1268 -- fix StringUtil.toLowerCase()
to work on codepoints, not chars
URL: https://github.com/apache/opennlp/pull/356#issuecomment-503182316
I just confirmed that no current languages would be affected (not even those
in the small set that I proposed adding in OPENNLP-1270), and, frankly, the
languages that are covered by codepoints beyond the BMP are rare (see e.g.
[Supplementary Multilingual
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane)
and [Supplementary Ideographic
Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Ideographic_Plane)).
However, I noticed no difference in processing times for codepoints vs
chars, and I feel that we should correctly handle lowercasing for
languages that users might want to build their own models for.
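
For anyone following along, here is a minimal sketch of the difference between
lowercasing chars and lowercasing codepoints. The class and method names are
hypothetical and this is not the actual OpenNLP code; it only uses standard
`java.lang.Character` and `StringBuilder` calls. The sample input is U+10400
(DESERET CAPITAL LETTER LONG I), which lives in the Supplementary Multilingual
Plane and is therefore stored as a surrogate pair.

```java
// Illustrative sketch only; names are hypothetical, not OpenNLP's API.
final class CodePointLowerCase {

    // Lowercases one char at a time. Each half of a surrogate pair is
    // passed to Character.toLowerCase(char) on its own and comes back
    // unchanged, so letters beyond the BMP are never lowercased.
    public static String toLowerCaseByChar(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(Character.toLowerCase(text.charAt(i)));
        }
        return sb.toString();
    }

    // Lowercases one codepoint at a time, stepping over surrogate pairs
    // as a unit so supplementary-plane letters are handled too.
    public static String toLowerCaseByCodePoint(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        String byChar = toLowerCaseByChar(upper);
        String byCodePoint = toLowerCaseByCodePoint(upper);
        System.out.printf("by char:      U+%X%n", byChar.codePointAt(0));
        System.out.printf("by codepoint: U+%X%n", byCodePoint.codePointAt(0));
    }
}
```

For BMP-only text the two methods should return the same result, which is
consistent with no current languages being affected.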
> StringUtil.toLowerCase() should lowercase codepoints, not chars
> ---------------------------------------------------------------
>
> Key: OPENNLP-1268
> URL: https://issues.apache.org/jira/browse/OPENNLP-1268
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Trivial
>
> {{StringUtil#toLowerCase()}} should run Character.toLowerCase() on code
> points. It is currently failing to lowercase characters beyond the BMP.