Re: [TEXT] How do we want to handle case conversions?

Bruno P. Kinoshita Mon, 22 May 2017 04:19:31 -0700

I'd be in favour of 2, or some variation of it: provide a well-documented naïve 
implementation that uses whatever the JVM offers for handling upper/lower case.
I would use it for very simple cases, where all I need is to capitalise each 
word, and where occasional mistakes are acceptable for text that is not in 
English, or for special cases such as a mathematical symbol (e.g. U+1D52B 
MATHEMATICAL FRAKTUR SMALL N, which also needs a surrogate pair in UTF-16, to 
make it even more interesting).
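
To illustrate why surrogates make this "more interesting", here is a minimal 
sketch using only java.lang. The class and method names are hypothetical and 
are not part of [text]. It contrasts upper-casing the first UTF-16 char with 
upper-casing the first code point. U+1D52B has no case mapping at all, so the 
sketch also uses U+10428 (DESERET SMALL LETTER LONG I), a supplementary 
character that does have one.

```java
final class SurrogateCaseDemo {
    private SurrogateCaseDemo() {}

    // Upper-cases only the first UTF-16 char. For a supplementary character
    // that char is a high surrogate, which has no case mapping, so the text
    // comes back unchanged.
    static String upperFirstByChar(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Upper-cases the first code point, which may span two chars.
    static String upperFirstByCodePoint(String s) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        int first = s.codePointAt(0);
        return new StringBuilder(s.length())
                .appendCodePoint(Character.toUpperCase(first))
                .append(s, Character.charCount(first), s.length())
                .toString();
    }

    public static void main(String[] args) {
        String frakturN = "\uD835\uDD2B"; // U+1D52B as a surrogate pair
        System.out.println(frakturN.length());                              // chars
        System.out.println(frakturN.codePointCount(0, frakturN.length())); // code points

        String deseret = "\uD801\uDC28"; // U+10428 as a surrogate pair
        System.out.println(upperFirstByChar(deseret).equals(deseret));
        System.out.println(upperFirstByCodePoint(deseret).codePointAt(0) == 0x10400);
    }
}
```

A naïve implementation that iterates over code points rather than chars 
avoids this particular pitfall, although it still knows nothing about locales.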


For cases where I have to take care of different languages (e.g. the ch digraph 
in Czech) I would probably use ICU.


For cases that depend on the country, context, or some other feature (e.g. 
Dutch names with the "van" preposition) I would probably look at OpenNLP with 
a machine-learning or rule-based approach.

The issue is that, today, when all I need is the very simple approach, I have 
to write a for-loop or Java 8 stream to split the text, call toUpperCase on 
the first char of each word, write tests for it, and so on. I think for this 
case it would still be worth having our simple implementation in [text], with 
docs explaining what it is capable of, and what it is not.
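
As a rough sketch of what such a simple, documented implementation might look 
like (the class name SimpleCapitalizer is hypothetical, and this is not the 
actual [text] code), the following walks the input by code point and 
title-cases the first code point after any whitespace. It assumes words are 
separated by whitespace and deliberately ignores locale and context.

```java
final class SimpleCapitalizer {
    private SimpleCapitalizer() {}

    /**
     * Title-cases the first code point of every whitespace-delimited word.
     * Limitations: no locale awareness, no digraph handling (Dutch "ij",
     * Czech "ch"), no special treatment of name particles such as "van".
     */
    static String capitalize(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        boolean atWordStart = true;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                // Whitespace is copied as-is and marks the next word start.
                atWordStart = true;
                out.appendCodePoint(cp);
            } else if (atWordStart) {
                out.appendCodePoint(Character.toTitleCase(cp));
                atWordStart = false;
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("hello world"));
        System.out.println(capitalize("jan van der berg"));
    }
}
```

The second call capitalises every particle in the name, which is exactly the 
kind of limitation the docs would need to spell out.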

Cheers
Bruno

[1] https://codepoints.net/U+1D52B?lang=en
[2] https://en.wikipedia.org/wiki/Ch_(digraph)#Czech
[3] https://en.wikipedia.org/wiki/Van_(Dutch)#Collation_and_capitalisation
________________________________
From: Duncan Jones <[email protected]>
To: Commons Developers List <[email protected]> 
Sent: Monday, 22 May 2017 12:06 AM
Subject: [TEXT] How do we want to handle case conversions?



Hi everyone,


I’ve found some time to continue breaking WordUtils into separate classes 
(eschewing the “big collection of static methods” approach). However, as I read 
more about case handling in Unicode, I realise how simplistic the WordUtils 
methods are and how complex a full solution would need to be.


Section 5.18 of the Unicode specification [1] describes these complexities. The 
main ones that bother me are:


1. Title case conversions vary widely between different locales and languages. 
I’m not clear whether any locale is satisfied by the current simplistic 
implementation in WordUtils.capitalize(str). Supporting this correctly would be 
a serious challenge.


2. All types of case conversion may vary depending upon context/locale. There 
are examples provided in [1] where the outcome is different in a Turkish locale, 
or depends on whether the letter in question is followed by another letter.
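
The locale dependence in point 2 can be seen with the JDK's own String 
methods. The sketch below is only an illustration (the class and helper names 
are hypothetical): it upper- and lower-cases the same text under the root 
locale and under Turkish, where dotted and dotless i are distinct letters. 
The last line lower-cases a Greek word ending in capital sigma, the 
context-sensitive case described in [1]. No expected output is stated for it 
here.

```java
import java.util.Locale;

final class LocaleCaseDemo {
    private LocaleCaseDemo() {}

    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        // Root locale: i <-> I
        System.out.println(upper("title", Locale.ROOT));
        // Turkish: i -> U+0130 (capital I with dot above)
        System.out.println(upper("title", TURKISH));
        // Turkish: I -> U+0131 (dotless small i)
        System.out.println(lower("TITLE", TURKISH));
        // Greek word ending in capital sigma (U+03A3); the lower-case form
        // of a final sigma depends on its position in the word.
        System.out.println(lower("\u039F\u0394\u039F\u03A3", Locale.ROOT));
    }
}
```

Passing an explicit Locale, rather than relying on the JVM default, at least 
makes the behaviour of a simple implementation reproducible across machines.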


Does anyone have a suggestion for how to move forward with this work? I see 
three options: 1] Admit defeat and avoid the case conversion mess entirely. 2] 
Mimic the existing functionality, but document the limitations. 3] Attempt to 
deliver a locale-dependent version, perhaps still with limitations (or for 
certain languages).


I’m leaning towards 2, perhaps even calling the classes “SimpleX…”.


Thanks,

Duncan



[1] http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf

---------------------------------------------------------------------

To unsubscribe, e-mail: [email protected]

For additional commands, e-mail: [email protected]
