I'd be in favour of 2, or some variation of it: provide a well-documented naïve implementation and use whatever the JVM offers for handling upper/lower case. I would use it for very simple cases, where all I need is to capitalise each word and where occasional mistakes are acceptable, for example with text that is not in English or with special cases such as a newer mathematical symbol (e.g. U+1D52B, MATHEMATICAL FRAKTUR SMALL N, which also needs a surrogate pair in UTF-16, to make it even more interesting).
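
For illustration, a naïve capitaliser of this kind might look like the sketch below. The class and method names are hypothetical and are not the [text] API; the sketch assumes that words are separated by whitespace and that the JVM's locale-independent Character.toUpperCase(int) is good enough. It walks the string by code point so that supplementary characters are not split into separate surrogates.

```java
final class SimpleCapitalizer {

    private SimpleCapitalizer() {
    }

    /**
     * Upper-cases the first code point of every whitespace-delimited word.
     * Deliberately naive: no locale, no digraphs, no context rules.
     */
    static String capitalizeWords(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean startOfWord = true;
        int i = 0;
        while (i < text.length()) {
            // Read a whole code point so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                startOfWord = true;
                out.appendCodePoint(cp);
            } else if (startOfWord) {
                out.appendCodePoint(Character.toUpperCase(cp));
                startOfWord = false;
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("hello world")); // prints "Hello World"
    }
}
```

Anything beyond this (locale rules, digraphs, name particles) is exactly what such a class would have to document as out of scope.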
For cases where I have to take care of different languages (e.g. the ch digraph in Czech) I would probably use ICU. For cases that depend on the country, context, or some other feature (e.g. Dutch names with the "van" preposition) I would probably look at OpenNLP, with a machine-learning or rule-based approach.

The issue is that when all I need is the very simple approach, I currently have to write something like a for-loop or Java 8 stream to split the text, call toUpperCase on the first char of each word, then write tests for it, and so on. For this case I think it would still be worth having our simple implementation in [text], with docs explaining what it can and cannot do.

Cheers
Bruno

https://codepoints.net/U+1D52B?lang=en
https://en.wikipedia.org/wiki/Ch_(digraph)#Czech
https://en.wikipedia.org/wiki/Van_(Dutch)#Collation_and_capitalisation

________________________________
From: Duncan Jones <dun...@wortharead.com>
To: Commons Developers List <dev@commons.apache.org>
Sent: Monday, 22 May 2017 12:06 AM
Subject: [TEXT] How do we want to handle case conversions?

Hi everyone,

I’ve found some time to continue breaking WordUtils into separate classes (eschewing the “big collection of static methods” approach). However, as I read more about case handling in Unicode, I realise how simplistic the WordUtils methods are and how complex a full solution would need to be.

Section 5.18 of the Unicode specification [1] describes these complexities. The main ones that bother me are:

1. Title case conversions vary widely between different locales and languages. I’m not clear whether any locale is satisfied by the current simplistic implementation in WordUtils.capitalize(str). Supporting this correctly would be a serious challenge.

2. All types of case conversion may vary depending upon context/locale. There are examples provided in [1] where the outcome is different in a Turkish locale, or depends on whether the letter in question is followed by another.
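
The Turkish case in point 2 can be reproduced on the JVM with a small sketch like the one below. The class and helper names are hypothetical; the only library calls used are String.toUpperCase(Locale), String.toLowerCase(Locale) and Locale.forLanguageTag. In a Turkish locale, "i" upper-cases to dotted capital İ (U+0130) and "I" lower-cases to dotless ı (U+0131).

```java
import java.util.Locale;

final class LocaleCaseDemo {

    private LocaleCaseDemo() {
    }

    /** Upper-cases s using the rules of the given BCP 47 language tag. */
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    /** Lower-cases s using the rules of the given BCP 47 language tag. */
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // Same input, different result depending on the locale.
        System.out.println(upper("title", "en")); // TITLE
        System.out.println(upper("title", "tr")); // T\u0130TLE (dotted capital I)
        System.out.println(lower("TITLE", "tr")); // t\u0131tle (dotless small i)
    }
}
```

A locale-free capitaliser gets the Turkish result wrong whichever way it is written, which is the kind of limitation that would need documenting.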

Does anyone have a suggestion for how to move forward with this work? I see three options:

1] Admit defeat and avoid the case conversion mess entirely.
2] Mimic the existing functionality, but document the limitations.
3] Attempt to deliver a locale-dependent version, perhaps still with limitations (or for certain languages).

I’m leaning towards 2, perhaps even calling the classes “SimpleX…”.

Thanks,

Duncan

[1] http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org