> On May 22, 2017, at 6:04 AM, sebb <seb...@gmail.com> wrote:
> 
> On 22 May 2017 at 06:56, Duncan Jones <dun...@wortharead.com> wrote:
>> 
>>> On 21 May 2017, at 19:43, Gary Gregory <garydgreg...@gmail.com> wrote:
>>> 
>>> Pardon the obvious but what is missing from methods like
>>> https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)
>>> 
>>> Gary
>> 
>> 
>> The WordUtils methods turn sentences into title case, which Java’s core 
>> libraries don’t offer. In fact, the core libraries make doing 
>> locale-sensitive title case conversions very difficult (see 
>> http://stackoverflow.com/questions/7360996/unicode-correct-title-case-in-java
>>  for example).
>> 
>> Doing title casing correctly is quite a subtle art. We don’t even do it 
>> correctly for English at the moment, which would normally capitalise “The 
>> Life of Reilly” rather than “The Life Of Reilly”. Other languages have 
>> completely different conventions or additional complexities.
>> 
> 
> However the Javadoc does state that the capitalisation is based on
> words, not sentences.
> So I don't know if there is any expectation that it will take account
> of the meaning of the words.
> 
> I guess the question is whether that is useful at all?
> If so, we should clarify that the processing takes no account of the
> meaning of the words.
> If not, we should perhaps drop the methods.
> 
> I think it will be a huge effort to produce anything that works
> properly even for US English, let alone UK English.
> 

I agree about the level of effort needed to capitalize anything properly in a 
semantic fashion without falling back on approximations. The only clear way to 
do capitalization deterministically is simply to rely on delimiters. 
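
To make the delimiter-only idea concrete, here is a rough sketch. The class and 
method names are made up for illustration and this is not the WordUtils code. 
It assumes that whitespace is the delimiter when none is given, and it 
deliberately knows nothing about what the words mean:

```java
public final class DelimiterCapitalizer {

    private DelimiterCapitalizer() {
    }

    /**
     * Title-cases the first code point of the text and the first code point
     * after each delimiter. Every other character is copied unchanged.
     * With no delimiters supplied, any whitespace counts as a delimiter
     * (an assumption made for this sketch).
     */
    public static String capitalize(String text, char... delimiters) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        boolean capitalizeNext = true;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isDelimiter(cp, delimiters)) {
                out.appendCodePoint(cp);
                capitalizeNext = true;
            } else if (capitalizeNext) {
                out.appendCodePoint(Character.toTitleCase(cp));
                capitalizeNext = false;
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by code point so supplementary characters stay intact.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isDelimiter(int cp, char[] delimiters) {
        if (delimiters == null || delimiters.length == 0) {
            return Character.isWhitespace(cp);
        }
        for (char d : delimiters) {
            if (d == cp) {
                return true;
            }
        }
        return false;
    }
}
```

With this, capitalize("the life of reilly") gives "The Life Of Reilly", and 
capitalize("o'toole", ' ', '\'') gives "O'Toole", but the same delimiters turn 
"don't" into "Don'T" and nothing here can produce "MacDonald". The result is 
predictable, and that is about all that can be said for it.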

Admitting defeat (for Commons) isn’t an unreasonable option, with the 
possibility of putting the bulk of the work in OpenNLP. That seems a better 
venue for such an algorithm, because the mechanics of language determination 
are present there and not here.

> Names will be a particular problem: ee cummings, D'Ath, O'Toole, MacDonald
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
> 


