Re: RFR: JDK-8032012, , String.toLowerCase/toUpperCase performance improvement

Xueming Shen Wed, 05 Feb 2014 10:30:47 -0800

Hi,

Let's try to wrap it up, otherwise I may drop the ball somewhere :-)


On 01/22/2014 07:20 AM, Paul Sandoz wrote:

if (lang == "tr" || lang == "az" || lang == "lt") {
   // local dependent
   return toLowerCaseEx(result, firstUpper, locale, true);
}
// otherwise false is passed to subsequent calls to toLowerCaseEx

?


toLowerCaseEx will also be invoked later (in your another suggestion next), 
which
needs a "localeDependent".

- you might be able to optimize by doing (could depend on the answer to the 
next point):

  int c = (int)value[i];
  int lc = Character.toLowerCase(c);
  if (.....) { result[i] = (char)lc; } else { return toLowerCaseEx(result, i, 
locale, localeDependent); }

- Do you need to check ERROR for the result of toLowerCase?

2586             if (c == Character.ERROR ||

Yes, Character.toLowerCase() should never return ERROR (while the package 
private
Character.toUpperCaseEx() will). In theory there is no need to check if the 
return
value of  Character.toUpperCase(int)>  min_supplementary_code_point in our loop,
because there is no bmp character returns a supplementary code point as its 
lower
case. But since it's a data driven mapping table, there is no guarantee the 
unicode
data table is not going to change in the "future", so I still keep the check. 
The webrev
has been updated to use one single "if" inside the loop.

I have a "single if" implementation for the toUpperCase() as well, though don't 
sure
which one is better/faster :-)

OK, i suppose one could measure :-)

Not sure how much it is worth obsessing over but...

   int c = (int)value[i];
   if (c<  '\u03A3') {
     result[i] = (char)Character.toLowerCase(c); // Is that safe?
   } else if (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  (c = 
Character.toLowerCase(c))<  Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
     result[i] = (char)c;
   } else {
     return toLowerCaseEx(result, i, locale, localeDependent);
   }

or:

   int c = (int)value[i];
   int lc = Character.toLowerCase(c); // is that safe?
   if (c<  '\u03A3') {
     result[i] = (char)lc;
   } else if (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc<  
Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
     result[i] = (char)lc;
   } else {
     return toLowerCaseEx(result, i, locale, localeDependent);
   }

or:

   int c = (int)value[i];
   int lc = Character.toLowerCase(c); // is that safe?
   if (c<  '\u03A3' || (c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc< 
 Character.MIN_SUPPLEMENTARY_CODE_POINT))) {
     result[i] = (char)lc;
   } else {
     return toLowerCaseEx(result, i, locale, localeDependent);
   }

FWIW i personally find those solutions easier to read, if they are safe w.r.t. 
Character.toLowerCase and that annoying greek character.


The existing code has the clear and explicit logic that

(1) j.l.C.toLowerCase() can NOT handle surrogate(s)
(2) j.l.C.toLowerCase() can NOT handle \u03A3
(3) if for some reasons j.l.C.toLowerCase() returns a supplementary character
     for a non-bmp character, the current "one-char" based result can NOT handle
     it,
So in above cases, go to "ex" version.

The suggested approach is kinda of hacky and the logic is not clear.
For example it tries to optimize with c < \u03a3 cases, it works with the 
assumption
that all lower cases of < \u03a3 characters are never a supplementary 
characters,
which is true for now and probably will be true forever, but it actually 
changes the
design and implementation logic (be driven by the backend unicode case mapping
data, not an "assumption").

The

(c<  Character.MIN_HIGH_SURROGATE&&  c != 'u03A3'&&  lc<  
Character.MIN_SUPPLEMENTARY_CODE_POINT)

part appears to be fine, but its logic is not as explicit as the
existing one.

Given the measurement indicates it does not bring in measurable
improvement, personally I would prefer to keep the existing one,
if you don't have a very strong opinion on this.

http://cr.openjdk.java.net/~sherman/8032012/

Thanks!
-Sherman

Re: RFR: JDK-8032012, , String.toLowerCase/toUpperCase performance improvement

Reply via email to