Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

Joe Wang Wed, 22 Jul 2020 14:33:14 -0700


On 7/22/20 1:43 PM, [email protected] wrote:

Thanks Roger,

Ah, I just saw your email just after I sent mine!


They probably saw each other crossing and said hi on the way to inboxes ;-)

On 7/22/20 1:38 PM, Roger Riggs wrote:
Hi Naoto,

Looks fine. (with Joe's suggestion)

On 7/22/20 4:20 PM, Joe Wang wrote:
Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)
The only minor comment I have is that the compareCodePointCI method performs toUpperCase unconditionally. That's not a problem for the regular case, where a check on cp1 == cp2 (line 337) is done prior to the method call. But for the sup case (starting at line 341), the method is called unconditionally while in webrev.04 there was a check "cp1 != cp2". One option to fix it is to include the "cp1 != cp2" check in the method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.
I would have added to line 353 the same cp1 == cp2 check as 337 to avoid the method call unless it was needed.
As in the previous email, cp1 != cp2 at that point, since either high or low surrogates differ.
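
For readers without the webrev at hand, the option Joe describes (folding the equality check into the helper) might look roughly like the sketch below. This is an illustration only, not the code in StringUTF16.java: the class name is hypothetical, and the upper-then-lower mapping order is an assumption borrowed from how String's case-insensitive comparison has traditionally worked for BMP characters.

```java
final class CodePointCompareSketch {
    // Hypothetical helper (not the actual JDK method): compares two code
    // points case-insensitively and skips case mapping when they are equal.
    static int compareCodePointCI(int cp1, int cp2) {
        if (cp1 == cp2) {
            return 0; // identical code points need no case mapping
        }
        cp1 = Character.toUpperCase(cp1);
        cp2 = Character.toUpperCase(cp2);
        if (cp1 != cp2) {
            // Assumption: also try lower case, since a few characters only
            // match after both mappings have been applied.
            cp1 = Character.toLowerCase(cp1);
            cp2 = Character.toLowerCase(cp2);
            if (cp1 != cp2) {
                return cp1 - cp2;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // BMP pair, then a supplementary (Deseret) upper/lower pair.
        System.out.println(compareCodePointCI('a', 'A'));
        System.out.println(compareCodePointCI(0x10400, 0x10428));
        System.out.println(compareCodePointCI('a', 'b'));
    }
}
```

With the check inside the helper, callers on both the BMP path and the surrogate path can call it without a guard of their own, at the cost of one extra comparison when the caller already knows the code points differ.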


Makes sense. The webrev looks good to me.

-Joe

Naoto
Roger
Regards,
Joe

On 7/22/20 10:23 AM, [email protected] wrote:
Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):
- The short cut now does case insensitive comparison that makes the fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index increment.
- Method name is changed to better reflect what it is doing, with more descriptive comments.

Here are the benchmark results:

before:
Benchmark                                Mode  Cnt    Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25   49.960 ± 1.923  ns/op
StringCompareToIgnoreCase.supLower       avgt   25   21.003 ± 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   30.863 ± 4.529  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25   15.417 ± 1.046  ns/op

after:
Benchmark                                Mode  Cnt    Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25   46.857 ± 0.524  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  148.688 ± 6.546  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ± 0.259  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25   15.126 ± 0.338  ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par with the "before" result (I am not quite sure why the "after" results are somewhat faster though). For supplementary test cases, "supLower" is very slow. The reason is twofold: one is that the "before" one exits at the very first character (which I am addressing here) while "after" continues to compare to the last characters; the other is that the test suffers from the change where supplementary cases double the case insensitivity checks (compared to the "after" result just below). Also "supUpperLower" gets slower for the same reason. These are expected results for supplementary comparisons (as we discussed).

Naoto

On 7/17/20 4:36 PM, [email protected] wrote:
Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:
- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here are the scores before and after the change:

before:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  53.764 ± 2.811  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  24.211 ± 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ± 1.344  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ± 1.499  ns/op

after:
Benchmark                                Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower          avgt   25  58.354 ± 4.603  ns/op
StringCompareToIgnoreCase.supLower       avgt   25  57.975 ± 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ± 0.965  ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ± 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" means each character requires upper->lower casemap. "upperLower" means all characters are the same, except the last character which requires casemap.
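
To make the four input shapes concrete, here is a rough sketch of how such string pairs could be built. The actual benchmark source is not quoted in this thread, so the class name, helper names, and the choice of Deseret letters (U+10400 / U+10428) as the supplementary case pair are all assumptions made for illustration.

```java
final class BenchmarkInputsSketch {
    // Assumed sample characters: a BMP case pair and a supplementary one.
    static final String BMP_UPPER = "A";
    static final String BMP_LOWER = "a";
    static final String SUP_UPPER = new String(Character.toChars(0x10400));
    static final String SUP_LOWER = new String(Character.toChars(0x10428));

    // Repeats s n times (String.repeat would also do on JDK 11 and later).
    static String times(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    // "lower" shape: every character differs in case between the two strings.
    static String[] lowerPair(String upper, String lower, int n) {
        return new String[] { times(upper, n), times(lower, n) };
    }

    // "upperLower" shape: identical except for the last character,
    // which is the only one that needs a case mapping.
    static String[] upperLowerPair(String upper, String lower, int n) {
        return new String[] { times(upper, n), times(upper, n - 1) + lower };
    }

    public static void main(String[] args) {
        String[] lower = lowerPair(BMP_UPPER, BMP_LOWER, 8);
        String[] supUpperLower = upperLowerPair(SUP_UPPER, SUP_LOWER, 8);
        System.out.println(lower[0] + " / " + lower[1]);
        System.out.println(supUpperLower[0].codePointCount(0, supUpperLower[0].length()));
    }
}
```

Passing the supplementary constants to lowerPair() and upperLowerPair() gives the "supLower" and "supUpperLower" shapes respectively.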
I think the result is reasonable, considering surrogate checks are now mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but it did not seem to benefit here. In fact, the performance degraded partly because I implemented the short cut, and possibly for the overhead of extra checks.
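
For context, the Arrays.mismatch() idea amounts to locating the first differing code unit in bulk and starting the case-insensitive work from there. The sketch below shows the general shape only; the class and method names are hypothetical, it works on char[] rather than the byte[] storage that StringUTF16 actually uses, and, as noted above, this approach did not pay off in the measurements.

```java
import java.util.Arrays;

final class MismatchSketch {
    // Returns the index of the first differing UTF-16 code unit, moved back
    // to the start of a surrogate pair if the difference falls on its low
    // half. Returns -1 when the arrays are identical.
    static int firstDifference(char[] a, char[] b) {
        int i = Arrays.mismatch(a, b);
        if (i > 0 && i < a.length
                && Character.isLowSurrogate(a[i])
                && Character.isHighSurrogate(a[i - 1])) {
            i--; // compare the whole code point, not half of it
        }
        return i;
    }

    public static void main(String[] args) {
        char[] upper = ("x" + new String(Character.toChars(0x10400))).toCharArray();
        char[] lower = ("x" + new String(Character.toChars(0x10428))).toCharArray();
        System.out.println(firstDifference(upper, lower));
    }
}
```

The surrogate adjustment is one of the "extra checks" that such an approach needs before per-code-point case mapping can begin.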
Naoto

On 7/15/20 9:00 AM, [email protected] wrote:
Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664
A bug was filed against SimpleDateFormat (8248434) where case-insensitive date format/parse failed in some of the new locales in JDK15. The root cause was that the case-insensitive String.regionMatches() method did not work with supplementary characters. The problem is that the method's spec does not expect case mappings of supplementary characters, possibly because it was overlooked in the first place, JSR 204 - "Unicode Supplementary Character support". Similar behavior is observed in the other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase().
The fix is straightforward: compare strings on a code point basis, instead of a code unit (16-bit "char") basis. Technically this change will introduce a backward incompatibility, but I believe it is an incompatibility to wrong behavior, not true to the meaning of those methods' expectations.
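
The difference between the two bases can be illustrated with a small standalone sketch. Nothing here is taken from the webrev; the class and method names are hypothetical, and the Deseret letters U+10400 / U+10428 are just one supplementary case pair chosen as an example.

```java
final class CodeUnitVsCodePointSketch {
    // Per-code-unit comparison: case mapping is applied to each 16-bit char,
    // so the halves of a surrogate pair are mapped (to themselves) separately.
    static boolean equalsIgnoreCaseByChar(String s1, String s2) {
        if (s1.length() != s2.length()) {
            return false;
        }
        for (int i = 0; i < s1.length(); i++) {
            char c1 = Character.toUpperCase(s1.charAt(i));
            char c2 = Character.toUpperCase(s2.charAt(i));
            if (c1 != c2 && Character.toLowerCase(c1) != Character.toLowerCase(c2)) {
                return false;
            }
        }
        return true;
    }

    // Per-code-point comparison: surrogate pairs are decoded first, so
    // supplementary characters get their real case mapping.
    static boolean equalsIgnoreCaseByCodePoint(String s1, String s2) {
        int i = 0;
        int j = 0;
        while (i < s1.length() && j < s2.length()) {
            int cp1 = s1.codePointAt(i);
            int cp2 = s2.codePointAt(j);
            if (cp1 != cp2) {
                int u1 = Character.toUpperCase(cp1);
                int u2 = Character.toUpperCase(cp2);
                if (u1 != u2 && Character.toLowerCase(u1) != Character.toLowerCase(u2)) {
                    return false;
                }
            }
            i += Character.charCount(cp1); // 1 for BMP, 2 for supplementary
            j += Character.charCount(cp2);
        }
        return i == s1.length() && j == s2.length();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(equalsIgnoreCaseByChar(upper, lower));      // prints "false"
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower)); // prints "true"
    }
}
```

The per-char version fails because the low surrogates 0xDC00 and 0xDC28 have no case mapping of their own; only the decoded code points do.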
Naoto
