Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread Brent Christian

Hi, Naoto

The latest changes look good to me.

-Brent

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the fix 
closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with more 
descriptive comments.




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread Joe Wang



On 7/22/20 1:43 PM, naoto.s...@oracle.com wrote:

Thanks Roger,

Ah, I just saw your email just after I sent mine!


They probably saw each other crossing and said hi on the way to inboxes ;-)


On 7/22/20 1:38 PM, Roger Riggs wrote:

Hi Naoto,

Looks fine. (with Joe's suggestion)

On 7/22/20 4:20 PM, Joe Wang wrote:

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior 
to the method call. But for the sup case (starting at line 341), the 
method is called unconditionally while in webrev.04 there was a 
check "cp1 != cp2".  One option to fix it is to include the "cp1 != 
cp2" check in the method compareCodePointCI, then cp1 == cp2 at line 
337 can be omitted.
I would have added to line 353 the same cp1 == cp2 check as 337 to 
avoid the method call

unless it was needed.


As in the previous email, cp1 != cp2 at that point, since either high 
or low surrogates differ.


Make sense. The webrev looks good to me.

-Joe



Naoto



Roger



Regards,
Joe

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the 
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with 
more descriptive comments.


Here is the benchmark results:

before:
Benchmark    Mode  Cnt   Score Error  
Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046 
ns/op


after:
Benchmark    Mode  Cnt    Score Error  
Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338 
ns/op


Now non-supplementary operations ("lower" and "upperLower") are on 
par with the "before" result (I am not quite sure why the "after" 
results are somewhat faster though). For supplementary test cases, 
"supLower" is very slow. The reason is two fold; one is because 
"before" one exits at the very first character (which I am 
addressing here) while "after" continues to compare to the last 
characters, the other reason is the test suffers from the change 
where supplementary cases double the case insensitivity checks 
(compared to the "after" result just below). Also "supUpperLower" 
gets slower for the same reason. These are expected results for 
supplementary comparisons (as we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and 
regionMatchesCI()

- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and 
after the change:


before:
Benchmark    Mode  Cnt   Score Error  
Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499 
ns/op


after:
Benchmark    Mode  Cnt   Score Error  
Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272 
ns/op


Here, "sup" means all supplementary characters, BMP otherwise. 
"lower" means each character requires upper->lower casemap. 
"upperLower" means all characters are the same, except the last 
character which requires casemap.


I think the result is reasonable, considering surrogates check are 
now mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but it did not seem to benefit here. In fact, 
the performance degraded partly because I implemented the short 
cut, and possibly for the overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread naoto . sato

Thanks Roger,

Ah, I just saw your email just after I sent mine!

On 7/22/20 1:38 PM, Roger Riggs wrote:

Hi Naoto,

Looks fine. (with Joe's suggestion)

On 7/22/20 4:20 PM, Joe Wang wrote:

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior to 
the method call. But for the sup case (starting at line 341), the 
method is called unconditionally while in webrev.04 there was a check 
"cp1 != cp2".  One option to fix it is to include the "cp1 != cp2" 
check in the method compareCodePointCI, then cp1 == cp2 at line 337 
can be omitted.
I would have added to line 353 the same cp1 == cp2 check as 337 to avoid 
the method call

unless it was needed.


As in the previous email, cp1 != cp2 at that point, since either high or 
low surrogates differ.


Naoto



Roger



Regards,
Joe

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the 
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with 
more descriptive comments.


Here is the benchmark results:

before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923 ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354 ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529 ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046 ns/op

after:
Benchmark    Mode  Cnt    Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338 
ns/op


Now non-supplementary operations ("lower" and "upperLower") are on 
par with the "before" result (I am not quite sure why the "after" 
results are somewhat faster though). For supplementary test cases, 
"supLower" is very slow. The reason is two fold; one is because 
"before" one exits at the very first character (which I am addressing 
here) while "after" continues to compare to the last characters, the 
other reason is the test suffers from the change where supplementary 
cases double the case insensitivity checks (compared to the "after" 
result just below). Also "supUpperLower" gets slower for the same 
reason. These are expected results for supplementary comparisons (as 
we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and 
after the change:


before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499 
ns/op


after:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603 
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272 
ns/op


Here, "sup" means all supplementary characters, BMP otherwise. 
"lower" means each character requires upper->lower casemap. 
"upperLower" means all characters are the same, except the last 
character which requires casemap.


I think the result is reasonable, considering surrogates check are 
now mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but it did not seem to benefit here. In fact, the 
performance degraded partly because I implemented the short cut, and 
possibly for the overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread Roger Riggs

Hi Naoto,

Looks fine. (with Joe's suggestion)

On 7/22/20 4:20 PM, Joe Wang wrote:

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior to 
the method call. But for the sup case (starting at line 341), the 
method is called unconditionally while in webrev.04 there was a check 
"cp1 != cp2".  One option to fix it is to include the "cp1 != cp2" 
check in the method compareCodePointCI, then cp1 == cp2 at line 337 
can be omitted.
I would have added to line 353 the same cp1 == cp2 check as 337 to avoid 
the method call

unless it was needed.

Roger



Regards,
Joe

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the 
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with 
more descriptive comments.


Here is the benchmark results:

before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046  
ns/op


after:
Benchmark    Mode  Cnt    Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338  
ns/op


Now non-supplementary operations ("lower" and "upperLower") are on 
par with the "before" result (I am not quite sure why the "after" 
results are somewhat faster though). For supplementary test cases, 
"supLower" is very slow. The reason is two fold; one is because 
"before" one exits at the very first character (which I am addressing 
here) while "after" continues to compare to the last characters, the 
other reason is the test suffers from the change where supplementary 
cases double the case insensitivity checks (compared to the "after" 
result just below). Also "supUpperLower" gets slower for the same 
reason. These are expected results for supplementary comparisons (as 
we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and 
after the change:


before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499  
ns/op


after:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272  
ns/op


Here, "sup" means all supplementary characters, BMP otherwise. 
"lower" means each character requires upper->lower casemap. 
"upperLower" means all characters are the same, except the last 
character which requires casemap.


I think the result is reasonable, considering surrogates check are 
now mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but it did not seem to benefit here. In fact, the 
performance degraded partly because I implemented the short cut, and 
possibly for the overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread naoto . sato

Hi Joe,

Thank you for the consecutive reviews!

On 7/22/20 1:20 PM, Joe Wang wrote:

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior to 
the method call. But for the sup case (starting at line 341), the method 
is called unconditionally while in webrev.04 there was a check "cp1 != 
cp2".  One option to fix it is to include the "cp1 != cp2" check in the 
method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.


That was intentional, as at the point when it calls compareCodePointCI() 
for the second time, it is guaranteed that the supplementary code points 
differ because either their high surrogates or low surrogates differ.


Naoto



Regards,
Joe

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the 
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with more 
descriptive comments.


Here is the benchmark results:

before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046  ns/op

after:
Benchmark    Mode  Cnt    Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524 ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546 ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259 ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338 ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par 
with the "before" result (I am not quite sure why the "after" results 
are somewhat faster though). For supplementary test cases, "supLower" 
is very slow. The reason is two fold; one is because "before" one 
exits at the very first character (which I am addressing here) while 
"after" continues to compare to the last characters, the other reason 
is the test suffers from the change where supplementary cases double 
the case insensitivity checks (compared to the "after" result just 
below). Also "supUpperLower" gets slower for the same reason. These 
are expected results for supplementary comparisons (as we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after 
the change:


before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811 ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135 ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499 ns/op

after:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603 ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672 ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272 ns/op

Here, "sup" means all supplementary characters, BMP otherwise. 
"lower" means each character requires upper->lower casemap. 
"upperLower" means all characters are the same, except the last 
character which requires casemap.


I think the result is reasonable, considering surrogates check are 
now mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but it did not seem to benefit here. In fact, the 
performance degraded partly because I implemented the short cut, and 
possibly for the overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread Joe Wang

Hi Naoto,

The change looks good to me. "supLower" is indeed super slow :-)

The only minor comment I have is that the compareCodePointCI method 
performs toUpperCase unconditionally. That's not a problem for the 
regular case, where a check on cp1 == cp2 (line 337) is done prior to 
the method call. But for the sup case (starting at line 341), the method 
is called unconditionally while in webrev.04 there was a check "cp1 != 
cp2".  One option to fix it is to include the "cp1 != cp2" check in the 
method compareCodePointCI, then cp1 == cp2 at line 337 can be omitted.


Regards,
Joe

On 7/22/20 10:23 AM, naoto.s...@oracle.com wrote:

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the 
fix closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with more 
descriptive comments.


Here is the benchmark results:

before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046  ns/op

after:
Benchmark    Mode  Cnt    Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338  
ns/op


Now non-supplementary operations ("lower" and "upperLower") are on par 
with the "before" result (I am not quite sure why the "after" results 
are somewhat faster though). For supplementary test cases, "supLower" 
is very slow. The reason is two fold; one is because "before" one 
exits at the very first character (which I am addressing here) while 
"after" continues to compare to the last characters, the other reason 
is the test suffers from the change where supplementary cases double 
the case insensitivity checks (compared to the "after" result just 
below). Also "supUpperLower" gets slower for the same reason. These 
are expected results for supplementary comparisons (as we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after 
the change:


before:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499  
ns/op


after:
Benchmark    Mode  Cnt   Score Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603  
ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672  
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  
ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272  
ns/op


Here, "sup" means all supplementary characters, BMP otherwise. 
"lower" means each character requires upper->lower casemap. 
"upperLower" means all characters are the same, except the last 
character which requires casemap.


I think the result is reasonable, considering surrogates check are 
now mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but it did not seem to benefit here. In fact, the 
performance degraded partly because I implemented the short cut, and 
possibly for the overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-22 Thread naoto . sato

Hi,

I revised the fix again, based on further suggestions:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.05/

Changes from v.04 are (all in StringUTF16.java):

- The short cut now does case insensitive comparison that makes the fix 
closer to the previous implementation (for BMP characters).
- Changed the bit operation to negating for detecting needed index 
increment.
- Method name is changed to better reflect what it is doing, with more 
descriptive comments.


Here is the benchmark results:

before:
BenchmarkMode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower  avgt   25  49.960 ? 1.923  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  21.003 ? 0.354  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.863 ? 4.529  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  15.417 ? 1.046  ns/op

after:
BenchmarkMode  CntScore   Error  Units
StringCompareToIgnoreCase.lower  avgt   25   46.857 ? 0.524  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  148.688 ? 6.546  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25   37.160 ? 0.259  ns/op
StringCompareToIgnoreCase.upperLower avgt   25   15.126 ? 0.338  ns/op

Now non-supplementary operations ("lower" and "upperLower") are on par 
with the "before" result (I am not quite sure why the "after" results 
are somewhat faster though). For supplementary test cases, "supLower" is 
very slow. The reason is two fold; one is because "before" one exits at 
the very first character (which I am addressing here) while "after" 
continues to compare to the last characters, the other reason is the 
test suffers from the change where supplementary cases double the case 
insensitivity checks (compared to the "after" result just below). Also 
"supUpperLower" gets slower for the same reason. These are expected 
results for supplementary comparisons (as we discussed).


Naoto

On 7/17/20 4:36 PM, naoto.s...@oracle.com wrote:

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after 
the change:


before:
Benchmark    Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499  ns/op

after:
Benchmark    Mode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" 
means each character requires upper->lower casemap. "upperLower" means 
all characters are the same, except the last character which requires 
casemap.


I think the result is reasonable, considering surrogates check are now 
mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but 
it did not seem to benefit here. In fact, the performance degraded 
partly because I implemented the short cut, and possibly for the 
overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case 
mappings of supplementary characters, possibly because it was 
overlooked in the first place, JSR 204 - "Unicode Supplementary 
Character support". Similar behavior is observed in other two 
case-insensitive methods, i.e., compareToIgnoreCase() and 
equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change 
will introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto


Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread naoto . sato

Hi Brent,

On 7/21/20 2:56 PM, Brent Christian wrote:

Hi, Naoto

I have a few comments:

src/java.base/share/classes/java/lang/StringUTF16.java

379 private static int getSupplementaryCodePoint(byte[] ba, int cp, 
int index, int start, int end)


I think it would be worth a small addition to the comment to reflect 
that non-surrogate chars are returned as-is.


Sure, I will add some more comments to the method.



--

I thought about the scenario of an unpaired low or high surrogate at the 
beginning or end of the string, respectively:


384 if (Character.isLowSurrogate((char)cp)) {
385 if (index > start) {
     ...
391 } else if (index + 1 < end) { // cp == high surrogate
392 char c = getChar(ba, index + 1);
     ...
397 return cp;

It looks like the cp itself would be returned from 
getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's 
converted using Character.to[Upper|Lower]Case(int), which will also 
return the cp itself.  I imagine that's the best we could do, so seems 
fine.


Yes, that is exactly what is intended.



Is there a test case for unmatched surrogates at the beginning and end 
of the string ? Should there be ?


Interestingly, there has been a test case for supplementary characters 
before this change, where it intentionally begins from a low surrogate, 
and ends with a high surrogate, so that it would succeed in the previous 
*exact match* logic. Line 82 in RegionMatches.java tests:
"\uD801\uDC28\uD801\uDC29\uFF41a".regionMatches(true, 1, "\uDC28\uD801", 
0, 2) == true

And the proposed change is compatible with this test case.



--

I see there are no changes to StringLatin1.regionMatchesCI_UTF16().  I 
presume there are no cases in which toUpperCase(toLowerCase()) of a 
supplementary character could yield a Latin-1 character, yes?


Yes, that is correct.

Naoto



Also, thanks for adding the benchmark!

-Brent

On 7/20/20 3:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead 
of adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). 
Since we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it 
would be considered a regression. I would suggest you run the 
specJVM. I agree with you on surrogate check being a requirement, 
thus supLower being 139% slower is ok since it won't otherwise be 
correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, 
but the test suite is aimed at the entire application performance. 
But for this one, it is a micro benchmark for relatively rarely 
issued methods (I would think normal cases fall into Latin1 
equivalents), I would consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very 
long sample strings to see if we can reduce the noise level and also 
see how it fares for a longer string. I assume the machine you're 
running the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on 
the input data, Unless the result showed 100% slower or something 
(except supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the 
slower codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not 
need the complication.


The implementation is dealing with bare byte 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread Brent Christian

Hi, Naoto

I have a few comments:

src/java.base/share/classes/java/lang/StringUTF16.java

379 private static int getSupplementaryCodePoint(byte[] ba, int cp, 
int index, int start, int end)


I think it would be worth a small addition to the comment to reflect 
that non-surrogate chars are returned as-is.


--

I thought about the scenario of an unpaired low or high surrogate at the 
beginning or end of the string, respectively:


384 if (Character.isLowSurrogate((char)cp)) {
385 if (index > start) {
...
391 } else if (index + 1 < end) { // cp == high surrogate
392 char c = getChar(ba, index + 1);
...
397 return cp;

It looks like the cp itself would be returned from 
getSupplementaryCodePoint(). And then back in compareToCIImpl(), it's 
converted using Character.to[Upper|Lower]Case(int), which will also 
return the cp itself.  I imagine that's the best we could do, so seems fine.


Is there a test case for unmatched surrogates at the beginning and end 
of the string ? Should there be ?


--

I see there are no changes to StringLatin1.regionMatchesCI_UTF16().  I 
presume there are no cases in which toUpperCase(toLowerCase()) of a 
supplementary character could yield a Latin-1 character, yes?


Also, thanks for adding the benchmark!

-Brent

On 7/20/20 3:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead of 
adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). Since 
we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it would 
be considered a regression. I would suggest you run the specJVM. I 
agree with you on surrogate check being a requirement, thus supLower 
being 139% slower is ok since it won't otherwise be correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, but 
the test suite is aimed at the entire application performance. But for 
this one, it is a micro benchmark for relatively rarely issued methods 
(I would think normal cases fall into Latin1 equivalents), I would 
consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very long 
sample strings to see if we can reduce the noise level and also see 
how it fares for a longer string. I assume the machine you're running 
the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on the 
input data, Unless the result showed 100% slower or something (except 
supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the slower 
codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not need 
the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), so 
I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread naoto . sato

Thank you, Joe. I got it now. Will try out and benchmark.

Naoto

On 7/21/20 10:05 AM, Joe Wang wrote:



On 7/20/2020 8:58 PM, naoto.s...@oracle.com wrote:
The short-cut worked well. There's maybe a further optimization we 
could do to rid us of the performance concern (or the need to run 
additional performance tests). Consider the cases where strings in 
comparison don't contain any sup characters, if we make the 
toLower/UpperCase() block a method and call it before and after the 
surrogate-check block, the routine would be effectively the same as 
prior to the introduction of the surrogate-check block, and regular 
comparisons would suffer the surrogate-check only once (the last 
check). For strings that do contain sup characters then, the 
toLower/UpperCase() method would have been called twice, but then we 
don't care about the performance in that situation. You may call the 
existing codePointAt method too when an extra getChar and performance 
is not a concern (but that's your call.


Can you please elaborate this more? What's "the last check" here?


What I meant was that we could expand the 'short-cut' from case 
sensitive to case insensitive, that is in addition to the line 337, do 
that line 353 - 370 case-insensitive check as well.


I guess it can be explained better with code. I added inline comment:

     for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; 
k1++, k2++) {

     int cp1 = (int)getChar(value, k1);
     int cp2 = (int)getChar(other, k2);

// does a case-insensitive check:

     if (checkEqual(cp1, cp2) == 0) {
     continue;
     }

// this block will be run once for strings that don't contain any 
supplementary characters


  // Check for supplementary characters case
     cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, 
tlast);

     if ((cp1 & Integer.MIN_VALUE) != 0) {
     k1++;
     cp1 ^= Integer.MIN_VALUE;
     }
     cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, 
olast);

     if ((cp2 & Integer.MIN_VALUE) != 0) {
     k2++;
     cp2 ^= Integer.MIN_VALUE;
     }


// thischeck will have been called twice for strings that contain 
supplementary characters

// but only one more for strings that don't

     int diff = checkEqual(cp1, cp2);
     if (diff != 0) {
     return diff;
     }
     }
     return tlen - olen;
     }

// the code block between line 353 - 370 in webrev.04 except the last 
line (return 0):

     private static int checkEqual(int cp1, int cp2) {
     if (cp1 != cp2) {
     // try converting both characters to uppercase.
     // If the results match, then the comparison scan should
     // continue.
     cp1 = Character.toUpperCase(cp1);
     cp2 = Character.toUpperCase(cp2);
     if (cp1 != cp2) {
     // Unfortunately, conversion to uppercase does not work 
properly
     // for the Georgian alphabet, which has strange rules 
about case

     // conversion.  So we need to make one last check before
     // exiting.
     cp1 = Character.toLowerCase(cp1);
     cp2 = Character.toLowerCase(cp2);
     if (cp1 != cp2) {
     return cp1 - cp2;
     }
     }
     }
     return 0;
     }





Naoto 




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread naoto . sato

Thank you, Roger.

On 7/21/20 7:50 AM, Roger Riggs wrote:

Hi Naoto,

StringUTF16.java:

343, 348: The masking and comparisons seem awkward.
I'd suggest just negating the value and testing for < 0 to flag the less 
usual case.
If you continue with the flag bit, define your own static final constant 
for that bit;

It looks odd to be using Integer.MIN_VALUE with bit operators.


Right. I will change it to negating.



Performance wise, I'm a bit cautious about all the if's and branches;

And only calling getSupplementaryCodePoint if cp was a suggorate might 
have some

benefit, but its hard to tell what will be inlined and when.


I think Joe's suggestion will help here. Will update the webrev.

Naoto



Thanks, Roger



On 7/20/20 6:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead 
of adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). 
Since we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it 
would be considered a regression. I would suggest you run the 
specJVM. I agree with you on surrogate check being a requirement, 
thus supLower being 139% slower is ok since it won't otherwise be 
correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, 
but the test suite is aimed at the entire application performance. 
But for this one, it is a micro benchmark for relatively rarely 
issued methods (I would think normal cases fall into Latin1 
equivalents), I would consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very 
long sample strings to see if we can reduce the noise level and also 
see how it fares for a longer string. I assume the machine you're 
running the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on 
the input data, Unless the result showed 100% slower or something 
(except supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the 
slower codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not 
need the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), 
so I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 
regionMatchesCI()

    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before 
and after

    the change:

    before:
    Benchmark                                Mode  Cnt  Score 
 Error     Units
    StringCompareToIgnoreCase.lower          avgt   25 53.764 ? 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread Mark Davis ☕️
One option is to have a fast path that uses char functions, up to the point
where you hit a high surrogate, then drop into the slower codepoint path.
That saves time for the high-runner cases.

On the other hand, if the times are good enough, you might not need the
complication.

Mark


On Fri, Jul 17, 2020 at 4:39 PM  wrote:

> Hi,
>
> Based on the suggestions, I modified the fix as follows:
>
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/
>
> Changes from the initial revision are:
>
> - Shared the implementation between compareToCI() and regionMatchesCI()
> - Enabled immediate short cut if two code points match.
> - Created a simple JMH benchmark. Here is the scores before and after
> the change:
>
> before:
> BenchmarkMode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811  ns/op
> StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
> StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499  ns/op
>
> after:
> BenchmarkMode  Cnt   Score   Error  Units
> StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603  ns/op
> StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672  ns/op
> StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
> StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272  ns/op
>
> Here, "sup" means all supplementary characters, BMP otherwise. "lower"
> means each character requires upper->lower casemap. "upperLower" means
> all characters are the same, except the last character which requires
> casemap.
>
> I think the result is reasonable, considering surrogates check are now
> mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
> it did not seem to benefit here. In fact, the performance degraded
> partly because I implemented the short cut, and possibly for the
> overhead of extra checks.
>
> Naoto
>
> On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:
> > Hello,
> >
> > Please review the fix to the following issues:
> >
> > https://bugs.openjdk.java.net/browse/JDK-8248655
> > https://bugs.openjdk.java.net/browse/JDK-8248434
> >
> > The proposed changeset and its CSR are located at:
> >
> > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> > https://bugs.openjdk.java.net/browse/JDK-8248664
> >
> > A bug was filed against SimpleDateFormat (8248434) where
> > case-insensitive date format/parse failed in some of the new locales in
> > JDK15. The root cause was that case-insensitive String.regionMatches()
> > method did not work with supplementary characters. The problem is that
> > the method's spec does not expect case mappings of supplementary
> > characters, possibly because it was overlooked in the first place, JSR
> > 204 - "Unicode Supplementary Character support". Similar behavior is
> > observed in other two case-insensitive methods, i.e.,
> > compareToIgnoreCase() and equalsIgnoreCase().
> >
> > The fix is straightforward to compare strings by code point basis,
> > instead of code unit (16bit "char") basis. Technically this change will
> > introduce a backward incompatibility, but I believe it is an
> > incompatibility to wrong behavior, not true to the meaning of those
> > methods' expectations.
> >
> > Naoto
>


Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread Joe Wang




On 7/20/2020 8:58 PM, naoto.s...@oracle.com wrote:
The short-cut worked well. There's maybe a further optimization we 
could do to rid us of the performance concern (or the need to run 
additional performance tests). Consider the cases where strings in 
comparison don't contain any sup characters, if we make the 
toLower/UpperCase() block a method and call it before and after the 
surrogate-check block, the routine would be effectively the same as 
prior to the introduction of the surrogate-check block, and regular 
comparisons would suffer the surrogate-check only once (the last 
check). For strings that do contain sup characters then, the 
toLower/UpperCase() method would have been called twice, but then we 
don't care about the performance in that situation. You may call the 
existing codePointAt method too when an extra getChar and performance 
is not a concern (but that's your call.


Can you please elaborate this more? What's "the last check" here?


What I meant was that we could expand the 'short-cut' from case 
sensitive to case insensitive, that is in addition to the line 337, do 
that line 353 - 370 case-insensitive check as well.


I guess it can be explained better with code. I added inline comment:

    for (int k1 = toffset, k2 = ooffset; k1 < tlast && k2 < olast; 
k1++, k2++) {

    int cp1 = (int)getChar(value, k1);
    int cp2 = (int)getChar(other, k2);

// does a case-insensitive check:

    if (checkEqual(cp1, cp2) == 0) {
    continue;
    }

// this block will be run once for strings that don't contain any 
supplementary characters


 // Check for supplementary characters case
    cp1 = getSupplementaryCodePoint(value, cp1, k1, toffset, 
tlast);

    if ((cp1 & Integer.MIN_VALUE) != 0) {
    k1++;
    cp1 ^= Integer.MIN_VALUE;
    }
    cp2 = getSupplementaryCodePoint(other, cp2, k2, ooffset, 
olast);

    if ((cp2 & Integer.MIN_VALUE) != 0) {
    k2++;
    cp2 ^= Integer.MIN_VALUE;
    }


// thischeck will have been called twice for strings that contain 
supplementary characters

// but only one more for strings that don't

    int diff = checkEqual(cp1, cp2);
    if (diff != 0) {
    return diff;
    }
    }
    return tlen - olen;
    }

// the code block between line 353 - 370 in webrev.04 except the last 
line (return 0):

    private static int checkEqual(int cp1, int cp2) {
    if (cp1 != cp2) {
    // try converting both characters to uppercase.
    // If the results match, then the comparison scan should
    // continue.
    cp1 = Character.toUpperCase(cp1);
    cp2 = Character.toUpperCase(cp2);
    if (cp1 != cp2) {
    // Unfortunately, conversion to uppercase does not work 
properly
    // for the Georgian alphabet, which has strange rules 
about case

    // conversion.  So we need to make one last check before
    // exiting.
    cp1 = Character.toLowerCase(cp1);
    cp2 = Character.toLowerCase(cp2);
    if (cp1 != cp2) {
    return cp1 - cp2;
    }
    }
    }
    return 0;
    }





Naoto 




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-21 Thread Roger Riggs

Hi Naoto,

StringUTF16.java:

343, 348: The masking and comparisons seem awkward.
I'd suggest just negating the value and testing for < 0 to flag the less 
usual case.
If you continue with the flag bit, define your own static final constant 
for that bit;

It looks odd to be using Integer.MIN_VALUE with bit operators.

Performance wise, I'm a bit cautious about all the if's and branches;

And only calling getSupplementaryCodePoint if cp was a suggorate might 
have some

benefit, but its hard to tell what will be inlined and when.

Thanks, Roger



On 7/20/20 6:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead 
of adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). 
Since we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it 
would be considered a regression. I would suggest you run the 
specJVM. I agree with you on surrogate check being a requirement, 
thus supLower being 139% slower is ok since it won't otherwise be 
correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, 
but the test suite is aimed at the entire application performance. 
But for this one, it is a micro benchmark for relatively rarely 
issued methods (I would think normal cases fall into Latin1 
equivalents), I would consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very 
long sample strings to see if we can reduce the noise level and also 
see how it fares for a longer string. I assume the machine you're 
running the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on 
the input data, Unless the result showed 100% slower or something 
(except supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the 
slower codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not 
need the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), 
so I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 
regionMatchesCI()

    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before 
and after

    the change:

    before:
    Benchmark                                Mode  Cnt  Score 
 Error     Units
    StringCompareToIgnoreCase.lower          avgt   25 53.764 ? 
2.811     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25 24.211 ? 
1.135     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25 30.595 ? 
1.344     

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-20 Thread naoto . sato

Hi Joe,

On 7/20/20 7:14 PM, Joe Wang wrote:

Hi Naoto,

"Unless it showed 100% slower", wow, your tolerance is quite high :-). 
On the other hand, I do agree it's unlikely to show in specJVM (that's a 
speculation though).


I am not saying 100% slowing is permissible :-) That's an example of 
definite no.




The short-cut worked well. There's maybe a further optimization we could 
do to rid us of the performance concern (or the need to run additional 
performance tests). Consider the cases where strings in comparison don't 
contain any sup characters, if we make the toLower/UpperCase() block a 
method and call it before and after the surrogate-check block, the 
routine would be effectively the same as prior to the introduction of 
the surrogate-check block, and regular comparisons would suffer the 
surrogate-check only once (the last check). For strings that do contain 
sup characters then, the toLower/UpperCase() method would have been 
called twice, but then we don't care about the performance in that 
situation. You may call the existing codePointAt method too when an 
extra getChar and performance is not a concern (but that's your call.


Can you please elaborate this more? What's "the last check" here?

Naoto



Regards,
Joe

On 7/20/20 3:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead 
of adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). 
Since we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it 
would be considered a regression. I would suggest you run the 
specJVM. I agree with you on surrogate check being a requirement, 
thus supLower being 139% slower is ok since it won't otherwise be 
correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, 
but the test suite is aimed at the entire application performance. 
But for this one, it is a micro benchmark for relatively rarely 
issued methods (I would think normal cases fall into Latin1 
equivalents), I would consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very 
long sample strings to see if we can reduce the noise level and also 
see how it fares for a longer string. I assume the machine you're 
running the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on 
the input data, Unless the result showed 100% slower or something 
(except supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the 
slower codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not 
need the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), 
so I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-20 Thread Joe Wang

Hi Naoto,

"Unless it showed 100% slower", wow, your tolerance is quite high :-). 
On the other hand, I do agree it's unlikely to show in specJVM (that's a 
speculation though).


The short-cut worked well. There's maybe a further optimization we could 
do to rid us of the performance concern (or the need to run additional 
performance tests). Consider the cases where strings in comparison don't 
contain any sup characters, if we make the toLower/UpperCase() block a 
method and call it before and after the surrogate-check block, the 
routine would be effectively the same as prior to the introduction of 
the surrogate-check block, and regular comparisons would suffer the 
surrogate-check only once (the last check). For strings that do contain 
sup characters then, the toLower/UpperCase() method would have been 
called twice, but then we don't care about the performance in that 
situation. You may call the existing codePointAt method too when an 
extra getChar and performance is not a concern (but that's your call.


Regards,
Joe

On 7/20/20 3:20 PM, naoto.s...@oracle.com wrote:

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead 
of adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). 
Since we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it 
would be considered a regression. I would suggest you run the 
specJVM. I agree with you on surrogate check being a requirement, 
thus supLower being 139% slower is ok since it won't otherwise be 
correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, 
but the test suite is aimed at the entire application performance. 
But for this one, it is a micro benchmark for relatively rarely 
issued methods (I would think normal cases fall into Latin1 
equivalents), I would consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very 
long sample strings to see if we can reduce the noise level and also 
see how it fares for a longer string. I assume the machine you're 
running the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that 
are almost identical, but one last character is case insensitively 
equal. So in these cases, the new short cut works really well and not 
call toLower/UpperCase() at all for most of the characters. Thus the 
new results are faster. Again the test result is very dependent on 
the input data, Unless the result showed 100% slower or something 
(except supLower case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the 
slower codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not 
need the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), 
so I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-20 Thread naoto . sato

Small correction in the updated part:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.04/

Naoto

On 7/20/20 2:39 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get 
there if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead of 
adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). Since 
we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it would 
be considered a regression. I would suggest you run the specJVM. I 
agree with you on surrogate check being a requirement, thus supLower 
being 139% slower is ok since it won't otherwise be correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, but 
the test suite is aimed at the entire application performance. But for 
this one, it is a micro benchmark for relatively rarely issued methods 
(I would think normal cases fall into Latin1 equivalents), I would 
consider it is acceptable.


But after introducing additional operations supUpperLower and 
upperLower ran faster? That may indicate irregularity in the tests. 
Maybe we should consider running tests with short, long and very long 
sample strings to see if we can reduce the noise level and also see 
how it fares for a longer string. I assume the machine you're running 
the test on was isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that are 
almost identical, but one last character is case insensitively equal. So 
in these cases, the new short cut works really well and not call 
toLower/UpperCase() at all for most of the characters. Thus the new 
results are faster. Again the test result is very dependent on the input 
data, Unless the result showed 100% slower or something (except supLower 
case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to 
the point where you hit a high surrogate, then drop into the slower 
codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not need 
the complication.


The implementation is dealing with bare byte arrays. Only methods 
that it uses from Character class are toLowerCase(int) and 
toUpperCase(int) (sans surrogate check, which is needed at each 
iteration anyways), and their "char" equivalents are merely casting 
(char) to the int result. So it might not be so beneficial to 
differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), so 
I revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 
regionMatchesCI()

    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before and 
after

    the change:

    before:
    Benchmark                                Mode  Cnt   Score 
 Error     Units
    StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
2.811     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
1.135     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
1.344     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
1.499     ns/op


    after:
    Benchmark                                Mode  Cnt   Score 
 Error     Units
    StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
4.603     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
5.672     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
0.965     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
0.272     ns/op


    Here, "sup" means all supplementary characters, BMP otherwise. 
"lower"
    means each 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-20 Thread naoto . sato

Hi Joe,

Thank you for your comments.

On 7/20/20 11:20 AM, Joe Wang wrote:

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get there 
if 389:isHighSurrogate is not true. 


Good point.

But more importantly, StringUTF16
has existing method "codePointAt" you may want to consider instead of 
adding a new method.


If we call codePointAt/Before, it would call an extra getChar(). Since 
we know one codepoint as an input, I would avoid the extra calls.




Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it would be 
considered a regression. I would suggest you run the specJVM. I agree 
with you on surrogate check being a requirement, thus supLower being 
139% slower is ok since it won't otherwise be correct anyways.


Yes, it would be a regression if SPECjvm produces 8+% degradation, but 
the test suite is aimed at the entire application performance. But for 
this one, it is a micro benchmark for relatively rarely issued methods 
(I would think normal cases fall into Latin1 equivalents), I would 
consider it is acceptable.


But after 
introducing additional operations supUpperLower and upperLower ran 
faster? That may indicate irregularity in the tests. Maybe we should 
consider running tests with short, long and very long sample strings to 
see if we can reduce the noise level and also see how it fares for a 
longer string. I assume the machine you're running the test on was 
isolated.


This result pretty much depends on the data it is testing for. As I 
wrote in the previous email, (sup)UpperLower tests use the data that are 
almost identical, but one last character is case insensitively equal. So 
in these cases, the new short cut works really well and not call 
toLower/UpperCase() at all for most of the characters. Thus the new 
results are faster. Again the test result is very dependent on the input 
data, Unless the result showed 100% slower or something (except supLower 
case), I think it is OK.


Anyways, here is the updated webrev addressing your first suggestion:

http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.03/

Naoto



Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to the 
point where you hit a high surrogate, then drop into the slower 
codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not need 
the complication.


The implementation is dealing with bare byte arrays. Only methods that 
it uses from Character class are toLowerCase(int) and toUpperCase(int) 
(sans surrogate check, which is needed at each iteration anyways), and 
their "char" equivalents are merely casting (char) to the int result. 
So it might not be so beneficial to differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), so I 
revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 
regionMatchesCI()

    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before and 
after

    the change:

    before:
    Benchmark                                Mode  Cnt   Score  Error 
    Units
    StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
2.811     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
1.135     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
1.344     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
1.499     ns/op


    after:
    Benchmark                                Mode  Cnt   Score  Error 
    Units
    StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
4.603     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
5.672     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
0.965     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
0.272     ns/op


    Here, "sup" means all supplementary characters, BMP otherwise. 
"lower"
    means each character requires upper->lower casemap. "upperLower" 
means
    all characters are the same, except the last character which 
requires

    casemap.

    I 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-20 Thread Joe Wang

Hi Naoto,

StringUTF16: line 384 - 388 seem unnecessary since you'd only get there 
if 389:isHighSurrogate is not true. But more importantly, StringUTF16 
has existing method "codePointAt" you may want to consider instead of 
adding a new method.


Comparing to the base benchmark:
StringCompareToIgnoreCase.lower  8.5%
StringCompareToIgnoreCase.supLower  139%
StringCompareToIgnoreCase.supUpperLower  -21.8%
StringCompareToIgnoreCase.upperLower avgt   -5.9%


"lower" was 8.5% slower, if such test exists in the specJVM, it would be 
considered a regression. I would suggest you run the specJVM. I agree 
with you on surrogate check being a requirement, thus supLower being 
139% slower is ok since it won't otherwise be correct anyways. But after 
introducing additional operations supUpperLower and upperLower ran 
faster? That may indicate irregularity in the tests. Maybe we should 
consider running tests with short, long and very long sample strings to 
see if we can reduce the noise level and also see how it fares for a 
longer string. I assume the machine you're running the test on was isolated.


Regards,
Joe

On 7/19/2020 11:05 AM, naoto.s...@oracle.com wrote:

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to the 
point where you hit a high surrogate, then drop into the slower 
codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not need 
the complication.


The implementation is dealing with bare byte arrays. Only methods that 
it uses from Character class are toLowerCase(int) and toUpperCase(int) 
(sans surrogate check, which is needed at each iteration anyways), and 
their "char" equivalents are merely casting (char) to the int result. 
So it might not be so beneficial to differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate 
check (always checks high *and* low surrogate on each iteration), so I 
revised the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


    Hi,

    Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

    Changes from the initial revision are:

    - Shared the implementation between compareToCI() and 
regionMatchesCI()

    - Enabled immediate short cut if two code points match.
    - Created a simple JMH benchmark. Here is the scores before and 
after

    the change:

    before:
    Benchmark                                Mode  Cnt   Score  Error 
    Units
    StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 
2.811     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 
1.135     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 
1.344     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 
1.499     ns/op


    after:
    Benchmark                                Mode  Cnt   Score  Error 
    Units
    StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 
4.603     ns/op
    StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 
5.672     ns/op
    StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 
0.965     ns/op
    StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 
0.272     ns/op


    Here, "sup" means all supplementary characters, BMP otherwise. 
"lower"
    means each character requires upper->lower casemap. "upperLower" 
means
    all characters are the same, except the last character which 
requires

    casemap.

    I think the result is reasonable, considering surrogates check 
are now
    mandatory. I have tried Roger's suggestion to use 
Arrays.mismatch() but

    it did not seem to benefit here. In fact, the performance degraded
    partly because I implemented the short cut, and possibly for the
    overhead of extra checks.

    Naoto

    On 7/15/20 9:00 AM, naoto.s...@oracle.com
     wrote:
 > Hello,
 >
 > Please review the fix to the following issues:
 >
 > https://bugs.openjdk.java.net/browse/JDK-8248655
 > https://bugs.openjdk.java.net/browse/JDK-8248434
 >
 > The proposed changeset and its CSR are located at:
 >
 > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
 > https://bugs.openjdk.java.net/browse/JDK-8248664
 >
 > A bug was filed against SimpleDateFormat (8248434) where
 > case-insensitive date format/parse failed in some of the new
    locales in
 > JDK15. The root cause was that case-insensitive
    String.regionMatches()
 > method did not work with supplementary characters. The problem is
    that
 > the method's spec does not expect case mappings of supplementary
 > characters, 

Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-19 Thread naoto . sato

Hi Mark,

Thank you for your comments.

On 7/17/20 8:03 PM, Mark Davis ☕ wrote:
One option is to have a fast path that uses char functions, up to the 
point where you hit a high surrogate, then drop into the slower 
codepoint path. That saves time for the high-runner cases.


On the other hand, if the times are good enough, you might not need the 
complication.


The implementation is dealing with bare byte arrays. Only methods that 
it uses from Character class are toLowerCase(int) and toUpperCase(int) 
(sans surrogate check, which is needed at each iteration anyways), and 
their "char" equivalents are merely casting (char) to the int result. So 
it might not be so beneficial to differentiate char and int paths.


Having said that, I found that there was an unnecessary surrogate check 
(always checks high *and* low surrogate on each iteration), so I revised 
the fix (added line 380-383 in StringUTF16.java):


http://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.02/

Naoto



Mark
//


On Fri, Jul 17, 2020 at 4:39 PM > wrote:


Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after
the change:

before:
Benchmark                                Mode  Cnt   Score   Error 
Units
StringCompareToIgnoreCase.lower          avgt   25  53.764 ? 2.811 
ns/op
StringCompareToIgnoreCase.supLower       avgt   25  24.211 ? 1.135 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344 
ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  18.859 ? 1.499 
ns/op


after:
Benchmark                                Mode  Cnt   Score   Error 
Units
StringCompareToIgnoreCase.lower          avgt   25  58.354 ? 4.603 
ns/op
StringCompareToIgnoreCase.supLower       avgt   25  57.975 ? 5.672 
ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965 
ns/op
StringCompareToIgnoreCase.upperLower     avgt   25  17.744 ? 0.272 
ns/op


Here, "sup" means all supplementary characters, BMP otherwise. "lower"
means each character requires upper->lower casemap. "upperLower" means
all characters are the same, except the last character which requires
casemap.

I think the result is reasonable, considering surrogates check are now
mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but
it did not seem to benefit here. In fact, the performance degraded
partly because I implemented the short cut, and possibly for the
overhead of extra checks.

Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com
 wrote:
 > Hello,
 >
 > Please review the fix to the following issues:
 >
 > https://bugs.openjdk.java.net/browse/JDK-8248655
 > https://bugs.openjdk.java.net/browse/JDK-8248434
 >
 > The proposed changeset and its CSR are located at:
 >
 > https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
 > https://bugs.openjdk.java.net/browse/JDK-8248664
 >
 > A bug was filed against SimpleDateFormat (8248434) where
 > case-insensitive date format/parse failed in some of the new
locales in
 > JDK15. The root cause was that case-insensitive
String.regionMatches()
 > method did not work with supplementary characters. The problem is
that
 > the method's spec does not expect case mappings of supplementary
 > characters, possibly because it was overlooked in the first
place, JSR
 > 204 - "Unicode Supplementary Character support". Similar behavior is
 > observed in other two case-insensitive methods, i.e.,
 > compareToIgnoreCase() and equalsIgnoreCase().
 >
 > The fix is straightforward to compare strings by code point basis,
 > instead of code unit (16bit "char") basis. Technically this
change will
 > introduce a backward incompatibility, but I believe it is an
 > incompatibility to wrong behavior, not true to the meaning of those
 > methods' expectations.
 >
 > Naoto



Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-17 Thread naoto . sato

Hi,

Based on the suggestions, I modified the fix as follows:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.01/

Changes from the initial revision are:

- Shared the implementation between compareToCI() and regionMatchesCI()
- Enabled immediate short cut if two code points match.
- Created a simple JMH benchmark. Here is the scores before and after 
the change:


before:
BenchmarkMode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower  avgt   25  53.764 ? 2.811  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  24.211 ? 1.135  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  30.595 ? 1.344  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  18.859 ? 1.499  ns/op

after:
BenchmarkMode  Cnt   Score   Error  Units
StringCompareToIgnoreCase.lower  avgt   25  58.354 ? 4.603  ns/op
StringCompareToIgnoreCase.supLower   avgt   25  57.975 ? 5.672  ns/op
StringCompareToIgnoreCase.supUpperLower  avgt   25  23.912 ? 0.965  ns/op
StringCompareToIgnoreCase.upperLower avgt   25  17.744 ? 0.272  ns/op

Here, "sup" means all supplementary characters, BMP otherwise. "lower" 
means each character requires upper->lower casemap. "upperLower" means 
all characters are the same, except the last character which requires 
casemap.


I think the result is reasonable, considering surrogates check are now 
mandatory. I have tried Roger's suggestion to use Arrays.mismatch() but 
it did not seem to benefit here. In fact, the performance degraded 
partly because I implemented the short cut, and possibly for the 
overhead of extra checks.


Naoto

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales in 
JDK15. The root cause was that case-insensitive String.regionMatches() 
method did not work with supplementary characters. The problem is that 
the method's spec does not expect case mappings of supplementary 
characters, possibly because it was overlooked in the first place, JSR 
204 - "Unicode Supplementary Character support". Similar behavior is 
observed in other two case-insensitive methods, i.e., 
compareToIgnoreCase() and equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change will 
introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto


Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread James Laskey




> On Jul 15, 2020, at 4:32 PM, Joe Wang  wrote:
> 
> Jim: I was referring to the rest of the process (the calls to 
> toCodePoint/toUpperCase/toLowerCase), not isHighSurrogate itself. But Roger 
> has a more comprehensive review on performance, and Naoto is planning on 
> preparing a JMH test.
> 
> -Joe
> 
>> On 7/15/2020 11:46 AM, Jim Laskey wrote:
>> Joe: This is a defensive approach that I believe has minimal cost.
>> 
>> public static boolean isHighSurrogate(char ch) {
>> // Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE
>> return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
>> }
>> 
>> 
 On Jul 15, 2020, at 3:32 PM, naoto.s...@oracle.com wrote:
>>> 
>>> Hi Joe,
>>> 
>>> Thank you for your review.
>>> 
>>> On 7/15/20 10:57 AM, Joe Wang wrote:
 Hi Naoto,
 In StringUTF16.java, if one is isHighSurrogate and the other not, you may 
 quickly return without going through the rest of the process, probably not 
 significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it 
 could skip a couple of toCodePoint/toUpperCase/toLowerCase calls.
>>> Yes, that is correct as of now, which is based on the assumption that case 
>>> mappings do not cross BMP and supplementary planes boundary. I could not 
>>> find any description where that's given or not. So I just took it to be 
>>> safe.
>>> 
>>> Naoto
>>> 
 -Joe
 On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:
> Hello,
> 
> Please review the fix to the following issues:
> 
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
> 
> The proposed changeset and its CSR are located at:
> 
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
> 
> A bug was filed against SimpleDateFormat (8248434) where case-insensitive 
> date format/parse failed in some of the new locales in JDK15. The root 
> cause was that case-insensitive String.regionMatches() method did not 
> work with supplementary characters. The problem is that the method's spec 
> does not expect case mappings of supplementary characters, possibly 
> because it was overlooked in the first place, JSR 204 - "Unicode 
> Supplementary Character support". Similar behavior is observed in other 
> two case-insensitive methods, i.e., compareToIgnoreCase() and 
> equalsIgnoreCase().
> 
> The fix is straightforward to compare strings by code point basis, 
> instead of code unit (16bit "char") basis. Technically this change will 
> introduce a backward incompatibility, but I believe it is an 
> incompatibility to wrong behavior, not true to the meaning of those 
> methods' expectations.
> 
> Naoto
> 



Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread Joe Wang
Jim: I was referring to the rest of the process (the calls to 
toCodePoint/toUpperCase/toLowerCase), not isHighSurrogate itself. But 
Roger has a more comprehensive review on performance, and Naoto is 
planning on preparing a JMH test.


-Joe

On 7/15/2020 11:46 AM, Jim Laskey wrote:

Joe: This is a defensive approach that I believe has minimal cost.

 public static boolean isHighSurrogate(char ch) {
 // Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE
 return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
 }



On Jul 15, 2020, at 3:32 PM, naoto.s...@oracle.com wrote:

Hi Joe,

Thank you for your review.

On 7/15/20 10:57 AM, Joe Wang wrote:

Hi Naoto,
In StringUTF16.java, if one is isHighSurrogate and the other not, you may 
quickly return without going through the rest of the process, probably not 
significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it 
could skip a couple of toCodePoint/toUpperCase/toLowerCase calls.

Yes, that is correct as of now, which is based on the assumption that case 
mappings do not cross BMP and supplementary planes boundary. I could not find 
any description where that's given or not. So I just took it to be safe.

Naoto


-Joe
On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where case-insensitive date 
format/parse failed in some of the new locales in JDK15. The root cause was that 
case-insensitive String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case mappings of 
supplementary characters, possibly because it was overlooked in the first place, JSR 204 
- "Unicode Supplementary Character support". Similar behavior is observed in 
other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase().

The fix is straightforward to compare strings by code point basis, instead of code unit 
(16bit "char") basis. Technically this change will introduce a backward 
incompatibility, but I believe it is an incompatibility to wrong behavior, not true to 
the meaning of those methods' expectations.

Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread Jim Laskey
Joe: This is a defensive approach that I believe has minimal cost.

public static boolean isHighSurrogate(char ch) {
// Help VM constant-fold; MAX_HIGH_SURROGATE + 1 == MIN_LOW_SURROGATE
return ch >= MIN_HIGH_SURROGATE && ch < (MAX_HIGH_SURROGATE + 1);
}


> On Jul 15, 2020, at 3:32 PM, naoto.s...@oracle.com wrote:
> 
> Hi Joe,
> 
> Thank you for your review.
> 
> On 7/15/20 10:57 AM, Joe Wang wrote:
>> Hi Naoto,
>> In StringUTF16.java, if one is isHighSurrogate and the other not, you may 
>> quickly return without going through the rest of the process, probably not 
>> significant as cp1 and cp2 and/or u1 and u2 won't be equal anyways. But it 
>> could skip a couple of toCodePoint/toUpperCase/toLowerCase calls.
> 
> Yes, that is correct as of now, which is based on the assumption that case 
> mappings do not cross BMP and supplementary planes boundary. I could not find 
> any description where that's given or not. So I just took it to be safe.
> 
> Naoto
> 
>> -Joe
>> On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:
>>> Hello,
>>> 
>>> Please review the fix to the following issues:
>>> 
>>> https://bugs.openjdk.java.net/browse/JDK-8248655
>>> https://bugs.openjdk.java.net/browse/JDK-8248434
>>> 
>>> The proposed changeset and its CSR are located at:
>>> 
>>> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
>>> https://bugs.openjdk.java.net/browse/JDK-8248664
>>> 
>>> A bug was filed against SimpleDateFormat (8248434) where case-insensitive 
>>> date format/parse failed in some of the new locales in JDK15. The root 
>>> cause was that case-insensitive String.regionMatches() method did not work 
>>> with supplementary characters. The problem is that the method's spec does 
>>> not expect case mappings of supplementary characters, possibly because it 
>>> was overlooked in the first place, JSR 204 - "Unicode Supplementary 
>>> Character support". Similar behavior is observed in other two 
>>> case-insensitive methods, i.e., compareToIgnoreCase() and 
>>> equalsIgnoreCase().
>>> 
>>> The fix is straightforward to compare strings by code point basis, instead 
>>> of code unit (16bit "char") basis. Technically this change will introduce a 
>>> backward incompatibility, but I believe it is an incompatibility to wrong 
>>> behavior, not true to the meaning of those methods' expectations.
>>> 
>>> Naoto



Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread naoto . sato

Hi Joe,

Thank you for your review.

On 7/15/20 10:57 AM, Joe Wang wrote:

Hi Naoto,

In StringUTF16.java, if one is isHighSurrogate and the other not, you 
may quickly return without going through the rest of the process, 
probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal 
anyways. But it could skip a couple of 
toCodePoint/toUpperCase/toLowerCase calls.


Yes, that is correct as of now, which is based on the assumption that 
case mappings do not cross BMP and supplementary planes boundary. I 
could not find any description where that's given or not. So I just took 
it to be safe.


Naoto



-Joe

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case 
mappings of supplementary characters, possibly because it was 
overlooked in the first place, JSR 204 - "Unicode Supplementary 
Character support". Similar behavior is observed in other two 
case-insensitive methods, i.e., compareToIgnoreCase() and 
equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change 
will introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread naoto . sato

Hi Roger,

Thank you for your review and suggestions.

On 7/15/20 10:56 AM, Roger Riggs wrote:

Hi Naoto,

Given the extra tests in the body of the loop, I think its worth finding 
or creating

a JMH test for this and checking the performance.


Good point.



With performance in mind, I would try to fall back to the UC/LC 
conversions only

when the bytes don't match.  See java.util.Arrays.mismatch(byte[], byte[]).

It might even be worth finding the mismatch in the byte arrays before even
starting to look at the characters.


Agree.



There's also an option to assemble 4 bytes at a time and compare the int's.
If they are equal you are ahead of the game.  If not, back off to comparing
the characters and checking for surrogates.  The backoff code will be a bit
messier though.


Sure int comparison would be faster, I am not quite sure that would be 
worth compared to more complicated recovery code though.




Also, compareToCI and regionMatchesCI could share the implementation of 
the inner loop.


Yes.



If k1 and k2 ever get out of sync, isn't that failed assertion, so why 
have two indexes.


This is for preventing possible case maps where one in BMP and the other 
in supplementary, thus the indices could be different.




The loop will have fewer checks against the length of it processes len-1 
chars

and then have a check if there is a final char to be checked.
it can always know there is another char and can blindly get it.


Maybe, but could be complicated if the last char is a low surrogate.

Anyway, I will come up with a JMH test case and revise the webrev.

Naoto



Regards, Roger


On 7/15/20 12:00 PM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case 
mappings of supplementary characters, possibly because it was 
overlooked in the first place, JSR 204 - "Unicode Supplementary 
Character support". Similar behavior is observed in other two 
case-insensitive methods, i.e., compareToIgnoreCase() and 
equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change 
will introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread Joe Wang

Hi Naoto,

In StringUTF16.java, if one is isHighSurrogate and the other not, you 
may quickly return without going through the rest of the process, 
probably not significant as cp1 and cp2 and/or u1 and u2 won't be equal 
anyways. But it could skip a couple of 
toCodePoint/toUpperCase/toLowerCase calls.


-Joe

On 7/15/20 9:00 AM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case 
mappings of supplementary characters, possibly because it was 
overlooked in the first place, JSR 204 - "Unicode Supplementary 
Character support". Similar behavior is observed in other two 
case-insensitive methods, i.e., compareToIgnoreCase() and 
equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change 
will introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread Roger Riggs

Hi Naoto,

Given the extra tests in the body of the loop, I think its worth finding 
or creating

a JMH test for this and checking the performance.

With performance in mind, I would try to fall back to the UC/LC 
conversions only

when the bytes don't match.  See java.util.Arrays.mismatch(byte[], byte[]).

It might even be worth finding the mismatch in the byte arrays before even
starting to look at the characters.

There's also an option to assemble 4 bytes at a time and compare the int's.
If they are equal you are ahead of the game.  If not, back off to comparing
the characters and checking for surrogates.  The backoff code will be a bit
messier though.

Also, compareToCI and regionMatchesCI could share the implementation of 
the inner loop.


If k1 and k2 ever get out of sync, isn't that failed assertion, so why 
have two indexes.


The loop will have fewer checks against the length of it processes len-1 
chars

and then have a check if there is a final char to be checked.
it can always know there is another char and can blindly get it.

Regards, Roger


On 7/15/20 12:00 PM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where 
case-insensitive date format/parse failed in some of the new locales 
in JDK15. The root cause was that case-insensitive 
String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case 
mappings of supplementary characters, possibly because it was 
overlooked in the first place, JSR 204 - "Unicode Supplementary 
Character support". Similar behavior is observed in other two 
case-insensitive methods, i.e., compareToIgnoreCase() and 
equalsIgnoreCase().


The fix is straightforward to compare strings by code point basis, 
instead of code unit (16bit "char") basis. Technically this change 
will introduce a backward incompatibility, but I believe it is an 
incompatibility to wrong behavior, not true to the meaning of those 
methods' expectations.


Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread naoto . sato

Thank you, Jim, for the quick review!

On 7/15/20 9:26 AM, Jim Laskey wrote:

I think I'm good with this. +1

Asides:

  325 int cp1 = (int)getChar(value, k1);
  326 int cp2 = (int)getChar(other, k2);

I would be tempted to short cut by exiting when not equal, but I think we 
agreed we need to allow for upper/lowers on different planes.
  
In the UTF-16 code I was trying to think of how your could exhaust the first string and not the second, and still have their lengths the same. I think I have convinced myself that it's not possible as long as surrogates always map upper/lowers to surrogates (two chars each.)


Right. All code points as of JDK15/6 is in the same plane, thus the 
lengths won't change. I was trying to create a test case for that 
hypothetical situation, but gave up because each character case map is 
embedded in Unicode Character Database, which cannot be modified.


Naoto



Cheers,

-- Jim






On Jul 15, 2020, at 1:00 PM, naoto.s...@oracle.com wrote:

Hello,

Please review the fix to the following issues:

https://bugs.openjdk.java.net/browse/JDK-8248655
https://bugs.openjdk.java.net/browse/JDK-8248434

The proposed changeset and its CSR are located at:

https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8248664

A bug was filed against SimpleDateFormat (8248434) where case-insensitive date 
format/parse failed in some of the new locales in JDK15. The root cause was that 
case-insensitive String.regionMatches() method did not work with supplementary 
characters. The problem is that the method's spec does not expect case mappings of 
supplementary characters, possibly because it was overlooked in the first place, JSR 204 
- "Unicode Supplementary Character support". Similar behavior is observed in 
other two case-insensitive methods, i.e., compareToIgnoreCase() and equalsIgnoreCase().

The fix is straightforward to compare strings by code point basis, instead of code unit 
(16bit "char") basis. Technically this change will introduce a backward 
incompatibility, but I believe it is an incompatibility to wrong behavior, not true to 
the meaning of those methods' expectations.

Naoto




Re: RFR: 8248655: Support supplementary characters in String case insensitive operations

2020-07-15 Thread Jim Laskey
I think I'm good with this. +1

Asides:

 325 int cp1 = (int)getChar(value, k1);
 326 int cp2 = (int)getChar(other, k2);

I would be tempted to short cut by exiting when not equal, but I think we 
agreed we need to allow for upper/lowers on different planes.
 
In the UTF-16 code I was trying to think of how your could exhaust the first 
string and not the second, and still have their lengths the same. I think I 
have convinced myself that it's not possible as long as surrogates always map 
upper/lowers to surrogates (two chars each.)

Cheers,

-- Jim





> On Jul 15, 2020, at 1:00 PM, naoto.s...@oracle.com wrote:
> 
> Hello,
> 
> Please review the fix to the following issues:
> 
> https://bugs.openjdk.java.net/browse/JDK-8248655
> https://bugs.openjdk.java.net/browse/JDK-8248434
> 
> The proposed changeset and its CSR are located at:
> 
> https://cr.openjdk.java.net/~naoto/8248655.8248434/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8248664
> 
> A bug was filed against SimpleDateFormat (8248434) where case-insensitive 
> date format/parse failed in some of the new locales in JDK15. The root cause 
> was that case-insensitive String.regionMatches() method did not work with 
> supplementary characters. The problem is that the method's spec does not 
> expect case mappings of supplementary characters, possibly because it was 
> overlooked in the first place, JSR 204 - "Unicode Supplementary Character 
> support". Similar behavior is observed in other two case-insensitive methods, 
> i.e., compareToIgnoreCase() and equalsIgnoreCase().
> 
> The fix is straightforward to compare strings by code point basis, instead of 
> code unit (16bit "char") basis. Technically this change will introduce a 
> backward incompatibility, but I believe it is an incompatibility to wrong 
> behavior, not true to the meaning of those methods' expectations.
> 
> Naoto