Am 03.03.2010 09:00, schrieb Martin Buchholz:
On Tue, Mar 2, 2010 at 15:34, Ulf Zibis<[email protected]> wrote:
Am 26.08.2009 20:02, schrieb Xueming Shen:
For example, the isBMP(int), it might be convenient, but it can be easily
archived by the one line code
(int)(char)codePoint == codePoint;
or more readable form
codePoint< Character.MIN_SUPPLEMENTARY_COE_POINT;
In class sun.nio.cs.Surrogate we have:
public static boolean isBMP(int uc) {
return (int) (char) uc == uc;
}
1.) It's enough to have:
return (char)uc == uc;
better:
assert MIN_VALUE == 0&& MAX_VALUE == 0xFFFF;
return (char)uc == uc;
// Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
2.) Above code is compiled to (needs 16 bytes of machine code):
0x00b87ad8: mov %ebx,%ebp
0x00b87ada: and $0xffff,%ebp
0x00b87ae0: cmp %ebx,%ebp
0x00b87ae2: jne 0x00b87c52
0x00b87ae8:
We could code:
assert MIN_VALUE == 0&& (MAX_VALUE + 1) == (1<< 16);
return (uc>> 16) == 0;
// Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
I agree that
return (uc>> 16) == 0;
is marginally better than my
return (int) (char) uc == uc;
(although I think the redundant cast to int
makes the code more readable).
Seems to be individual. I always stumble over superfluous casts by
thinking about, what they have to do.
I approve such a change to isBMPCodePoint()
and inclusion of such a method in Character.
Pleased! Who could file the bug? I would provide the patch.
is compiled to (needs only 9 bytes of machine code):
0x00b87aac: mov %ebx,%ecx
0x00b87aae: sar $0x10,%ecx
0x00b87ab1: test %ecx,%ecx
0x00b87ab3: je 0x00b87acb
0x00b87ab5:
1.) If we have:
public static boolean isSupplementaryCodePoint(int codePoint) {
assert MIN_SUPPLEMENTARY_CODE_POINT == (1<< 16)&&
(MAX_SUPPLEMENTARY_CODE_POINT + 1) % (1<< 16) == 0;
return (codePoint>> 16) != 0
&& (codePoint>> 16)< (MAX_SUPPLEMENTARY_CODE_POINT + 1>> 16);
// Optimized form of: codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
//&& codePoint<= MAX_SUPPLEMENTARY_CODE_POINT;
}
Keep in mind that supplementary characters are extremely rare.
Yes, but many API's in the JDK are used rarely.
Why should they waste memory footprint / perform bad, particularly if it
doesn't cost anything.
Therefore the existing implementation
return codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
&& codePoint<= MAX_CODE_POINT;
will almost always perform just one comparison against a constant,
which is hard to beat.
1. Wondering: I think there are TWO comparisons.
2. Those comparisons need to load 32 bit values from machine code,
against only 8 bit values in my case.
3. The first of the 2 comparisons becomes outlined if compiled in
combination with isBMPCodePoint(). (see below)
I'm not sure whether your code above gets the right answer for negative input.
Perhaps you need to do
(codePoint>>> 16)< ...
Oops, I'm afraid you are right, but fortunately it doesn't cost
anything: sar becomes replaced by shr.
Martin
and:
if (Surrogate.isBMP(uc))
...;
else if (Character.isSupplementaryCodePoint(uc))
...;
else
...;
we get (needs only 18 bytes of machine code):
0x00b87aac: mov %ebx,%ecx
0x00b87aae: sar $0x10,%ecx
0x00b87ab1: test %ecx,%ecx
0x00b87ab3: je 0x00b87acb
0x00b87ab5: cmp $0x11,%ecx
0x00b87ab8: jge 0x00b87ce6
0x00b87abe:
BTW the compiled code of the existing code (needs *36* bytes of machine
code):
0x00b87a5c: test %ebp,%ebp
0x00b87a5e: jl 0x00b87a68
0x00b87a60: cmp $0x10000,%ebp
0x00b87a66: jl 0x00b87a8d
0x00b87a68: cmp $0x10000,%ebp
0x00b87a6e: jl 0x00b87c63
0x00b87a74: cmp $0x10ffff,%ebp
0x00b87a7a: jg 0x00b87c63
0x00b87a80:
BTW 2: The code example is seen in one of the String constructors, where
there the 2-comparison code is manually inlined instead of using
Surrogate.isBMP().
-Ulf