Paul Moore wrote:
On 03/07/2008, Guido van Rossum <[EMAIL PROTECTED]> wrote:
I don't see an answer there to the question of whether the length()
method of a Java String object containing a single surrogate pair
returns 1 or 2; I suspect it returns 2.

It appears you're right:

type testucs.java
class testucs {
    public static void main(String[] args) {
        StringBuilder s = new StringBuilder("Hello, ");
        s.appendCodePoint(0x2F81A);
        System.out.println(s); // Display the string.
        System.out.println(s.length());
    }
}

java testucs
Hello, ?
9

java -version
java version "1.6.0_05"
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)

Python 3 supports things like
chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
unichr and unicode literals.)

And Java doesn't appear to - that appendCodePoint() method was
wonderfully hard to find :-)
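
For illustration, here is a minimal Python 3 sketch of the chr()/ord() round trip mentioned above, together with a count of UTF-16 code units that matches the 9 printed by the Java program. It assumes Python 3.3 or later, where len() counts code points on every build; the helper name utf16_length is made up for this example.

```python
def utf16_length(s):
    """Number of UTF-16 code units needed for s (hypothetical helper)."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids emitting a BOM.
    return len(s.encode("utf-16-le")) // 2


cp = 0x12345
assert ord(chr(cp)) == cp          # chr() and ord() round-trip a non-BMP code point
assert ord("\U00012345") == cp     # the same character written as a literal

s = "Hello, " + chr(0x2F81A)       # same string as the Java example
print(len(s))                      # code points (assumes Python 3.3+)
print(utf16_length(s))             # UTF-16 code units, which is what Java's length() counts
```

The two printed numbers differ by one because U+2F81A needs a surrogate pair in UTF-16.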

There's also the issue of indexing Unicode strings. If we are going to
insist that len(u) counts a surrogate pair as one character, then random
access to the characters of a string stored as UTF-16 code units needs a
scan from the start of the string, which makes it an extremely
inefficient operation.
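
A rough sketch of why that is, assuming the string is held as a list of UTF-16 code units (the function names utf16_units and code_point_at are hypothetical and not part of any proposal in this thread): finding the i-th character means walking over every earlier unit to see which ones are surrogate pairs.

```python
def utf16_units(s):
    """Return s as a list of 16-bit code units (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_pair(units, pos):
    """True if units[pos] and units[pos + 1] form a surrogate pair."""
    return (pos + 1 < len(units)
            and 0xD800 <= units[pos] <= 0xDBFF
            and 0xDC00 <= units[pos + 1] <= 0xDFFF)


def code_point_at(units, index):
    """Code point at character position index, counting a pair as one character."""
    pos = 0
    for _ in range(index):
        # Linear scan: every earlier character must be inspected.
        pos += 2 if is_pair(units, pos) else 1
    if is_pair(units, pos):
        hi, lo = units[pos], units[pos + 1]
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return units[pos]


units = utf16_units("a\U0002F81Ab")
print(len(units))                    # number of code units
print(hex(code_point_at(units, 1)))  # the non-BMP character
print(chr(code_point_at(units, 2)))  # the character after the pair
```

The loop in code_point_at is what turns indexing from a constant-time lookup into one proportional to the index.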

Surely it's desirable under all circumstances that

  len(u) == sum(1 for c in u)

and that

  [c for c in u] == [u[i] for i in range(len(u))]

How would that play under Jeroen's proposed change?
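
As a runnable form of the two conditions above, here is a small sketch; check_invariants is a made-up name. Both conditions hold whenever len(), iteration and indexing all count the same unit, whether that unit is a code point or a code unit. They would only break if one of the three counted differently from the others.

```python
def check_invariants(u):
    """Return whether the length and indexing invariants hold for u."""
    # len() must agree with the number of items produced by iteration.
    same_length = len(u) == sum(1 for c in u)
    # Iteration must yield the same items as indexing 0 .. len(u) - 1.
    same_items = [c for c in u] == [u[i] for i in range(len(u))]
    return same_length, same_items


print(check_invariants("Hello, \U0002F81A"))
```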

regards
 Steve
--
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev