The algorithm below misses UTF-8 encoded code points greater than 0xFFFF.
According to:
http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html
(although Java before 1.5 doesn't support code points above 0xFFFF, so I'm not sure
what would happen here if someone sent you one of those).
The algorithm found on that page looks like it'll take care of what you're looking
for.
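
For illustration, here is a rough sketch of a counter that also handles code points
above 0xFFFF. It is not the algorithm from the page above, just one way to do it.
The class and method names (Utf8Length, utf8Length) are made up for this example,
and it assumes you want standard UTF-8, where U+0000 takes one byte. It sticks to
plain char comparisons so it does not depend on the Java 1.5 code point API.

```java
public final class Utf8Length {
    private Utf8Length() {}

    /** Counts the bytes s would occupy in standard UTF-8 without encoding it. */
    public static int utf8Length(String s) {
        int numBytes = 0;
        int n = s.length(); // number of UTF-16 code units, not code points
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c <= 0x007F) {
                numBytes += 1; // ASCII, including U+0000
            } else if (c <= 0x07FF) {
                numBytes += 2;
            } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n
                    && s.charAt(i + 1) >= 0xDC00 && s.charAt(i + 1) <= 0xDFFF) {
                // High surrogate followed by low surrogate: one code point
                // above 0xFFFF, which UTF-8 encodes in four bytes.
                numBytes += 4;
                i++; // skip the low surrogate
            } else {
                // Rest of the BMP. Assumption: an unpaired surrogate is also
                // counted as three bytes here; a real encoder may replace it.
                numBytes += 3;
            }
        }
        return numBytes;
    }
}
```

For strings with no unpaired surrogates, the result should match
s.getBytes("UTF-8").length, which is an easy way to sanity-check it.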
--
Chris Mullins
-----Original Message-----
From: Cedric Vivier [mailto:[EMAIL PROTECTED]
Sent: Thu 9/9/2004 2:19 AM
To: [EMAIL PROTECTED]
Cc:
Subject: [jdev] Re: Get the length of the utf-8 sequence in Java
I do not believe Java's standard library has a method for this, but you
could implement your own:
public int byte_length(String s) {
    int numchars = s.length();  // UTF-16 code units, not code points
    int numbytes = 0;
    for (int i = 0; i < numchars; i++) {
        int c = s.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F)) numbytes++;  // 1 byte
        else if (c > 0x07FF) numbytes += 3;              // 3 bytes per char
        else numbytes += 2;                              // 2 bytes (also c == 0)
    }
    return numbytes;
}
I have no idea if it would be faster than your current method though,
but it should be more memory-efficient at least.
--cedricv
_______________________________________________
jdev mailing list
[EMAIL PROTECTED]
https://jabberstudio.org/mailman/listinfo/jdev