According to the article linked here, the algorithm below misses the UTF-8 encoded 
code points that are greater than 0xFFFF:

http://developers.sun.com/dev/gadc/technicalpublications/articles/utf8.html
 
(Java before 1.5 doesn't support code points above 0xFFFF, though, so I'm not sure 
what would happen here if someone sent you one of those.) 
 
The algorithm found on that page looks like it'll take care of what you're looking 
for. 
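 
For illustration, here is a rough sketch of how the counting loop could treat 
code points above 0xFFFF. It is not copied from the page above. It assumes 
standard UTF-8, so U+0000 counts as 1 byte, whereas the loop quoted below counts 
it as 2. The names Utf8Length and utf8Length are made up for this example, and 
counting an unpaired surrogate as 3 bytes is my assumption; String.getBytes("UTF-8") 
may give a different count for such malformed input.

```java
final class Utf8Length {
    // Returns the number of bytes s would occupy in standard UTF-8,
    // without allocating a byte array.
    static int utf8Length(String s) {
        int numChars = s.length();
        int numBytes = 0;

        for (int i = 0; i < numChars; i++) {
            char c = s.charAt(i);
            if (c <= 0x007F) {
                numBytes += 1;                      // U+0000..U+007F
            } else if (c <= 0x07FF) {
                numBytes += 2;                      // U+0080..U+07FF
            } else if (c >= 0xD800 && c <= 0xDBFF   // high surrogate
                    && i + 1 < numChars
                    && s.charAt(i + 1) >= 0xDC00
                    && s.charAt(i + 1) <= 0xDFFF) { // followed by low surrogate
                numBytes += 4;                      // one code point above 0xFFFF
                i++;                                // skip the low surrogate
            } else {
                numBytes += 3;                      // rest of the BMP (assumption:
                                                    // unpaired surrogates land here too)
            }
        }
        return numBytes;
    }
}
```

The surrogate checks use plain range comparisons rather than the Character 
helper methods, so the sketch should also compile on a pre-1.5 JDK.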
 
-- 
Chris Mullins
 
 
 
-----Original Message----- 
From: Cedric Vivier [mailto:[EMAIL PROTECTED] 
Sent: Thu 9/9/2004 2:19 AM 
To: [EMAIL PROTECTED] 
Cc: 
Subject: [jdev] Re: Get the length of the utf-8 sequence in Java



        I do not believe Java has a method for this in the standard
        library, but you could implement your own:
        
        
        public int byte_length(String s) {
             int numchars = s.length();
             int numbytes = 0;
        
             for (int i = 0 ; i < numchars ; i++) {
               int c = s.charAt(i);
               if ((c >= 0x0001) && (c <= 0x007F)) numbytes++;
               else if (c > 0x07FF) numbytes += 3;
               else numbytes += 2;
             }
        
             return numbytes;
        }
        
        
        I have no idea if it would be faster than your current method though,
        but it should be more memory-efficient at least.
        
        
        --cedricv
        
        _______________________________________________
        jdev mailing list
        [EMAIL PROTECTED]
        https://jabberstudio.org/mailman/listinfo/jdev
        

