On Thu, 13 Aug 2015 12:38:00 +0200 Jürgen Hestermann <[email protected]> wrote:
> On 2015-08-13 at 11:55, Mattias Gaertner wrote:
> > A string always ends with a #0, so checking byte by byte makes sure you
> > stay within range.
>
> Not quite true:
> ------------
> if ((ord(p^) and %11110000) = %11100000) then
>   begin // could be 3 byte character
>   if ((ord(p[1]) and %11000000) = %10000000) and
>      ((ord(p[2]) and %11000000) = %10000000) then ...
> ...
> ------------
> In the above (current) code 3 bytes are accessed which may step behind the
> zero byte.

The "and" operator stops evaluating if the left side is already false.

> That's something that needs to be checked in all cases anyway.

No. That's the advantage of PChar and ASCIIZ.

>[...]
> > If you know that you have a valid UTF-8 string you can simply use the
> > first byte of each codepoint (as you pointed out). So, for that case a
> > faster function can be added.
> > Maybe UTF8QuickCharLen or something like that.
>
> Determining the character length of an invalid UTF-8 string is quite useless.
> What do you do with such a result?

Skip.

> IMO the UTF8CharacterLength
> function should always assume a valid UTF-8 string.
> Using this function on invalid UTF-8 strings lets you run into problems
> anyway.

It is already used this way in many places. If you need a function with
different behavior, you need to add a new one.

>[...]
> > Yes, although afaik the compiler can optimize a CASE better than a series
> > of IFs
>[...]
>
> Really?

Look at the produced assembler or do some benchmarking.

> [...]

Mattias
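The short-circuit argument made in this message can be sketched as a small Free Pascal program. This is only an illustration, not the actual LazUTF8 code: `SketchCharLen` is a hypothetical helper that handles just the 3-byte case from the quoted snippet, returns 1 for everything else, and does not check for overlong or surrogate encodings. It assumes the default `{$B-}` setting (short-circuit boolean evaluation) and a #0-terminated buffer.

```pascal
program ShortCircuitSketch;
{$mode objfpc}{$H+}
{$B-} // short-circuit boolean evaluation; this is the FPC default

// Hypothetical helper (not the LazUTF8 implementation): returns 3 for a
// well-formed 3-byte lead/continuation pattern at p, otherwise 1 so that
// a caller can skip the byte.
function SketchCharLen(p: PChar): Integer;
begin
  Result := 1;
  if (Ord(p^) and %11110000) = %11100000 then
  begin
    // p[2] is read only when p[1] is a continuation byte (10xxxxxx).
    // The terminating #0 is not a continuation byte, so the left operand
    // is false there and evaluation stops before reading past the end.
    if ((Ord(p[1]) and %11000000) = %10000000) and
       ((Ord(p[2]) and %11000000) = %10000000) then
      Result := 3;
  end;
end;

var
  s: string;
begin
  s := #$E2#$82#$AC;                 // complete 3-byte sequence (U+20AC)
  WriteLn(SketchCharLen(PChar(s)));  // prints 3
  s := #$E2#$82;                     // truncated: p[2] is the terminating #0
  WriteLn(SketchCharLen(PChar(s)));  // prints 1
  s := #$E2;                         // truncated: p[1] is #0, p[2] is never read
  WriteLn(SketchCharLen(PChar(s)));  // prints 1
end.
```

With complete boolean evaluation enabled (`{$B+}`), the last case would read `p[2]`, one byte past the terminator, which is the situation the quoted objection describes.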
