Re: [Lazarus] Improving UTF8CharacterLength?

Mattias Gaertner Thu, 13 Aug 2015 04:06:57 -0700

On Thu, 13 Aug 2015 12:38:00 +0200
Jürgen Hestermann <[email protected]> wrote:


> On 2015-08-13 at 11:55, Mattias Gaertner wrote:
>  > A string always ends with a #0, so checking byte by byte makes sure you
>  > stay within range.
> 
> Not quite true:
> ------------
> if ((ord(p^) and %11110000) = %11100000) then
>     begin  // could be 3 byte character
>     if ((ord(p[1]) and %11000000) = %10000000) and
>        ((ord(p[2]) and %11000000) = %10000000) then ...
>     ...
> ------------
> In the above (current) code, 3 bytes are accessed, which may step past the
> zero byte.

The "and" operator stops evaluating as soon as the left side is false, so
p[2] is only read when p[1] is a continuation byte and therefore not the
terminating #0.
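
A minimal sketch of that argument, assuming Free Pascal with short-circuit
boolean evaluation (the default, {$B-}). IsThreeByteSeq is a hypothetical
helper written for illustration, not the actual UTF8CharacterLength code:

```pascal
program ShortCircuitDemo;
{$mode objfpc}{$H+}
{$B-} // short-circuit boolean evaluation; FPC default, stated explicitly here

// Hypothetical helper (not part of LazUTF8): True if p points at a lead byte
// of a 3-byte sequence followed by two continuation bytes.
function IsThreeByteSeq(p: PChar): Boolean;
begin
  Result := ((Ord(p^) and %11110000) = %11100000)
        // #0 is not a continuation byte, so this test fails on the terminator
        and ((Ord(p[1]) and %11000000) = %10000000)
        // only evaluated when p[1] passed, i.e. p[1] <> #0
        and ((Ord(p[2]) and %11000000) = %10000000);
end;

var
  Truncated, Complete: string;
begin
  // Lead byte immediately followed by the implicit terminating #0.
  Truncated := Chr($E2);
  // U+20AC (euro sign) encoded as E2 82 AC.
  Complete := Chr($E2) + Chr($82) + Chr($AC);

  // The truncated case evaluates to False without touching p[2].
  WriteLn(IsThreeByteSeq(PChar(Truncated)));
  // The complete case evaluates to True.
  WriteLn(IsThreeByteSeq(PChar(Complete)));
end.
```

With complete boolean evaluation ({$B+}) all three operands would be
evaluated, and the guarantee would no longer hold.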

> That's something that needs to be checked in all cases anyway.

No.
That's the advantage of PChar and ASCIIZ.

>[...]
>  > If you know that you have a valid UTF-8 string you can simply use the
>  > first byte of each codepoint (as you pointed out). So, for that case a
>  > faster function can be added.
>  > Maybe UTF8QuickCharLen or something like that.
> 
> Determining the character length of an invalid UTF-8 string is quite useless.
> What do you do with such a result?

Skip.

> IMO the UTF8CharacterLength
> function should always assume a valid UTF-8 string.
> Using this function on invalid UTF-8 strings lets you run into problems 
> anyway.

It is already used this way in many places. If you need a function with
a different behavior, you need to add a new one.
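
A rough sketch of what such an additional function could look like. The name
UTF8QuickCharLen is only the suggestion from this thread, not an existing
LazUTF8 function, and the code assumes the input is already known to be
valid UTF-8, so it looks at the first byte of each codepoint only:

```pascal
program QuickCharLenDemo;
{$mode objfpc}{$H+}

// Hypothetical sketch: length of the codepoint starting at p, derived from
// the lead byte alone. Continuation bytes are never inspected, so the result
// is only meaningful for valid UTF-8.
function UTF8QuickCharLen(p: PChar): Integer;
begin
  case Ord(p^) of
    $00..$7F: Result := 1;  // 0xxxxxxx: single byte (ASCII)
    $C0..$DF: Result := 2;  // 110xxxxx: lead byte of a 2-byte sequence
    $E0..$EF: Result := 3;  // 1110xxxx: lead byte of a 3-byte sequence
    $F0..$F7: Result := 4;  // 11110xxx: lead byte of a 4-byte sequence
  else
    Result := 1;            // not a lead byte: assumption, step over one byte
  end;
end;

var
  S: string;
  p: PChar;
  Count: Integer;
begin
  // 'a', U+20AC (E2 82 AC), 'z': three codepoints in five bytes.
  S := 'a' + Chr($E2) + Chr($82) + Chr($AC) + 'z';
  p := PChar(S);
  Count := 0;
  // The loop stops at the terminating #0 before asking for its length.
  while p^ <> #0 do
  begin
    Inc(p, UTF8QuickCharLen(p));
    Inc(Count);
  end;
  WriteLn(Count);  // number of codepoints counted
end.
```

On invalid input (for example a lead byte directly before the terminating #0)
this loop can step past the end of the string, which is why it would have to
be a separate function and not a replacement for UTF8CharacterLength.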

 
>[...]
>  > Yes, although afaik the compiler can optimize a CASE better than a series
>  > of IFs
>[...]
> 
> Really?

Look at the produced assembler or do some benchmarking.

> [...]

Mattias

--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
