Re: [Lazarus] Improving UTF8CharacterLength?

Jürgen Hestermann Thu, 13 Aug 2015 05:07:17 -0700

On 2015-08-13 at 13:01, Mattias Gaertner wrote:

On Thu, 13 Aug 2015 12:38:00 +0200
Jürgen Hestermann <juergen.hesterm...@gmx.de> wrote:

On 2015-08-13 at 11:55, Mattias Gaertner wrote:
  > A string always ends with a #0, so checking byte by byte makes sure you
  > stay within range.

Not quite true:
------------
if ((ord(p^) and %11110000) = %11100000) then
     begin  // could be 3 byte character
     if ((ord(p[1]) and %11000000) = %10000000) and
        ((ord(p[2]) and %11000000) = %10000000) then ...
     ...
------------
In the above (current) code 3 bytes are accessed, which may step past the
zero byte.

The "and" operator stops evaluating if the left side is already false.
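
The short-circuit behaviour described above can be checked with a small
sketch. It assumes Free Pascal with {$B-} (the default); the helper
IsContinuation and the counter Reads are hypothetical names introduced only
to make the evaluation order visible, they are not part of LazUTF8.

```pascal
program ShortCircuitDemo;
{$mode objfpc}{$H+}
{$B-}  // short-circuit boolean evaluation (the FPC default)

var
  Reads: Integer = 0;  // counts how many bytes were inspected

// Hypothetical helper: tests for a UTF-8 continuation byte (10xxxxxx)
// and records that the byte at Index was read.
function IsContinuation(p: PChar; Index: Integer): Boolean;
begin
  Inc(Reads);
  Result := (Ord(p[Index]) and %11000000) = %10000000;
end;

const
  // Truncated input: lead byte of a 3-byte sequence, then the terminator.
  Buf: array[0..1] of Char = (#$E2, #0);
var
  p: PChar;
  Complete: Boolean;
begin
  p := @Buf[0];
  // p[1] is #0, so the left operand is False and p[2] is never read.
  Complete := IsContinuation(p, 1) and IsContinuation(p, 2);
  WriteLn('complete: ', Complete, ', bytes inspected: ', Reads);
end.
```

With {$B-} the second call is skipped, so Reads ends at 1. Compiling the same
program with {$B+} (complete boolean evaluation) would evaluate both operands
and read the byte behind the terminator.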


Yes, I see now that I somehow misinterpreted the code.
You are right that, as long as a zero byte exists,
UTF8CharacterLength would not access any bytes beyond it.

Still, I think it would be better to return 3 when the lead byte actually
indicates 3, because a single such byte does not form a valid UTF-8 character.
If I relied on this result, I would treat this 1 byte as a valid UTF-8
character, which would be wrong, so I have to apply further checks to cope
with this situation anyway.
Then I can just as well check whether the 3 or 4 bytes of the correct result
exist.
I would not lose anything for invalid UTF-8 strings, but I would gain
performance if I can guarantee a valid UTF-8 string.
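
A minimal sketch of this idea, assuming Free Pascal: the length is derived
from the lead byte alone, and a separate check is applied only to untrusted
input. LeadByteLength and SequenceIsComplete are hypothetical names, not the
LazUTF8 implementation, and the sketch does not reject overlong encodings or
invalid lead bytes such as $C0.

```pascal
program LeadByteLengthDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: derives the sequence length from the lead byte
// alone and reads no further bytes.
function LeadByteLength(p: PChar): Integer;
var
  b: Byte;
begin
  b := Ord(p^);
  if b < %10000000 then
    Result := 1                                   // ASCII
  else if (b and %11100000) = %11000000 then
    Result := 2
  else if (b and %11110000) = %11100000 then
    Result := 3
  else if (b and %11111000) = %11110000 then
    Result := 4
  else
    Result := 1;  // stray continuation byte or invalid lead byte
end;

// Hypothetical check for untrusted input: are all continuation bytes
// announced by the lead byte actually present?
function SequenceIsComplete(p: PChar; Len: Integer): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 1 to Len - 1 do
    if (Ord(p[i]) and %11000000) <> %10000000 then
      Exit(False);  // a terminating #0 fails the test, so the scan stops there
end;

const
  Truncated: array[0..1] of Char = (#$E2, #0);            // lead byte only
  Euro: array[0..3] of Char = (#$E2, #$82, #$AC, #0);     // U+20AC
begin
  WriteLn(LeadByteLength(@Truncated[0]));                 // lead byte says 3
  WriteLn(SequenceIsComplete(@Truncated[0], 3));          // incomplete
  WriteLn(SequenceIsComplete(@Euro[0], LeadByteLength(@Euro[0])));
end.
```

For the truncated buffer the lead byte still reports 3 and the completeness
check fails; for the complete Euro sign the check succeeds. Code that can
guarantee valid UTF-8 would call only LeadByteLength and skip the second pass.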

And if no zero byte exists (for whatever reason) it currently fails anyway.


--
_______________________________________________
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
