Re: [Lazarus] UTF-8 string recognition

Robin Hoo Wed, 03 Mar 2010 05:06:07 -0800

Hi, JoshyFun

Thanks for pointing out the bug in my code; yes, you are right. I forgot to
put a bounds check before every inc(i,k) and continue; there should be a
statement such as
*if i>length(UnknownStr) then exit(false);*
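
For illustration, here is a minimal sketch of how such a bounds check could be folded into the loop so that a truncated sequence like #$C2 is rejected instead of being read past the end of the string. This is not the original IsUTF8: the names IsUTF8Safe and HasTail are hypothetical, the lead-byte ranges follow the usual well-formed UTF-8 table, and unlike the original it accepts every ASCII byte, control characters included. It assumes {$mode objfpc} with {$H+} and default short-circuit boolean evaluation.

```pascal
{$mode objfpc}{$H+}

// Hypothetical helper: True if the K bytes following position I exist
// and every one of them is a continuation byte ($80..$BF).
function HasTail(const S: string; I, K: Integer): Boolean;
var
  J: Integer;
begin
  if I + K > Length(S) then
    exit(False);                 // truncated sequence: never index past the end
  for J := I + 1 to I + K do
    if not (S[J] in [#$80..#$BF]) then
      exit(False);
  Result := True;
end;

// Hypothetical bounds-safe validator (a sketch, not the function from this thread).
function IsUTF8Safe(const S: string): Boolean;
var
  I, K: Integer;
begin
  I := 1;
  while I <= Length(S) do        // <= so that the last byte is checked too
  begin
    // K = number of continuation bytes expected after the lead byte
    case S[I] of
      #$00..#$7F: K := 0;        // ASCII (assumption: control chars allowed)
      #$C2..#$DF: K := 1;        // non-overlong 2-byte
      #$E0..#$EF: K := 2;        // 3-byte
      #$F0..#$F4: K := 3;        // 4-byte, up to U+10FFFF
    else
      exit(False);               // stray continuation byte or invalid lead byte
    end;
    if not HasTail(S, I, K) then
      exit(False);
    // Extra limits on the second byte; S[I+1] is known to exist here
    // because HasTail succeeded with K >= 2 for these lead bytes.
    if (S[I] = #$E0) and (S[I + 1] < #$A0) then exit(False); // overlong 3-byte
    if (S[I] = #$ED) and (S[I + 1] > #$9F) then exit(False); // surrogates
    if (S[I] = #$F0) and (S[I + 1] < #$90) then exit(False); // overlong 4-byte
    if (S[I] = #$F4) and (S[I + 1] > #$8F) then exit(False); // above U+10FFFF
    Inc(I, K + 1);
  end;
  Result := True;
end;
```

With this layout the length test happens once per character, before any tail byte is read, so a one-byte string holding only #$C2 makes HasTail return False and the function exits with False.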


2010/3/3 JoshyFun <[email protected]>

> Hello Lazarus-List,
>
> Wednesday, March 3, 2010, 12:24:35 AM, you wrote:
>
> RH> Please check the function I use to check for UTF-8 strings. Hope it is helpful.
> RH> function IsUTF8(UnknownStr:string):boolean;
>
> Well, there are a lot of UTF-8 strings that do not pass your checks ;)
> If you remove the low ASCII control chars, what happens to the UTF-8 control
> chars?
>
> RH> var
> RH>     i    :Integer;
> RH> begin
> RH>     if length(UnknownStr)=0 then exit(true);
> RH>     i:=1;
> RH>     while i<length(UnknownStr) do
> RH>     begin
> RH>         // ASCII
> RH>         if  (UnknownStr[i] = #$09) or
> RH>             (UnknownStr[i] = #$0A) or
> RH>             (UnknownStr[i] = #$0D) or
> RH>             (UnknownStr[i] in [#$20..#$7E]) then
> RH>         begin
> RH>             inc(i);
> RH>             continue;
> RH>         end;
> RH>         // non-overlong 2-byte
> RH>         if  (UnknownStr[i] in [#$C2..#$DF]) and
> RH>             (UnknownStr[i+1] in [#$80..#$BF]) then
> RH>         begin
>
> It should crash here with strings like:
>
> var
>  s: string;
> begin
>  s:=#$C2;
>  IsUTF8(s);
> end;
>
> which is not valid UTF8.
>
> RH>              // excluding surrogates
> RH>              ((UnknownStr[i]=#$ED) and
> RH>               (UnknownStr[i+1] in [#$80..#$9F]) and
> RH>               (UnknownStr[i+2] in [#$80..#$BF])) then
>
> Surrogates are not valid UTF-8 code points.
>
> --
> Best regards,
>  JoshyFun
>
>
> --
> _______________________________________________
> Lazarus mailing list
> [email protected]
> http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
>

