Hi JoshyFun,

Thanks for pointing out the bug in my code; you are right. I forgot to add a bounds check before every inc(i,k) and continue. There should be a check such as *if i>length(UnknownStr) then exit(false);*
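A minimal sketch of how such a bounds check could be arranged, written for Free Pascal in objfpc mode. It is not the original function: the names IsUTF8Checked and ByteIn are hypothetical, the length test is placed before each read of a following byte rather than before inc(i,k), and all of #$00..#$7F is accepted as ASCII, which is an assumption that differs from the restricted ASCII set in the quoted code.

```pascal
program Utf8CheckSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: True only if position idx exists in s and holds a
// byte in lo..hi. The length test runs first, so s[idx] is never read
// past the end of the string.
function ByteIn(const s: string; idx: Integer; lo, hi: Char): Boolean;
begin
  Result := False;
  if idx > Length(s) then Exit;
  Result := (s[idx] >= lo) and (s[idx] <= hi);
end;

// Hypothetical replacement for IsUTF8: walks the string one sequence at a
// time; n is the length of the sequence starting at i, or 0 if invalid.
function IsUTF8Checked(const s: string): Boolean;
var
  i, n: Integer;
begin
  i := 1;
  while i <= Length(s) do
  begin
    n := 0;
    case s[i] of
      #$00..#$7F:                 // ASCII (assumption: control chars allowed)
        n := 1;
      #$C2..#$DF:                 // non-overlong 2-byte
        if ByteIn(s, i + 1, #$80, #$BF) then n := 2;
      #$E0:                       // 3-byte, excluding overlongs
        if ByteIn(s, i + 1, #$A0, #$BF) and
           ByteIn(s, i + 2, #$80, #$BF) then n := 3;
      #$E1..#$EC, #$EE..#$EF:     // straight 3-byte
        if ByteIn(s, i + 1, #$80, #$BF) and
           ByteIn(s, i + 2, #$80, #$BF) then n := 3;
      #$ED:                       // 3-byte, excluding surrogates
        if ByteIn(s, i + 1, #$80, #$9F) and
           ByteIn(s, i + 2, #$80, #$BF) then n := 3;
      #$F0:                       // 4-byte, excluding overlongs
        if ByteIn(s, i + 1, #$90, #$BF) and
           ByteIn(s, i + 2, #$80, #$BF) and
           ByteIn(s, i + 3, #$80, #$BF) then n := 4;
      #$F1..#$F3:                 // straight 4-byte
        if ByteIn(s, i + 1, #$80, #$BF) and
           ByteIn(s, i + 2, #$80, #$BF) and
           ByteIn(s, i + 3, #$80, #$BF) then n := 4;
      #$F4:                       // 4-byte, up to U+10FFFF
        if ByteIn(s, i + 1, #$80, #$8F) and
           ByteIn(s, i + 2, #$80, #$BF) and
           ByteIn(s, i + 3, #$80, #$BF) then n := 4;
    end;
    if n = 0 then Exit(False);    // invalid lead byte or truncated sequence
    Inc(i, n);
  end;
  Result := True;
end;

begin
  WriteLn(IsUTF8Checked('abc'));           // plain ASCII
  WriteLn(IsUTF8Checked(#$C3#$A9));        // complete 2-byte sequence
  WriteLn(IsUTF8Checked(#$C2));            // truncated 2-byte sequence
  WriteLn(IsUTF8Checked(#$ED#$A0#$80));    // encoded surrogate
end.
```

With this arrangement the lone #$C2 case from the quoted message returns false instead of reading past the end of the string, because ByteIn rejects any index beyond length(s) before touching the byte.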
2010/3/3 JoshyFun <[email protected]>

> Hello Lazarus-List,
>
> Wednesday, March 3, 2010, 12:24:35 AM, you wrote:
>
> RH> Please check the function I use to check UTF8 strings. Hope it is helpful
> RH> function IsUTF8(UnknownStr:string):boolean;
>
> Well, there are a lot of UTF8 strings that do not pass your checks ;)
> If you remove low ASCII control chars, what happens with UTF8 control
> chars?
>
> RH> var
> RH>   i :Integer;
> RH> begin
> RH>   if length(UnknownStr)=0 then exit(true);
> RH>   i:=1;
> RH>   while i<length(UnknownStr) do
> RH>   begin
> RH>     // ASCII
> RH>     if (UnknownStr[i] = #$09) or
> RH>        (UnknownStr[i] = #$0A) or
> RH>        (UnknownStr[i] = #$0D) or
> RH>        (UnknownStr[i] in [#$20..#$7E]) then
> RH>     begin
> RH>       inc(i);
> RH>       continue;
> RH>     end;
> RH>     // non-overlong 2-byte
> RH>     if (UnknownStr[i] in [#$C2..#$DF]) and
> RH>        (UnknownStr[i+1] in [#$80..#$BF]) then
> RH>     begin
>
> It should crash here with strings like:
>
> var
>   s: string;
> begin
>   s:=#$C2;
>   IsUTF8(s);
> end;
>
> which is not valid UTF8.
>
> RH>     // excluding surrogates
> RH>     ((UnknownStr[i]=#$ED) and
> RH>      (UnknownStr[i+1] in [#$80..#$9F]) and
> RH>      (UnknownStr[i+2] in [#$80..#$BF])) then
>
> Surrogates are not valid UTF8 codepoints.
>
> --
> Best regards,
> JoshyFun
>
> --
> _______________________________________________
> Lazarus mailing list
> [email protected]
> http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
