JoshyFun wrote:
RH> Please check the function I use to check for UTF8 strings. Hope it is helpful.
RH> function IsUTF8(UnknownStr:string):boolean;
Well, there are a lot of UTF8 strings that do not pass your checks ;)
If you exclude the low ASCII control chars, what happens with the UTF8
control chars?
RH> var
RH> i :Integer;
RH> begin
RH> if length(UnknownStr)=0 then exit(true);
RH> i:=1;
RH> while i<length(UnknownStr) do
RH> begin
RH> // ASCII
RH> if (UnknownStr[i] = #$09) or
RH> (UnknownStr[i] = #$0A) or
RH> (UnknownStr[i] = #$0D) or
RH> (UnknownStr[i] in [#$20..#$7E]) then
RH> begin
RH> inc(i);
RH> continue;
RH> end;
RH> // non-overlong 2-byte
RH> if (UnknownStr[i] in [#$C2..#$DF]) and
RH> (UnknownStr[i+1] in [#$80..#$BF]) then
RH> begin
It should crash here with strings like:
var
s: string;
begin
s:=#$C2;
IsUTF8(s);
end;
which is not valid UTF8.
That's correct. A possible workaround would be to use PChars, which can
safely access the appended #0.
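To illustrate the point about the appended #0, here is a minimal sketch. It assumes Free Pascal with {$mode objfpc}{$H+} and no {$codepage} directive, so that character constants are stored as raw bytes. IsTwoByteSeq is a hypothetical helper written for this example, not part of the posted function.

```pascal
program TruncatedLead;
{$mode objfpc}{$H+}

// Hypothetical helper: True if p points at a non-overlong 2-byte sequence.
// If the string ends right after the lead byte, p[1] is the terminating #0,
// which is not a continuation byte, so no read goes past the string buffer.
function IsTwoByteSeq(p: PChar): Boolean;
begin
  Result := (p[0] in [#$C2..#$DF]) and (p[1] in [#$80..#$BF]);
end;

var
  s: string;
begin
  s := #$C2;                        // lone lead byte, truncated sequence
  WriteLn(IsTwoByteSeq(PChar(s)));  // FALSE
  s := #$C3#$A9;                    // U+00E9 encoded as two bytes
  WriteLn(IsTwoByteSeq(PChar(s)));  // TRUE
end.
```

With the index-based version, UnknownStr[i+1] on the one-byte string would read past Length(UnknownStr); here the same test simply fails on the #0.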
I'd suggest a state machine or the like for the implementation:
while True do
  case p^ of
    #0: break; //done, okay if past the end of the string
    #9, #10, #12, #13, ' '..#$7E: inc(p); //okay
    #$C2..#$DF: ... //2 bytes (#$C0 and #$C1 could only start overlong forms)
    #$E0..#$EF: ... //3 bytes
    #$F0..#$F4: ... //4 bytes
    else exit(False); //not valid text
  end;
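One way the elided branches could be filled in is sketched below. This is only an illustration under assumptions, not the implementation meant above: it assumes Free Pascal with {$mode objfpc}{$H+} and no {$codepage} directive, the name IsValidUTF8 and the variables minC/maxC are made up for the example, and the set of accepted ASCII control characters (tab, LF, FF, CR) is a choice that can be adjusted. The continuation-byte ranges follow the well-formed sequences of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF).

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

function IsValidUTF8(const s: string): Boolean;
var
  p, pStart: PChar;
  n: Integer;        // continuation bytes still expected
  minC, maxC: Char;  // allowed range for the next continuation byte
begin
  Result := False;
  pStart := PChar(s);
  p := pStart;
  while True do
  begin
    minC := #$80; maxC := #$BF;
    case p^ of
      #0: begin
            // okay only if this #0 is the terminator, not an embedded one
            Result := (p - pStart) = Length(s);
            Exit;
          end;
      #9, #10, #12, #13, ' '..#$7E: n := 0;     // accepted ASCII
      #$C2..#$DF: n := 1;                       // 2 bytes
      #$E0: begin n := 2; minC := #$A0; end;    // 3 bytes, no overlong
      #$E1..#$EC, #$EE, #$EF: n := 2;           // 3 bytes
      #$ED: begin n := 2; maxC := #$9F; end;    // 3 bytes, no surrogates
      #$F0: begin n := 3; minC := #$90; end;    // 4 bytes, no overlong
      #$F1..#$F3: n := 3;                       // 4 bytes
      #$F4: begin n := 3; maxC := #$8F; end;    // 4 bytes, max U+10FFFF
    else
      Exit;                                     // not valid text
    end;
    Inc(p);
    while n > 0 do
    begin
      // a terminating #0 fails this test, so truncated input is rejected
      if (p^ < minC) or (p^ > maxC) then Exit;
      minC := #$80; maxC := #$BF;
      Inc(p);
      Dec(n);
    end;
  end;
end;

begin
  WriteLn(IsValidUTF8('plain ASCII'));   // TRUE
  WriteLn(IsValidUTF8(#$C3#$A9));        // TRUE
  WriteLn(IsValidUTF8(#$C2));            // FALSE, truncated
  WriteLn(IsValidUTF8(#$C0#$80));        // FALSE, overlong
end.
```

Because every continuation byte is range-checked before the pointer advances, the loop never steps over the terminating #0, which is what makes the PChar approach safe here.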
DoDi