Re: [Lazarus] UTF-8 string recognition

Hans-Peter Diettrich Wed, 03 Mar 2010 09:36:18 -0800

JoshyFun wrote:

RH> Please check the function I use to check for UTF-8 strings. I hope it is helpful.
RH> function IsUTF8(UnknownStr:string):boolean;


Well, there are a lot of UTF-8 strings that do not pass your checks ;)
If you remove the low ASCII control chars, what happens with the UTF-8
control chars?

RH> var
RH>     i    :Integer;
RH> begin
RH>     if length(UnknownStr)=0 then exit(true);
RH>     i:=1;
RH>     while i<length(UnknownStr) do
RH>     begin
RH>         // ASCII
RH>         if  (UnknownStr[i] = #$09) or
RH>             (UnknownStr[i] = #$0A) or
RH>             (UnknownStr[i] = #$0D) or
RH>             (UnknownStr[i] in [#$20..#$7E]) then
RH>         begin
RH>             inc(i);
RH>             continue;
RH>         end;
RH>         // non-overlong 2-byte
RH>         if  (UnknownStr[i] in [#$C2..#$DF]) and
RH>             (UnknownStr[i+1] in [#$80..#$BF]) then
RH>         begin

It should crash here with strings like:

var
 s: string;
begin
 s:=#$C2;
 IsUTF8(s);
end;

which is not valid UTF8.

That's correct. A possible workaround would be the use of PChars, which can safely access the appended #0.
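
To illustrate the point about the terminator, here is a minimal sketch. It assumes Free Pascal with long strings ({$H+}) and the default short-circuit boolean evaluation ({$B-}); the helper name StartsWithTwoByteSeq is hypothetical and not part of the function under discussion.

```pascal
program PCharTerminator;
{$mode objfpc}{$H+}

// Hypothetical helper, not from the thread: reports whether s begins with
// a well-formed 2-byte sequence, without indexing past the terminator.
function StartsWithTwoByteSeq(const s: string): Boolean;
var
  p: PChar;
begin
  p := PChar(s);  // for an empty string this still points at a #0 byte
  // p[1] is only evaluated when p[0] is a lead byte (so not #0); the
  // furthest byte ever read is therefore the terminating #0 itself.
  Result := (p[0] in [#$C2..#$DF]) and (p[1] in [#$80..#$BF]);
end;

begin
  WriteLn(StartsWithTwoByteSeq(#$C2));      // lone lead byte, as in the example above
  WriteLn(StartsWithTwoByteSeq(#$C3#$A9));  // the two bytes of U+00E9
end.
```

With the lone #$C2 the second test reads the #0 terminator, which is not a continuation byte, so the check fails cleanly instead of indexing past the end of the string.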


I'd suggest a state machine or the like for the implementation:
  while True do
    case p^ of
    #0: break; //done, okay if past the end of the string
    #8, #10, #12, #13, ' '..#$7E: inc(p); //okay
    #$C0..#$DF: ... //2 bytes
    #$E0..#$EF: ... //3 bytes
    #$F0..#$F4: ... //4 bytes
    else exit(False); //not valid text
    end;
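
One way the elided branches might be filled in is sketched below. This is only an illustration under stated assumptions, not a definitive implementation: it assumes Free Pascal with {$H+}, accepts the control characters from the quoted function (#9, #10, #13), starts the 2-byte range at #$C2 so that the overlong lead bytes #$C0 and #$C1 are rejected, and treats an embedded #0 as invalid text. The variables pEnd, lo and hi are names introduced here for the sketch.

```pascal
program Utf8Check;
{$mode objfpc}{$H+}

// Sketch of a PChar-based check. Each continuation byte is tested before
// the next one is read, so a truncated sequence stops at the terminating
// #0 and never reads beyond it.
function IsUTF8(const s: string): Boolean;
var
  p, pEnd: PChar;
  lo, hi: Char;  // allowed range for the byte following the lead byte
begin
  p := PChar(s);
  pEnd := p + Length(s);
  while True do
    case p^ of
      #0:
        Exit(p = pEnd);  // okay only at the terminator, not at an embedded #0
      #9, #10, #13, ' '..#$7E:
        Inc(p);
      #$C2..#$DF:  // 2 bytes
        begin
          if not (p[1] in [#$80..#$BF]) then Exit(False);
          Inc(p, 2);
        end;
      #$E0..#$EF:  // 3 bytes
        begin
          lo := #$80; hi := #$BF;
          if p^ = #$E0 then lo := #$A0;  // reject overlong forms
          if p^ = #$ED then hi := #$9F;  // reject UTF-16 surrogates
          if (p[1] < lo) or (p[1] > hi) then Exit(False);
          if not (p[2] in [#$80..#$BF]) then Exit(False);
          Inc(p, 3);
        end;
      #$F0..#$F4:  // 4 bytes
        begin
          lo := #$80; hi := #$BF;
          if p^ = #$F0 then lo := #$90;  // reject overlong forms
          if p^ = #$F4 then hi := #$8F;  // nothing above U+10FFFF
          if (p[1] < lo) or (p[1] > hi) then Exit(False);
          if not (p[2] in [#$80..#$BF]) then Exit(False);
          if not (p[3] in [#$80..#$BF]) then Exit(False);
          Inc(p, 4);
        end;
    else
      Exit(False);  // not valid text
    end;
end;

begin
  WriteLn(IsUTF8('plain ASCII'));
  WriteLn(IsUTF8(#$C3#$A9));  // U+00E9 encoded in two bytes
  WriteLn(IsUTF8(#$C2));      // truncated 2-byte sequence
end.
```

Whether other control characters or an embedded #0 should count as valid text is a policy decision; the first two case labels are the place to adjust it.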

DoDi


