RE: [lazarus] String functions on non latin text

Panagiotis Sidiropoulos Tue, 28 Feb 2006 08:10:04 -0800

For anyone interested in a UTF8Pos function, here is one as suggested by
Vincent and Mattias:


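// Find the character position of a substring in a UTF-8 string
function UTF8Pos( cSearchFor, cSearchInto: UTF8String ): integer;
var
   nPos: integer;

begin
     // Byte position of the match (pos works on bytes)
     nPos := pos( cSearchFor, cSearchInto );
     // Characters before the match, plus one, give the 1-based position
     if nPos > 0 then
        Result := UTF8Length( copy( cSearchInto, 1, nPos - 1 ) ) + 1
     else Result := 0;
end;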

Now I'm trying to write a UTF8Copy function that returns a specific
number of characters (not bytes) from a string. Here is what I've done
so far. It does not work correctly. Do you think I'm on the right track,
or is there another, smarter way to do this?

// Copy a number of characters (not bytes) from a UTF-8 string
function UTF8Copy( cCopyFrom: UTF8String;
                   nFromPosition, nNoOfChars: integer ): UTF8String;
var
   i,
   nUTF8Len,
   nByteLen,
   nStart: integer;

begin
     Result := '';
     nUTF8Len := UTF8Length( cCopyFrom );
     if nFromPosition > nUTF8Len then exit;

     nByteLen := Length( cCopyFrom );
     nStart := 0;
     for i := 1 to nByteLen do begin

         if UTF8Length( copy( cCopyFrom, 1, i ) ) = nFromPosition then
            nStart := i + 1;
         if ( nStart > 0 ) and
            ( UTF8Length( copy( cCopyFrom, nStart, i ) ) = nNoOfChars )
            then break;

     end;
     Result := copy( cCopyFrom, nStart, i );
end;
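
One possible way to approach the UTF8Copy question is sketched below. It
is only a sketch under stated assumptions, not the LCL's own routine: the
names IsLeadByte, CharToByteIndex and UTF8CopySketch are hypothetical, and
the input is assumed to be valid UTF-8. Two things appear to go wrong in
the attempt above: the third argument of copy is a byte count rather than
an end index, and nStart is set to the byte after the nFromPosition-th
character begins rather than to its first byte. The sketch sidesteps both
by converting the two character positions to byte indexes first and
copying once.

```pascal
{$mode objfpc}{$H+}

// True when the byte starts a character, i.e. it is not a UTF-8
// continuation byte of the form 10xxxxxx.
function IsLeadByte(c: Char): Boolean;
begin
  Result := (Ord(c) and $C0) <> $80;
end;

// Hypothetical helper: byte index at which the nCharPos-th character
// (1-based) starts, or Length(s) + 1 if s has fewer characters.
function CharToByteIndex(const s: string; nCharPos: Integer): Integer;
var
  i, nSeen: Integer;
begin
  nSeen := 0;
  for i := 1 to Length(s) do
    if IsLeadByte(s[i]) then
    begin
      Inc(nSeen);
      if nSeen = nCharPos then
        Exit(i);
    end;
  Result := Length(s) + 1;
end;

// Sketch: copy nNoOfChars characters starting at character
// nFromPosition (1-based). Assumes s holds valid UTF-8.
function UTF8CopySketch(const s: string;
                        nFromPosition, nNoOfChars: Integer): string;
var
  nStart, nStop: Integer;
begin
  Result := '';
  if (nFromPosition < 1) or (nNoOfChars < 1) then
    Exit;
  // First byte of the first wanted character
  nStart := CharToByteIndex(s, nFromPosition);
  // First byte of the character just past the wanted range
  nStop := CharToByteIndex(s, nFromPosition + nNoOfChars);
  Result := Copy(s, nStart, nStop - nStart);
end;
```

For a string holding three two-byte Greek letters (6 bytes),
UTF8CopySketch(s, 2, 1) converts the character positions 2 and 3 to the
byte indexes 3 and 5 and so copies exactly the two bytes of the second
letter. Asking for more characters than remain simply returns the rest of
the string.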

Panagiotis

-----Original Message-----
From: Mattias Gaertner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 28, 2006 12:24 PM
To: [email protected]
Subject: Re: [lazarus] String functions on non latin text


On Tue, 28 Feb 2006 09:57:09 +0200
"Panagiotis Sidiropoulos" <[EMAIL PROTECTED]> wrote:

> >so if there is something wrong with the sample I thought it should
> >be gtk2 and the only problem I found was the position returned 
> >mismatched visually the substring
> 
> I tried to find a relation between the results but there is no
> pattern of any kind; for example, the first character gives 1, the
> second 3 and the 21st gives 41. The visual mismatch is the problem:
> I need to rearrange characters for indexing reasons and can't trace
> which character is which in the conversion table.

Jesus is right.
UTF8 is a multi-byte character encoding. This means a character has a
size varying from 1 to 4 bytes. To get the character position use:

BytePos:=System.Pos(search,text);
if BytePos>0 then
  // characters before the match, plus one, give a 1-based position
  CharPos:=UTF8Length(PChar(text),BytePos-1)+1
else
  CharPos:=0;
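
To illustrate why the byte position and the character position differ,
here is a small sketch that does not depend on the LCL. It assumes valid
UTF-8 input; CountUTF8Chars and CharPosSketch are hypothetical names used
only for illustration. Every character contributes exactly one byte that
is not a continuation byte (10xxxxxx), so counting those bytes in front
of the match gives the number of characters before it.

```pascal
{$mode objfpc}{$H+}

// Hypothetical helper: number of UTF-8 characters that start within
// the first nByteCount bytes of s. Continuation bytes (10xxxxxx) are
// skipped, so each character is counted once.
function CountUTF8Chars(const s: string; nByteCount: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if nByteCount > Length(s) then
    nByteCount := Length(s);
  for i := 1 to nByteCount do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

// Sketch: 1-based character position of sSearch in sText, 0 if absent.
function CharPosSketch(const sSearch, sText: string): Integer;
var
  nBytePos: Integer;
begin
  // Pos compares bytes, which is unambiguous for valid UTF-8
  nBytePos := Pos(sSearch, sText);
  if nBytePos > 0 then
    Result := CountUTF8Chars(sText, nBytePos - 1) + 1
  else
    Result := 0;
end;
```

With text made only of two-byte Greek letters, the 21st letter starts at
byte 41: the 40 bytes before it hold 20 characters, so the sketch reports
position 21, which matches the numbers seen in the sample.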

  
Mattias


> 
> I will try to update Lazarus and FPC, just to be sure.
> 
> Panagiotis
> 
> -----Original Message-----
> From: Jesus Reyes [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 28, 2006 8:30 AM
> To: [email protected]
> Subject: Re: [lazarus] String functions on non latin text
> 
> 
> 
> ----- Original Message -----
> From: "Mattias Gaertner" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Monday, February 27, 2006 2:45 PM
> Subject: Re: [lazarus] String functions on non latin text
> 
> 
> > On Mon, 27 Feb 2006 13:41:13 -0600 (CST)
> > Jesus Reyes <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > >  --- Panagiotis Sidiropoulos <[EMAIL PROTECTED]> wrote:
> > > 
> > > > Please download sample project at:
> > > > - www.magentadb.gr/ftp/pos-sample.zip
> > > > 
> > > > Panagiotis
> > > > 
> > > 
> > > result := Pos(UTF8Decode(SubStr), UTF8Decode(Str));
> > > 
> > > seems to work, I think Pos(UTF8String,UTF8String) is yet to be
> > > implemented.
> > 
> > It does not need to be implemented. One nice feature of UTF8 is
> > that you can find the start of a UTF8 character without parsing
> > the whole string. A simple substring search works with UTF8 and is
> > unambiguous.
> 
> I guess it would depend on what the pos function's return value is
> needed for. If some feedback should be given to the user about the
> position where the substring matched, then the current pos function
> does not return a visually right position; I mean, counting
> characters from left to right, the correct position should be 21,
> not 41.
> 
> If the value is to be used with other string functions then the
> return value is right.
> 
> If the function is ever implemented I think it should be for
> something like pos(UTFString,UTFString) where UTFString should
> represent any UTF encoding in use. Unlikely? maybe :D
> 
> > On the other hand: UTF8Decode will fail on some character sets, not
> > fitting into 2byte characters.
> 
> it seems to have support for at least 3 byte chars. I didn't test 
> tho..
> 
> > 
> > My guess, why a simple Pos does not work for Panagiotis, is either
> > an FPC bug or a gtk1 bug with Greek characters.
> > 
> 
> I compiled the test first for gtk1 and the results looked right to
> me, so if there is something wrong with the sample I thought it
> should be gtk2, and the only problem I found was that the position
> returned visually mismatched the substring.
> 
> > 
> > Mattias
> > 
> 
> Jesus Reyes A.
> 
> 
> _________________________________________________________________
>      To unsubscribe: mail [EMAIL PROTECTED] with
>                 "unsubscribe" as the Subject
>    archives at http://www.lazarus.freepascal.org/mailarchives
> 