Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-14 Thread LacaK


  

So IMHO there must be:
1. allocated space in record buffer in size 4*TFieldDef.Size+1 



  

2. redefine meaning of Size property (as number of bytes not
characters) and create fielddefs with Size*4



Yes, those are the possible solutions. Good thing about the second
option, is that a user can do that on his own if he wants to use UTF-8,
just create persistent fields with a field size of 4*the amount of
characters. I'm not sure if we have to change this. It's a problem the
programmer has to deal with, I think...

  

I see here these possible problems/disadvantages:
1. In many cases (dynamic build or ad-hoc queries) is creating 
persistent fields not very effective (or complicated)
2. allocation of space in record buffer is based on TFieldDef objects 
(see TCustomBufDataset.GetFieldSize) and TFieldDef object are created by 
TSQLConnectors in AddFieldDefs method, so setting Size in persistent 
field does not solve whole problem, because each SQLConnector must set 
Size also
3. in TStringField is Size used also to determine default DisplayWidth 
(for TDBGrid) and in Delphi also for setting MaxLength in TDBEdit
(so here we can see, that Size is used like max.number of characters 
rather than bytes)
4. incompatibility of Delphi (if we reclasiffy Size as number of bytes 
not characters)


So I would prefer 1st way (increase buffer size, may be if we will 
support only BMP then 3*Size+1 will be sufficient)

So Size remains as character length


hm, according to
http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size
number of characters
but according to
http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size is Size number
of bytes in underlaying database



Yes, that's indeed the problem. But there's also the .DataSize property,
so we could use that.

  

Yes, but DataSize is defined only in TField not in TFieldDef class
If we will add DataSize property also for TFieldDef class and overload 
TFieldDef.Create method with additional parameter DataSize, then 
SQLConnector would specify both information: character size (for 
displaying purposes) and byte size (for buffer purposes)



Maybe... if the pressure on the bugtracker gets too high, I'll bow and
change this. (I think 25% of all existing db bugs are related to this
and people who do not understand anything about encodings.)

  

but TField is created from TFieldDef and
TField.Size=TFieldDef.Size ... so isn't it curious ?


Not that when you want to use UTF-16 (or 32) you have to use
TWideStringFields.

  
  

So TWideStringField is no-encoding-agnostic field (is it designed to
be everytime UTF-16 encoded) ?



No. It's designed to contain an array of two-bytes records. In fact you
can use it to store UCS-2 data, but not UTF-16. Same story as with
ansi/UTF-8.

  

Yes I understand now.

Laco.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-14 Thread LacaK



So this is answer, which i have looked for:
In Lazarus TStringField MUST hold UTF-8 encoded strings.



Not entirely true. You could also choose to bind the fields to some
Lazarus-components manually, not using the db-components.

IMHO most of gui database applications use controls like TDBGrid or TDBEdit
so they should display correct values by default without extra coding 
(or at least provide some standardized support ... )




 (Tedit.Text :=
convertFunc(StringField.Text)) Or you can add a hook so that the .text
property always does a conversion to UTF-8. First option can be used if
you use a mediator or view. Second options I woudn't use.

  
Rofl. You mean that Microsoft SQL Server can't handle unicode
completely? 
  

Completely not, but only UCS-2 (no UTF-8)

SQL Server provides non-UNICODE datatypes - char, varchar, text 



ie: TStringField
  
Yes, but ODBC driver returns data in ANSI codepage (no possibility to 
force them return UTF-8)

This I can fix by patch in TODBCConnection LoadField like this:
(so I convert to UTF-8 in connector method, when driver is unable return 
UTF-8)

   begin
 Res:=SQLGetData(ODBCCursor.FSTMTHandle, FieldDef.Index+1, 
SQL_C_CHAR, buffer, FieldDef.Size, @StrLenOrInd);

+  if CharSet='ANSI' then //hack for Microsoft SQL Server
+StrPLCopy(buffer, UTF8Encode(PChar(buffer)), FieldDef.Size);
   end;   

  

 and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext



ie: TWideStringField.
  
Yes, in this case ODBC driver returns data in UCS-2, this data are 
written into WideString buffer, which seems correct, but in DBGrid are 
displayed ? instead of characters with diacritical marks (IMHO because 
widestringmanager in Windows converts WideString to ANSI string , not 
UTF-8 string).
This can be fixed by using OnGetText method of field: 
aText:=UTF8Encode(Sender.AsString);
Which is not user friendly, because requires hacking in user code in 
every TWideStringField in every TSQLQuery

It can be also fixed in fields.inc:
function TWideStringField.GetAsString: string;
begin
+{$IFDEF WINDOWS}
+  Result := UTF8Encode(GetAsWideString);
+{$ELSE}
 Result := GetAsWideString;
+{$ENDIF}
end;

So what is the expected encoding of data written into TWideStringField 
... or is there way how to get correct results id DBGrid without above 
mentioned workarounds ?


  

 SQL Server ODBC driver supports AutoTranslate, see:
http://msdn.microsoft.com/en-us/library/ms130822.aspx
 SQL Server char, varchar, or text data sent to a client SQL_C_CHAR
variable is converted from character to Unicode using the server ACP,
then converted from Unicode to character using the client ACP.



This is what you use when you set the encoding when you connect to the
client. The solution to all your problems. As explained three times, in
this message alone.

In fact it's simple: incoming data=outgoing data.

If you need UTF-8 encoding for the outgoing data (direct access to
Lazarus controls) you have to select UTF-8 at the input.
Yes, but as I wrote such possibility does not exists with Microsoft SQL 
Server (and also I think Access)

(it seems, that Microsoft does not like UTF-8 and prefers UTF-16 (UCS-2))


And, luckily, you can instruct the Database-server which encoding to use
when it's communicating with the outer world. So your problem is solved.
  

When it is possiblem then yes.


Now, if you also choose UTF-8 as the Database-server field encoding (the
encoding the data is stored in) there's no conversion necessary at all.
  

Yes if DB supports UTF-8

-Laco.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread LacaK



Didn't I explain this to you and others a few times?
  

;-) If so, then please excuse me


The database-components itself are encoding-agnostic. This means:
encoding in = encoding out.

So it is up to the developer what codepage he want to use. So
TField.Text can have the encoding _you_ want.

So, if you want to work with Lazarus, which uses UTF-8, you have to use
UTF-8 encoded strings in your database. 
  

So this is answer, which i have looked for:
In Lazarus TStringField MUST hold UTF-8 encoded strings.

But I guess (I have theory), that in time, when Borland introduced 
TStringField, the design goal was:
TStringField was designed for SBCS (because DataSize=Size+1) string data 
encoded in system ANSI code page

and
TWideStringField was designed for DBCS widestring (UTF-16) character data

May be, that I was mistaken by this view.
(or may be, that there is different approach in Delphi (no agnostic) 
and different in FPC (agnostic)?)



If there is some strange reason why you don't want the strings in your
database to be UTF-8 encoded,

SQL Server does not support UTF-8 (AFAIK)
SQL Server provides non-UNICODE datatypes - char, varchar, text
and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext


 you have to convert the strings from the
encoding your database uses to UTF-8 while reading data from the
database.

Luckily, you can specify the encoding of strings you want to use for
most databases. Not only the encoding in which the strings are stored,
but also the encoding which has to be used when you send and retrieve
data from the database. And you can set this for each connection made.

Ie: you can resolve the problem by changing the connection-string, or by
adding some connection-parameter.

  

Yes, it is true for example for MySQL or Firebird ODBC driver,
but for SQL Server or PostgreSQL ODBC driver there are no such options
(but PostgreSQL ODBC driver exists in ANSI and UNICODE version)
SQL Server ODBC driver supports AutoTranslate, see: 
http://msdn.microsoft.com/en-us/library/ms130822.aspx
SQL Server *char*, *varchar*, or *text* data sent to a client 
SQL_C_CHAR variable is converted from character to Unicode using the 
server ACP, then converted from Unicode to character using the client ACP.

There's also another solution you can find on the forum and other
places. You can convert the strings to UTF-8 not only when they are read
from the database, but also when they are read from the internal memory.
There's a hook for that.

  

Thanks for your patience
-Laco.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread LacaK

Joost van der Sluis  wrote / nap?sal(a):

On Wed, 2011-01-12 at 14:59 +0100, LacaK wrote:

  

No. It is mandatory that you send/receive UTF8 to/from GUI LCL
elements. 
  
As LCL elements are using TStringField.Text property, then this property 
should return UTF8String, right (not AnsiString in ANSI code page) ?
If yes, then also TStringField must store internaly data in any unicode 
format (to not lose any characters), right ?
So it can be UTF-8, UTF-16 or UTF-32 ... in all cases we must allocate 
space 4*[max.number of characters in field], right ?

So in what encoding are string data stored now in TStringField ?



The encoding you've specified. In the connection-string or some other
database-server dependent setting.
  

ok.
But then there is problem in buffer size allocated for TStringField 
(ftString), does not ?

See please at bug report: http://bugs.freepascal.org/view.php?id=17376
There is described situation with SQLite (TSQLite3Connectin) , which 
returns UTF-8 strings, so there is no problem in encoding,
but problem is in fact, that for char(n),varchar(n) fields is created 
TStringField with Size=n and in record buffer is also allocated space 
with Size+1, where n is number of characters (not bytes). So truncation 
of data occurs, when writting UTF-8 encoded string into record buffer.

So IMHO there must be:
1. allocated space in record buffer in size 4*TFieldDef.Size+1 (and so on)
or
2. redefine meaning of Size property (as number of bytes not characters) 
and create fielddefs with Size*4
hm, according to 
http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size 
number of characters
but according to http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size 
is Size number of bytes in underlaying database


but TField is created from TFieldDef and TField.Size=TFieldDef.Size ... 
so isn't it curious ?

Not that when you want to use UTF-16 (or 32) you have to use
TWideStringFields.

  
So TWideStringField is no-encoding-agnostic field (is it designed to 
be everytime UTF-16 encoded) ?


-Laco.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread LacaK



L Yes in UNIX world it may be so (I do not know),
L but in Windows ODBC we have no such possibility AFAIK

Quote from Microsoft:
The ODBC 3.5 (or higher) Driver Manager supports both ANSI and
Unicode versions of all functions that accept pointers to character
strings or SQLPOINTER in their arguments. The Unicode functions are
implemented as functions (with a suffix of W), not as macros. The ANSI
functions (which can be called with or without a suffix of A) are
identical to the current ODBC API functions.

ODBC 3.5 was launched around 2000-2001.

  
But this approach will require changes in packages/odbc/src/odbcsql.inc 
like, does not ?:
-pointer(SQLGetData) := 
GetProcedureAddress(ODBCLibraryHandle,'SQLGetData');
+pointer(SQLGetData) := 
GetProcedureAddress(ODBCLibraryHandle,'SQLGetDataW');
And I do not know how it affect compatibility for example in UNIX or if 
all ODBC drivers support this functionality.


But also in this case we will get UTF-16 widestrings (in Windows) not 
UTF-8, does not ?


-Laco.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread LacaK



Most probably it needs a flag to indicate that the ODBC must work in
Unicode, and then dynamic link to *W functions if this flag is set. I
think ODBC drivers since 2002+/- should have this set of APIs, but I
had never used ODBC in my life... :)
  
It seems, that Driver Manager automaticaly performs conversions on 
non-unicode driver. See 
http://web.datadirect.com/resources/odbc/unicode/odbc-driver.html
( and 
http://web.datadirect.com/resources/odbc/unicode/char-background.html )
If the driver is a non-Unicode driver, it cannot understand W function 
calls, and the Driver Manager must convert them to ANSI calls before 
sending them to the driver.


Also it seems to me, that when you call ANSI version of ODBC API 
functions, then you receive data in ANSI encoding.
If it is so, then it is always safe use ansitoutf8() (or UTF8Encode()) 
on receved data.



L But also in this case we will get UTF-16 widestrings (in Windows) not
L UTF-8, does not ?

That's not important, you get unicode in the specified by the API
format, then SQLConnector fills information in the expected target
format (WideString, UTF8String over AnsiString, Raw bytes...).

  

In Windows we get UTF-16, in Linux/UNIX we get UTF-8
So it is so, that in Windows is widestring=UTF-16 and in Linux/UNIX is 
widestring=UTF-8 string ?
(So there is different meaning of widestring type on different OSeses ? 
I am only Windows developer, so I have no understanding of others OSeses 
;-))


-Laco.
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread Joost van der Sluis
On Thu, 2011-01-13 at 09:15 +0100, LacaK wrote:
 
  Didn't I explain this to you and others a few times?

 ;-) If so, then please excuse me
 
  The database-components itself are encoding-agnostic. This means:
  encoding in = encoding out.
  
  So it is up to the developer what codepage he want to use. So
  TField.Text can have the encoding _you_ want.
  
  So, if you want to work with Lazarus, which uses UTF-8, you have to use
  UTF-8 encoded strings in your database. 

 So this is answer, which i have looked for:
 In Lazarus TStringField MUST hold UTF-8 encoded strings.

Not entirely true. You could also choose to bind the fields to some
Lazarus-components manually, not using the db-components. (Tedit.Text :=
convertFunc(StringField.Text)) Or you can add a hook so that the .text
property always does a conversion to UTF-8. First option can be used if
you use a mediator or view. Second options I woudn't use.

 But I guess (I have theory), that in time, when Borland introduced
 TStringField, the design goal was:
 TStringField was designed for SBCS (because DataSize=Size+1) string
 data encoded in system ANSI code page and TWideStringField was
 designed for DBCS widestring (UTF-16) character data

You have to be really careful in what you type, when you are writing
about encodings. The above is nonsense, because of a very tiny mistake.

If you compare DBCS widestring with UTF-16, you can also compare a
stringfield with UTF-8. Exactly the same problem. (A character can be
made up from more then one UTF-8 or UTF-16 codepoint)

But TStringField's datasize by default is indeed Size+1. So if you use
it t store UTF-8, you have to define the size as four times the
field-size given by the database. Note that this is done in some cases.

 May be, that I was mistaken by this view.
 (or may be, that there is different approach in Delphi (no agnostic)
 and different in FPC (agnostic)?)

No, Delphi does the same. Only newer Delphi versions have a string-type
which contains the used encoding (details can be found in this thread),
so can do some conversions for you. But that has nothing to do with the
database-code. Also, you don't need it. People all over the world have
used older Delphi versions all the time... (But offcourse, it's easier
now)

  If there is some strange reason why you don't want the strings in your
  database to be UTF-8 encoded,
 SQL Server does not support UTF-8 (AFAIK)

Rofl. You mean that Microsoft SQL Server can't handle unicode
completely? If they say that in an advertisement they can forget that
any big commercial client will choose their product...

 SQL Server provides non-UNICODE datatypes - char, varchar, text 

ie: TStringField

  and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext

ie: TWideStringField.

What does this have to do with your problem? Nothing. Only things what
matters is what encoding is used while communicating with the client.
(Which you can set)

  you have to convert the strings from the
  encoding your database uses to UTF-8 while reading data from the
  database.
  
  Luckily, you can specify the encoding of strings you want to use for
  most databases. Not only the encoding in which the strings are stored,
  but also the encoding which has to be used when you send and retrieve
  data from the database. And you can set this for each connection made.
  
  Ie: you can resolve the problem by changing the connection-string, or by
  adding some connection-parameter.
  

 Yes, it is true for example for MySQL or Firebird ODBC driver,
  but for SQL Server or PostgreSQL ODBC driver there are no such
 options

Then that option has to be added. I think it's already possible but you
simply don't know how. (Sql-Server is ODBC only, so that one is fixed.
For firebird there's a 'serverencoding' parameter, or something like
that. Postgres also has some setting.

  (but PostgreSQL ODBC driver exists in ANSI and UNICODE version)

I saw that in an earlier message, but also this has nothing to do with
your problem. You only need the different calls when you want to use
UTF-8 in your fieldnames. (Or, and this one was tricky, in the
connection-string. But this was more then a year ago.)

  SQL Server ODBC driver supports AutoTranslate, see:
 http://msdn.microsoft.com/en-us/library/ms130822.aspx
  SQL Server char, varchar, or text data sent to a client SQL_C_CHAR
 variable is converted from character to Unicode using the server ACP,
 then converted from Unicode to character using the client ACP.

This is what you use when you set the encoding when you connect to the
client. The solution to all your problems. As explained three times, in
this message alone.

In fact it's simple: incoming data=outgoing data.

If you need UTF-8 encoding for the outgoing data (direct access to
Lazarus controls) you have to select UTF-8 at the input. That's always
more efficient than converting the data to/from any other encoding.

And, luckily, you can instruct the Database-server which encoding 

Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String

2011-01-13 Thread Joost van der Sluis
On Thu, 2011-01-13 at 09:49 +0100, LacaK wrote:

 So IMHO there must be:
 1. allocated space in record buffer in size 4*TFieldDef.Size+1 

 2. redefine meaning of Size property (as number of bytes not
 characters) and create fielddefs with Size*4

Yes, those are the possible solutions. Good thing about the second
option, is that a user can do that on his own if he wants to use UTF-8,
just create persistent fields with a field size of 4*the amount of
characters. I'm not sure if we have to change this. It's a problem the
programmer has to deal with, I think...

 hm, according to
 http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size
 number of characters
 but according to
 http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size is Size number
 of bytes in underlaying database

Yes, that's indeed the problem. But there's also the .DataSize property,
so we could use that.

Maybe... if the pressure on the bugtracker gets too high, I'll bow and
change this. (I think 25% of all existing db bugs are related to this
and people who do not understand anything about encodings.)

 but TField is created from TFieldDef and
 TField.Size=TFieldDef.Size ... so isn't it curious ?
  Not that when you want to use UTF-16 (or 32) you have to use
  TWideStringFields.
  

 So TWideStringField is no-encoding-agnostic field (is it designed to
 be everytime UTF-16 encoded) ?

No. It's designed to contain an array of two-bytes records. In fact you
can use it to store UCS-2 data, but not UTF-16. Same story as with
ansi/UTF-8.

Joost.

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel