Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
So IMHO there must be: 1. allocated space in record buffer in size 4*TFieldDef.Size+1 2. redefine meaning of Size property (as number of bytes not characters) and create fielddefs with Size*4 Yes, those are the possible solutions. Good thing about the second option, is that a user can do that on his own if he wants to use UTF-8, just create persistent fields with a field size of 4*the amount of characters. I'm not sure if we have to change this. It's a problem the programmer has to deal with, I think... I see here these possible problems/disadvantages: 1. In many cases (dynamic build or ad-hoc queries) is creating persistent fields not very effective (or complicated) 2. allocation of space in record buffer is based on TFieldDef objects (see TCustomBufDataset.GetFieldSize) and TFieldDef object are created by TSQLConnectors in AddFieldDefs method, so setting Size in persistent field does not solve whole problem, because each SQLConnector must set Size also 3. in TStringField is Size used also to determine default DisplayWidth (for TDBGrid) and in Delphi also for setting MaxLength in TDBEdit (so here we can see, that Size is used like max.number of characters rather than bytes) 4. incompatibility of Delphi (if we reclasiffy Size as number of bytes not characters) So I would prefer 1st way (increase buffer size, may be if we will support only BMP then 3*Size+1 will be sufficient) So Size remains as character length hm, according to http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size number of characters but according to http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size is Size number of bytes in underlaying database Yes, that's indeed the problem. But there's also the .DataSize property, so we could use that. Yes, but DataSize is defined only in TField not in TFieldDef class If we will add DataSize property also for TFieldDef class and overload TFieldDef.Create method with additional parameter DataSize, then SQLConnector would specify both information: character size (for displaying purposes) and byte size (for buffer purposes) Maybe... if the pressure on the bugtracker gets too high, I'll bow and change this. (I think 25% of all existing db bugs are related to this and people who do not understand anything about encodings.) but TField is created from TFieldDef and TField.Size=TFieldDef.Size ... so isn't it curious ? Not that when you want to use UTF-16 (or 32) you have to use TWideStringFields. So TWideStringField is no-encoding-agnostic field (is it designed to be everytime UTF-16 encoded) ? No. It's designed to contain an array of two-bytes records. In fact you can use it to store UCS-2 data, but not UTF-16. Same story as with ansi/UTF-8. Yes I understand now. Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
So this is answer, which i have looked for: In Lazarus TStringField MUST hold UTF-8 encoded strings. Not entirely true. You could also choose to bind the fields to some Lazarus-components manually, not using the db-components. IMHO most of gui database applications use controls like TDBGrid or TDBEdit so they should display correct values by default without extra coding (or at least provide some standardized support ... ) (Tedit.Text := convertFunc(StringField.Text)) Or you can add a hook so that the .text property always does a conversion to UTF-8. First option can be used if you use a mediator or view. Second options I woudn't use. Rofl. You mean that Microsoft SQL Server can't handle unicode completely? Completely not, but only UCS-2 (no UTF-8) SQL Server provides non-UNICODE datatypes - char, varchar, text ie: TStringField Yes, but ODBC driver returns data in ANSI codepage (no possibility to force them return UTF-8) This I can fix by patch in TODBCConnection LoadField like this: (so I convert to UTF-8 in connector method, when driver is unable return UTF-8) begin Res:=SQLGetData(ODBCCursor.FSTMTHandle, FieldDef.Index+1, SQL_C_CHAR, buffer, FieldDef.Size, @StrLenOrInd); + if CharSet='ANSI' then //hack for Microsoft SQL Server +StrPLCopy(buffer, UTF8Encode(PChar(buffer)), FieldDef.Size); end; and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext ie: TWideStringField. Yes, in this case ODBC driver returns data in UCS-2, this data are written into WideString buffer, which seems correct, but in DBGrid are displayed ? instead of characters with diacritical marks (IMHO because widestringmanager in Windows converts WideString to ANSI string , not UTF-8 string). This can be fixed by using OnGetText method of field: aText:=UTF8Encode(Sender.AsString); Which is not user friendly, because requires hacking in user code in every TWideStringField in every TSQLQuery It can be also fixed in fields.inc: function TWideStringField.GetAsString: string; begin +{$IFDEF WINDOWS} + Result := UTF8Encode(GetAsWideString); +{$ELSE} Result := GetAsWideString; +{$ENDIF} end; So what is the expected encoding of data written into TWideStringField ... or is there way how to get correct results id DBGrid without above mentioned workarounds ? SQL Server ODBC driver supports AutoTranslate, see: http://msdn.microsoft.com/en-us/library/ms130822.aspx SQL Server char, varchar, or text data sent to a client SQL_C_CHAR variable is converted from character to Unicode using the server ACP, then converted from Unicode to character using the client ACP. This is what you use when you set the encoding when you connect to the client. The solution to all your problems. As explained three times, in this message alone. In fact it's simple: incoming data=outgoing data. If you need UTF-8 encoding for the outgoing data (direct access to Lazarus controls) you have to select UTF-8 at the input. Yes, but as I wrote such possibility does not exists with Microsoft SQL Server (and also I think Access) (it seems, that Microsoft does not like UTF-8 and prefers UTF-16 (UCS-2)) And, luckily, you can instruct the Database-server which encoding to use when it's communicating with the outer world. So your problem is solved. When it is possiblem then yes. Now, if you also choose UTF-8 as the Database-server field encoding (the encoding the data is stored in) there's no conversion necessary at all. Yes if DB supports UTF-8 -Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
Didn't I explain this to you and others a few times? ;-) If so, then please excuse me The database-components itself are encoding-agnostic. This means: encoding in = encoding out. So it is up to the developer what codepage he want to use. So TField.Text can have the encoding _you_ want. So, if you want to work with Lazarus, which uses UTF-8, you have to use UTF-8 encoded strings in your database. So this is answer, which i have looked for: In Lazarus TStringField MUST hold UTF-8 encoded strings. But I guess (I have theory), that in time, when Borland introduced TStringField, the design goal was: TStringField was designed for SBCS (because DataSize=Size+1) string data encoded in system ANSI code page and TWideStringField was designed for DBCS widestring (UTF-16) character data May be, that I was mistaken by this view. (or may be, that there is different approach in Delphi (no agnostic) and different in FPC (agnostic)?) If there is some strange reason why you don't want the strings in your database to be UTF-8 encoded, SQL Server does not support UTF-8 (AFAIK) SQL Server provides non-UNICODE datatypes - char, varchar, text and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext you have to convert the strings from the encoding your database uses to UTF-8 while reading data from the database. Luckily, you can specify the encoding of strings you want to use for most databases. Not only the encoding in which the strings are stored, but also the encoding which has to be used when you send and retrieve data from the database. And you can set this for each connection made. Ie: you can resolve the problem by changing the connection-string, or by adding some connection-parameter. Yes, it is true for example for MySQL or Firebird ODBC driver, but for SQL Server or PostgreSQL ODBC driver there are no such options (but PostgreSQL ODBC driver exists in ANSI and UNICODE version) SQL Server ODBC driver supports AutoTranslate, see: http://msdn.microsoft.com/en-us/library/ms130822.aspx SQL Server *char*, *varchar*, or *text* data sent to a client SQL_C_CHAR variable is converted from character to Unicode using the server ACP, then converted from Unicode to character using the client ACP. There's also another solution you can find on the forum and other places. You can convert the strings to UTF-8 not only when they are read from the database, but also when they are read from the internal memory. There's a hook for that. Thanks for your patience -Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
Joost van der Sluis wrote / nap?sal(a): On Wed, 2011-01-12 at 14:59 +0100, LacaK wrote: No. It is mandatory that you send/receive UTF8 to/from GUI LCL elements. As LCL elements are using TStringField.Text property, then this property should return UTF8String, right (not AnsiString in ANSI code page) ? If yes, then also TStringField must store internaly data in any unicode format (to not lose any characters), right ? So it can be UTF-8, UTF-16 or UTF-32 ... in all cases we must allocate space 4*[max.number of characters in field], right ? So in what encoding are string data stored now in TStringField ? The encoding you've specified. In the connection-string or some other database-server dependent setting. ok. But then there is problem in buffer size allocated for TStringField (ftString), does not ? See please at bug report: http://bugs.freepascal.org/view.php?id=17376 There is described situation with SQLite (TSQLite3Connectin) , which returns UTF-8 strings, so there is no problem in encoding, but problem is in fact, that for char(n),varchar(n) fields is created TStringField with Size=n and in record buffer is also allocated space with Size+1, where n is number of characters (not bytes). So truncation of data occurs, when writting UTF-8 encoded string into record buffer. So IMHO there must be: 1. allocated space in record buffer in size 4*TFieldDef.Size+1 (and so on) or 2. redefine meaning of Size property (as number of bytes not characters) and create fielddefs with Size*4 hm, according to http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size number of characters but according to http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size is Size number of bytes in underlaying database but TField is created from TFieldDef and TField.Size=TFieldDef.Size ... so isn't it curious ? Not that when you want to use UTF-16 (or 32) you have to use TWideStringFields. So TWideStringField is no-encoding-agnostic field (is it designed to be everytime UTF-16 encoded) ? -Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
L Yes in UNIX world it may be so (I do not know), L but in Windows ODBC we have no such possibility AFAIK Quote from Microsoft: The ODBC 3.5 (or higher) Driver Manager supports both ANSI and Unicode versions of all functions that accept pointers to character strings or SQLPOINTER in their arguments. The Unicode functions are implemented as functions (with a suffix of W), not as macros. The ANSI functions (which can be called with or without a suffix of A) are identical to the current ODBC API functions. ODBC 3.5 was launched around 2000-2001. But this approach will require changes in packages/odbc/src/odbcsql.inc like, does not ?: -pointer(SQLGetData) := GetProcedureAddress(ODBCLibraryHandle,'SQLGetData'); +pointer(SQLGetData) := GetProcedureAddress(ODBCLibraryHandle,'SQLGetDataW'); And I do not know how it affect compatibility for example in UNIX or if all ODBC drivers support this functionality. But also in this case we will get UTF-16 widestrings (in Windows) not UTF-8, does not ? -Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
Most probably it needs a flag to indicate that the ODBC must work in Unicode, and then dynamic link to *W functions if this flag is set. I think ODBC drivers since 2002+/- should have this set of APIs, but I had never used ODBC in my life... :) It seems, that Driver Manager automaticaly performs conversions on non-unicode driver. See http://web.datadirect.com/resources/odbc/unicode/odbc-driver.html ( and http://web.datadirect.com/resources/odbc/unicode/char-background.html ) If the driver is a non-Unicode driver, it cannot understand W function calls, and the Driver Manager must convert them to ANSI calls before sending them to the driver. Also it seems to me, that when you call ANSI version of ODBC API functions, then you receive data in ANSI encoding. If it is so, then it is always safe use ansitoutf8() (or UTF8Encode()) on receved data. L But also in this case we will get UTF-16 widestrings (in Windows) not L UTF-8, does not ? That's not important, you get unicode in the specified by the API format, then SQLConnector fills information in the expected target format (WideString, UTF8String over AnsiString, Raw bytes...). In Windows we get UTF-16, in Linux/UNIX we get UTF-8 So it is so, that in Windows is widestring=UTF-16 and in Linux/UNIX is widestring=UTF-8 string ? (So there is different meaning of widestring type on different OSeses ? I am only Windows developer, so I have no understanding of others OSeses ;-)) -Laco. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
On Thu, 2011-01-13 at 09:15 +0100, LacaK wrote: Didn't I explain this to you and others a few times? ;-) If so, then please excuse me The database-components itself are encoding-agnostic. This means: encoding in = encoding out. So it is up to the developer what codepage he want to use. So TField.Text can have the encoding _you_ want. So, if you want to work with Lazarus, which uses UTF-8, you have to use UTF-8 encoded strings in your database. So this is answer, which i have looked for: In Lazarus TStringField MUST hold UTF-8 encoded strings. Not entirely true. You could also choose to bind the fields to some Lazarus-components manually, not using the db-components. (Tedit.Text := convertFunc(StringField.Text)) Or you can add a hook so that the .text property always does a conversion to UTF-8. First option can be used if you use a mediator or view. Second options I woudn't use. But I guess (I have theory), that in time, when Borland introduced TStringField, the design goal was: TStringField was designed for SBCS (because DataSize=Size+1) string data encoded in system ANSI code page and TWideStringField was designed for DBCS widestring (UTF-16) character data You have to be really careful in what you type, when you are writing about encodings. The above is nonsense, because of a very tiny mistake. If you compare DBCS widestring with UTF-16, you can also compare a stringfield with UTF-8. Exactly the same problem. (A character can be made up from more then one UTF-8 or UTF-16 codepoint) But TStringField's datasize by default is indeed Size+1. So if you use it t store UTF-8, you have to define the size as four times the field-size given by the database. Note that this is done in some cases. May be, that I was mistaken by this view. (or may be, that there is different approach in Delphi (no agnostic) and different in FPC (agnostic)?) No, Delphi does the same. Only newer Delphi versions have a string-type which contains the used encoding (details can be found in this thread), so can do some conversions for you. But that has nothing to do with the database-code. Also, you don't need it. People all over the world have used older Delphi versions all the time... (But offcourse, it's easier now) If there is some strange reason why you don't want the strings in your database to be UTF-8 encoded, SQL Server does not support UTF-8 (AFAIK) Rofl. You mean that Microsoft SQL Server can't handle unicode completely? If they say that in an advertisement they can forget that any big commercial client will choose their product... SQL Server provides non-UNICODE datatypes - char, varchar, text ie: TStringField and UNICODE (UCS-2) datatypes - nchar, nvarchar, ntext ie: TWideStringField. What does this have to do with your problem? Nothing. Only things what matters is what encoding is used while communicating with the client. (Which you can set) you have to convert the strings from the encoding your database uses to UTF-8 while reading data from the database. Luckily, you can specify the encoding of strings you want to use for most databases. Not only the encoding in which the strings are stored, but also the encoding which has to be used when you send and retrieve data from the database. And you can set this for each connection made. Ie: you can resolve the problem by changing the connection-string, or by adding some connection-parameter. Yes, it is true for example for MySQL or Firebird ODBC driver, but for SQL Server or PostgreSQL ODBC driver there are no such options Then that option has to be added. I think it's already possible but you simply don't know how. (Sql-Server is ODBC only, so that one is fixed. For firebird there's a 'serverencoding' parameter, or something like that. Postgres also has some setting. (but PostgreSQL ODBC driver exists in ANSI and UNICODE version) I saw that in an earlier message, but also this has nothing to do with your problem. You only need the different calls when you want to use UTF-8 in your fieldnames. (Or, and this one was tricky, in the connection-string. But this was more then a year ago.) SQL Server ODBC driver supports AutoTranslate, see: http://msdn.microsoft.com/en-us/library/ms130822.aspx SQL Server char, varchar, or text data sent to a client SQL_C_CHAR variable is converted from character to Unicode using the server ACP, then converted from Unicode to character using the client ACP. This is what you use when you set the encoding when you connect to the client. The solution to all your problems. As explained three times, in this message alone. In fact it's simple: incoming data=outgoing data. If you need UTF-8 encoding for the outgoing data (direct access to Lazarus controls) you have to select UTF-8 at the input. That's always more efficient than converting the data to/from any other encoding. And, luckily, you can instruct the Database-server which encoding
Re: [fpc-devel] TStringField, String and UnicodeString and UTF8String
On Thu, 2011-01-13 at 09:49 +0100, LacaK wrote: So IMHO there must be: 1. allocated space in record buffer in size 4*TFieldDef.Size+1 2. redefine meaning of Size property (as number of bytes not characters) and create fielddefs with Size*4 Yes, those are the possible solutions. Good thing about the second option, is that a user can do that on his own if he wants to use UTF-8, just create persistent fields with a field size of 4*the amount of characters. I'm not sure if we have to change this. It's a problem the programmer has to deal with, I think... hm, according to http://docwiki.embarcadero.com/VCL/XE/en/DB.TStringField.Size is Size number of characters but according to http://docwiki.embarcadero.com/VCL/en/DB.TFieldDef.Size is Size number of bytes in underlaying database Yes, that's indeed the problem. But there's also the .DataSize property, so we could use that. Maybe... if the pressure on the bugtracker gets too high, I'll bow and change this. (I think 25% of all existing db bugs are related to this and people who do not understand anything about encodings.) but TField is created from TFieldDef and TField.Size=TFieldDef.Size ... so isn't it curious ? Not that when you want to use UTF-16 (or 32) you have to use TWideStringFields. So TWideStringField is no-encoding-agnostic field (is it designed to be everytime UTF-16 encoded) ? No. It's designed to contain an array of two-bytes records. In fact you can use it to store UCS-2 data, but not UTF-16. Same story as with ansi/UTF-8. Joost. ___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel