Re: [fpc-devel] ansistrings and widestrings
----- Original Message -----
From: "peter green"
To: "FPC developers' list"
Sent: Sunday, January 09, 2005 11:45 PM
Subject: RE: [fpc-devel] ansistrings and widestrings

> > Type
> >   // Length parameters are numbers of CHARS, not bytes
> >   TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt; dest:pansichar; destlen:SizeInt): SizeInt;
> >   TAnsi2WideMove=function(source:pansichar; srclen:SizeInt; dest:pwidechar; destlen:SizeInt): SizeInt;
> >
> > These functions should return the actual number of characters in the output. Returning ZERO should indicate insufficient destination size.
>
> yes these would be workable but they seem to me to be a horrible C-ism
>
> what's wrong with
>
>   twidestringtoansistring=procedure(const source : widestring;var dest : ansistring);
>   tansistringtowidestring=procedure(const source : ansistring;var dest : widestring);

Because we need to transform from PChar to WideString, or from PWideChar to AnsiString, or from Array [0..xx] of Char to WideString, etc.

_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
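The argument for pointer-based prototypes can be sketched as follows: one pointer/length routine can be fed from a character array, an AnsiString or a PChar. This is only an illustration under assumptions. Latin1ToWide and ToWide are hypothetical names, and the ISO-8859-1 widening stands in for a real code page conversion.

```pascal
program PointerSourcesDemo;
{$mode objfpc}{$H+}

// Hypothetical converter with the pointer/length shape proposed in the thread.
// It widens ISO-8859-1 bytes one-to-one and is only a stand-in for a real
// code page conversion. Returns the number of characters written, or 0 when
// the destination is too small.
function Latin1ToWide(source: PAnsiChar; srclen: SizeInt;
                      dest: PWideChar; destlen: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  if destlen < srclen then
    Exit(0);
  for i := 0 to srclen - 1 do
    dest[i] := WideChar(Ord(source[i]));
  Result := srclen;
end;

// One wrapper serves every source that can hand over a pointer and a length.
function ToWide(p: PAnsiChar; len: SizeInt): WideString;
var
  written: SizeInt;
begin
  Result := '';
  if len <= 0 then
    Exit;
  SetLength(Result, len);
  written := Latin1ToWide(p, len, PWideChar(Result), len);
  SetLength(Result, written);
end;

var
  buf: array[0..3] of AnsiChar = ('a', 'b', 'c', 'd');
  s: AnsiString;
  p: PAnsiChar;
begin
  s := 'hello';
  p := 'pchar';
  WriteLn(Length(ToWide(@buf[0], Length(buf))));     // fixed-size char array
  WriteLn(Length(ToWide(PAnsiChar(s), Length(s))));  // AnsiString
  WriteLn(Length(ToWide(p, 5)));                     // PChar with a known length
end.
```

A string-based procvar would need a temporary string for the array and PChar cases, which is the overhead the pointer form avoids.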
RE: [fpc-devel] ansistrings and widestrings
> Type
>   // Length parameters are numbers of CHARS, not bytes
>   TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt; dest:pansichar; destlen:SizeInt): SizeInt;
>   TAnsi2WideMove=function(source:pansichar; srclen:SizeInt; dest:pwidechar; destlen:SizeInt): SizeInt;
>
> These functions should return the actual number of characters in the output. Returning ZERO should indicate insufficient destination size.

yes these would be workable but they seem to me to be a horrible C-ism

what's wrong with

  twidestringtoansistring=procedure(const source : widestring;var dest : ansistring);
  tansistringtowidestring=procedure(const source : ansistring;var dest : widestring);
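A minimal sketch of the string-based alternative, assuming a procvar that a unit could overwrite. WideToAnsiProc and FallbackWideToAnsi are hypothetical names, and the ISO-8859-1 fallback is just one possible default.

```pascal
program ProcVarDemo;
{$mode objfpc}{$H+}

type
  // String-based shape suggested in the message above.
  TWideStringToAnsiString = procedure(const source: WideString; var dest: AnsiString);

// Hypothetical "dumb fallback": code units up to $FF are copied as
// ISO-8859-1 bytes, everything else becomes '?'.
procedure FallbackWideToAnsi(const source: WideString; var dest: AnsiString);
var
  i: SizeInt;
begin
  SetLength(dest, Length(source));
  for i := 1 to Length(source) do
    if Ord(source[i]) <= $FF then
      dest[i] := Chr(Byte(Ord(source[i])))
    else
      dest[i] := '?';
end;

var
  // A unit offering real internationalisation could assign its own routine here.
  WideToAnsiProc: TWideStringToAnsiString;
  w: WideString;
  a: AnsiString;
begin
  WideToAnsiProc := @FallbackWideToAnsi;
  w := 'abc';
  a := '';
  WideToAnsiProc(w, a);
  WriteLn(a);
end.
```

With this shape the conversion routine chooses the length of dest itself, so the caller never has to guess a buffer size.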
Re: [fpc-devel] ansistrings and widestrings
----- Original Message -----
From: "Marco van de Voort"
To: "FPC developers' list"
Sent: Sunday, January 09, 2005 2:53 PM
Subject: Re: [fpc-devel] ansistrings and widestrings

> > This is the level where multibyte characters can come in, so that just a Character can be different from any fixed-size data type, and that the same Character can have multiple representations - remember your umlaut example? Nonetheless the rules on the Character level at least are quite well defined, so that it's possible to implement according standard procedures for comparison and conversion. Of course these procedures require parameters like the language and the encoding of the characters, so that IMO exchangeable and configurable classes are the best containers for characters.
>
> The problem with string classes is that you lose all automatism. This complicates each and every operation where new strings are created from old ones. This is what Peter was hinting at.

So it seems the best approach here is to leave the compiler-generated code for equality and comparison as a plain binary comparison of bytes (btw, that is the way Delphi does it) and to introduce a set of string handling functions that are aware of the language-dependent encoding.

For the current compiler implementation this means changing

  Type
    TWide2AnsiMove=procedure(source:pwidechar;dest:pchar;len:SizeInt);
    TAnsi2WideMove=procedure(source:pchar;dest:pwidechar;len:SizeInt);

to

  Type
    // Length parameters are numbers of CHARS, not bytes
    TWide2AnsiMove=function(source:pwidechar; srclen:SizeInt; dest:pansichar; destlen:SizeInt): SizeInt;
    TAnsi2WideMove=function(source:pansichar; srclen:SizeInt; dest:pwidechar; destlen:SizeInt): SizeInt;

These functions should return the actual number of characters in the output. Returning ZERO should indicate insufficient destination size.

On Windows, WideCharToMultiByte can return the number of characters needed in the output buffer, but LIBICONV (http://www.gnu.org/software/libiconv/, a library suited to all Unixes) doesn't allow this. So the common solution (if the result of the conversion will be stored in an AnsiString or WideString) is just to enlarge the output buffer until TWide2AnsiMove / TAnsi2WideMove returns a non-zero value.
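The "enlarge the buffer until the mover returns non-zero" idea can be sketched like this. DemoMove is a hypothetical stand-in that emits two bytes per character so the first attempt fails; it is not a real code page conversion, and WideToAnsi is an illustrative helper name.

```pascal
program GrowRetryDemo;
{$mode objfpc}{$H+}

type
  // Same shape as the proposed TWide2AnsiMove: lengths are in characters,
  // the result is the number of characters written, 0 = destination too small.
  TWide2AnsiMove = function(source: PWideChar; srclen: SizeInt;
                            dest: PAnsiChar; destlen: SizeInt): SizeInt;

// Hypothetical mover: writes each source character as two bytes
// (high byte, low byte) to imitate an encoding that can grow the text.
function DemoMove(source: PWideChar; srclen: SizeInt;
                  dest: PAnsiChar; destlen: SizeInt): SizeInt;
var
  i: SizeInt;
begin
  if destlen < srclen * 2 then
    Exit(0);                       // signal "destination too small"
  for i := 0 to srclen - 1 do
  begin
    dest[2 * i]     := Chr(Byte(Ord(source[i]) shr 8));
    dest[2 * i + 1] := Chr(Byte(Ord(source[i]) and $FF));
  end;
  Result := srclen * 2;
end;

function WideToAnsi(const s: WideString; Move: TWide2AnsiMove): AnsiString;
var
  cap, written: SizeInt;
begin
  Result := '';
  if Length(s) = 0 then
    Exit;                          // empty input: a result of 0 would be ambiguous
  cap := Length(s);                // first guess: one output char per input char
  repeat
    SetLength(Result, cap);
    written := Move(PWideChar(s), Length(s), PAnsiChar(Result), cap);
    cap := cap * 2;                // enlarge and retry when the mover returned 0
  until written > 0;
  SetLength(Result, written);      // trim to the actual output length
end;

var
  a: AnsiString;
begin
  a := WideToAnsi('abc', @DemoMove);
  WriteLn(Length(a));              // prints 6
end.
```

Because zero is the only failure signal in this scheme, a mover that fails for another reason (for instance an unconvertible character) would make the loop grow forever, so a real implementation would probably want an upper bound or a separate error code.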
Re: [fpc-devel] ansistrings and widestrings
> peter green wrote:
>
> > it should be noted that pascal classes are really not suited to doing strings.
>
> IMO we should distinguish Strings, as containers, from Text as an interpretation of data as, ahem, text of some language, in some encoding, possibly with attributes...
>
> > to do strings with classes you really need language features which fpc doesn't have.
>
> Please explain?
>
> > doing strings with non garbage collected heap based classes would make something that was as painful to work with as pchars and that was totally different from any string handling pascal has seen before.
>
> FPC has reference counted string and array types, so that GC is available.

Peter probably means that to make custom string types, you need a way to define operations and conversions. In Java and C++ this is possible afaik: in C++ because it is a template, in Java because the compiler manages the classes.

> IMO we must distinguish between the handling of Characters, Strings and Text. For the alphabets (character sets) of natural languages it should be possible to implement functions to compare and convert characters; such support often is built into the OS, for selected languages.

That's problem 1: on Unix that part of the OS exists, but is not standardised. This lack of standardisation is the main reason for avoiding linking every program to these libs.

> This is the level where multibyte characters can come in, so that just a Character can be different from any fixed-size data type, and that the same Character can have multiple representations - remember your umlaut example? Nonetheless the rules on the Character level at least are quite well defined, so that it's possible to implement according standard procedures for comparison and conversion.
> Of course these procedures require parameters like the language and the encoding of the characters, so that IMO exchangeable and configurable classes are the best containers for characters.

The problem with string classes is that you lose all automatism. This complicates each and every operation where new strings are created from old ones. This is what Peter was hinting at.

Personally, I still think it would be best to have two types of widestrings (UTF-8 and UTF-16), with automatic conversions between them. GNU is a UTF-8 world; Windows typically uses its own encoding that is more UTF-16-like.

(UTF-32 is rarely used, since afaik it is mostly for dead languages and uncommon writing styles of East Asian languages. Moreover, afaik it does not actually have the often-cited advantage of fixed-length chars: diacritic modifiers exist here too. However, since most combinations also have a formal code point, I don't know if that can be solved, e.g. by merging them.)
Re: [fpc-devel] ansistrings and widestrings
peter green wrote:

> it should be noted that pascal classes are really not suited to doing strings.

IMO we should distinguish Strings, as containers, from Text as an interpretation of data as, ahem, text of some language, in some encoding, possibly with attributes...

> to do strings with classes you really need language features which fpc doesn't have.

Please explain?

> doing strings with non garbage collected heap based classes would make something that was as painful to work with as pchars and that was totally different from any string handling pascal has seen before.

FPC has reference counted string and array types, so that GC is available.

> just as pascal doesn't consider two strings with different cases to be equal it should probably not consider two strings of unicode code points to be equal unless they are binary equivalent.

That's one of the differences between strings and text.

All comparable data types must have associated comparison functions. For numbers and strings the standard comparison functions are part of the language (operators), and usually do a simple binary compare. For other data types such operators can be defined as appropriate.

It should be noted that a comparison for anything but (strict) equality requires interpretation rules for the data types. E.g. even comparing ordinal numbers depends on the byte order of the machine; comparing strings depends on many more attributes, like mappings for upper/lower case. That's why a programming language, by itself, will supply only "primitive" string comparisons, with reasonable restrictions so that an implementation should be possible on any platform.

> conversion between ansistring and widestring should be done by functions that take one and return the other (use a const param to avoid the implicit try-finally) so that no limitations are put on how the conversion is done.

This applies to all string handling procedures. A modification of non-const string parameters opens a can of worms (aliasing...)!

> These functions should be indirected through procvars so that the default fallback versions can be replaced by versions supplied by a unit which provides proper internationalisation.

(Inter)nationalization goes far beyond any "standard" features. Dealing with natural languages IMO requires more than only dictionaries and hard-coded translation rules. Every natural language can have its own rules, e.g. for how the words in a message must be modified or rearranged when message arguments are inserted into the text.

IMO we must distinguish between the handling of Characters, Strings and Text. For the alphabets (character sets) of natural languages it should be possible to implement functions to compare and convert characters; such support often is built into the OS, for selected languages.

This is the level where multibyte characters can come in, so that just a Character can be different from any fixed-size data type, and that the same Character can have multiple representations - remember your umlaut example? Nonetheless the rules on the Character level at least are quite well defined, so that it's possible to implement according standard procedures for comparison and conversion. Of course these procedures require parameters like the language and the encoding of the characters, so that IMO exchangeable and configurable classes are the best containers for characters.

Strings can be considered as arrays of Characters, so that the string handling procedures can use the character handling procedures. Everything else, that requires more than processing a stream of individual characters, is beyond the scope of standard procedures. Here it can become problematic when a string contains words from different languages, because then an automatic detection of the language and the according rules cannot be guaranteed. That's why I hold the programmer liable for the correct description of whatever he puts into a string object.

DoDi
Re: [fpc-devel] ansistrings and widestrings
----- Original Message -----
From: "peter green"
To: "FPC developers' list"
Sent: Friday, January 07, 2005 7:24 PM
Subject: RE: [fpc-devel] ansistrings and widestrings

> it should be noted that pascal classes are really not suited to doing strings.
>
> to do strings with classes you really need language features which fpc doesn't have.
>
> doing strings with non garbage collected heap based classes would make something that was as painful to work with as pchars and that was totally different from any string handling pascal has seen before.

Yes, classes are not suitable here, but FPC already provides a mechanism to redefine string handling with Get / SetWideStringManager. This can be extended / reworked to include short, wide and ansi string comparison routines.

> just as pascal doesn't consider two strings with different cases to be equal it should probably not consider two strings of unicode code points to be equal unless they are binary equivalent.

But comparison is not only equal / non-equal, but "bigger" and "lesser" too!
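To make the "bigger and lesser" point concrete, here is a small sketch of a three-way binary comparison over WideString code units. BinaryCompare is a hypothetical helper, not an RTL routine; it only shows that binary order and dictionary order can disagree.

```pascal
program BinaryCompareDemo;
{$mode objfpc}{$H+}

// Three-way comparison by raw code unit value:
// -1 when a sorts before b, 0 when equal, 1 when a sorts after b.
function BinaryCompare(const a, b: WideString): Integer;
var
  i, n: SizeInt;
begin
  n := Length(a);
  if Length(b) < n then
    n := Length(b);
  for i := 1 to n do
    if a[i] <> b[i] then
    begin
      if Ord(a[i]) < Ord(b[i]) then
        Exit(-1)
      else
        Exit(1);
    end;
  // Common prefix is identical: the shorter string sorts first.
  if Length(a) < Length(b) then
    Result := -1
  else if Length(a) > Length(b) then
    Result := 1
  else
    Result := 0;
end;

begin
  WriteLn(BinaryCompare('Zebra', 'apple'));  // prints -1: 'Z' ($5A) < 'a' ($61)
  WriteLn(BinaryCompare('abc', 'abc'));      // prints 0
end.
```

A dictionary would list "apple" before "Zebra", so any ordering that is meant to match a language's expectations needs a routine that knows about that language, which is what a replaceable comparison hook would allow.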
RE: [fpc-devel] ansistrings and widestrings
it should be noted that pascal classes are really not suited to doing strings.

to do strings with classes you really need language features which fpc doesn't have.

doing strings with non garbage collected heap based classes would make something that was as painful to work with as pchars and that was totally different from any string handling pascal has seen before.

just as pascal doesn't consider two strings with different cases to be equal it should probably not consider two strings of unicode code points to be equal unless they are binary equivalent.

conversion between ansistring and widestring should be done by functions that take one and return the other (use a const param to avoid the implicit try-finally) so that no limitations are put on how the conversion is done. These functions should be indirected through procvars so that the default fallback versions can be replaced by versions supplied by a unit which provides proper internationalisation.

> -----Original Message-----
> From: DrDiettrich
> Sent: 07 January 2005 15:06
> To: FPC developers' list
> Subject: Re: [fpc-devel] ansistrings and widestrings
>
> Florian Klaempfl wrote:
>
> > > The only universal international representation for strings is Unicode (currently 32 bit), that doesn't require any conversions.
> >
> > That's not true. E.g. the german umlauts can be represented by 2 chars when using UTF-32 (the char and the two dots), same apply to a lot of other languages.
>
> Okay, this is where I didn't understand the difference between code points and whatsoever. Doesn't there exist, in the umlaut and accented case, a unique glyph and according code that could be used in the first place? In other languages (Arabic...) the glyph may vary with the context; here I have no idea how to compare such text, but the native writers (speakers) of such glyphs should know ;-)
>
> > Encoding isn't the main problem, you need dedicated procedures and functions for unicode comparison, upper/lower conversion etc.
>
> Agreed, these will become the string class methods. It may be necessary to partition Unicode into code pages, with different methods for comparison etc.
>
> In the worst case, if we cannot find or agree about a so-far unique representation for text, an "uncomparable" value has to become a valid result of a comparison.
>
> > To achieve this platform independent is very hard ...
>
> How that? I agree that here the existence of definitely compatible/portable OS services is not guaranteed. But when the methods have to be implemented for platforms that do not have such services at all, then these implementations can be used on all other platforms as well.
>
> All in all I'd say that we do not intend to implement a text processing or translation system. What we can do is to define a string or text class that contains text in a well defined form, for processing with all specified methods. The key point is the import of text into an object of any such class. If no appropriate class has been implemented, the import is simply impossible. Inside, i.e. between these classes, all the methods should work. Perhaps with graceful "uncomparable" or "unconvertible" results, when somebody insists on using incompletely implemented classes.
>
> We don't want the impossible, the doable will be sufficient ;-)
>
> DoDi
Re: [fpc-devel] ansistrings and widestrings
Florian Klaempfl wrote:

> > The only universal international representation for strings is Unicode (currently 32 bit), that doesn't require any conversions.
>
> That's not true. E.g. the german umlauts can be represented by 2 chars when using UTF-32 (the char and the two dots), same apply to a lot of other languages.

Okay, this is where I didn't understand the difference between code points and whatsoever. Doesn't there exist, in the umlaut and accented case, a unique glyph and according code that could be used in the first place? In other languages (Arabic...) the glyph may vary with the context; here I have no idea how to compare such text, but the native writers (speakers) of such glyphs should know ;-)

> Encoding isn't the main problem, you need dedicated procedures and functions for unicode comparison, upper/lower conversion etc.

Agreed, these will become the string class methods. It may be necessary to partition Unicode into code pages, with different methods for comparison etc.

In the worst case, if we cannot find or agree about a so-far unique representation for text, an "uncomparable" value has to become a valid result of a comparison.

> To achieve this platform independent is very hard ...

How that? I agree that here the existence of definitely compatible/portable OS services is not guaranteed. But when the methods have to be implemented for platforms that do not have such services at all, then these implementations can be used on all other platforms as well.

All in all I'd say that we do not intend to implement a text processing or translation system. What we can do is to define a string or text class that contains text in a well defined form, for processing with all specified methods. The key point is the import of text into an object of any such class. If no appropriate class has been implemented, the import is simply impossible. Inside, i.e. between these classes, all the methods should work. Perhaps with graceful "uncomparable" or "unconvertible" results, when somebody insists on using incompletely implemented classes.

We don't want the impossible, the doable will be sufficient ;-)

DoDi
Re: [fpc-devel] ansistrings and widestrings
DrDiettrich wrote:

> peter green wrote:
>
> > ok i see a MAJOR problem with the semantics of those functions.
> >
> > they assume that one widechar is equivalent to one ansichar (that is the source count of widechars will equal the destination count of ansichars or the source count of ansichars will equal the destination count of widechars).
> >
> > this is simply not the case for many encodings. (utf-8 sjis euc to name just a few)
>
> I came across such problems in another project (CrossPoint). IMO the best solution is a separation into true fixed-char strings (1, 2, 4? byte/char), and a true string class for more general encodings. The string class(es) then also can include proper support for code pages, MBCS, 7-bit codes, MIME etc.
>
> The only universal international representation for strings is Unicode (currently 32 bit), that doesn't require any conversions.

That's not true. E.g. the german umlauts can be represented by 2 chars when using UTF-32 (the char and the two dots), same apply to a lot of other languages.

> UTF and other

UTF-8 is unicode as well; unicode is a standard which describes char mappings and encodings besides other things.

> encodings can save memory, but only at the cost of runtime overhead, that's why I'd wrap these into classes. Delphi uses AnsiString for both single and multi byte character strings, and I'm not sure whether WideChar (as used by Windows) is Unicode-16 or UTF-16. In international applications (mail!) the handling of such strings can become a mess, when the assumptions about the encoding of some string (code page...) don't hold. When consequently records are used to hold strings together with an indication of the actual encoding, then a dedicated standard string class would be a better solution.

Encoding isn't the main problem, you need dedicated procedures and functions for unicode comparison, upper/lower conversion etc. To achieve this platform independent is very hard ...
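The umlaut point can be illustrated with plain arrays of 32-bit code points. This sketch assumes nothing beyond the Unicode values in the comments; TCodePoints and SameCodePoints are hypothetical names. It shows that the precomposed and the decomposed spelling of "ä" are different sequences, so a binary comparison treats them as unequal even with fixed-width code points.

```pascal
program ComposedDemo;
{$mode objfpc}

type
  TCodePoints = array of LongWord;

// Binary equality: same length and the same code point at every position.
function SameCodePoints(const a, b: TCodePoints): Boolean;
var
  i: Integer;
begin
  Result := Length(a) = Length(b);
  if not Result then
    Exit;
  for i := 0 to High(a) do
    if a[i] <> b[i] then
      Exit(False);
end;

var
  precomposed, decomposed: TCodePoints;
begin
  SetLength(precomposed, 1);
  precomposed[0] := $00E4;   // LATIN SMALL LETTER A WITH DIAERESIS
  SetLength(decomposed, 2);
  decomposed[0] := $0061;    // LATIN SMALL LETTER A
  decomposed[1] := $0308;    // COMBINING DIAERESIS
  WriteLn(SameCodePoints(precomposed, decomposed));  // prints FALSE
end.
```

Treating the two spellings as the same text requires a normalisation step before comparing, which is the kind of dedicated routine the message above asks for.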
Re: [fpc-devel] ansistrings and widestrings
peter green wrote:

> ok i see a MAJOR problem with the semantics of those functions.
>
> they assume that one widechar is equivalent to one ansichar (that is the source count of widechars will equal the destination count of ansichars or the source count of ansichars will equal the destination count of widechars).
>
> this is simply not the case for many encodings. (utf-8 sjis euc to name just a few)

I came across such problems in another project (CrossPoint). IMO the best solution is a separation into true fixed-char strings (1, 2, 4? byte/char), and a true string class for more general encodings. The string class(es) then also can include proper support for code pages, MBCS, 7-bit codes, MIME etc.

The only universal international representation for strings is Unicode (currently 32 bit), that doesn't require any conversions. UTF and other encodings can save memory, but only at the cost of runtime overhead, that's why I'd wrap these into classes. Delphi uses AnsiString for both single and multi byte character strings, and I'm not sure whether WideChar (as used by Windows) is Unicode-16 or UTF-16. In international applications (mail!) the handling of such strings can become a mess, when the assumptions about the encoding of some string (code page...) don't hold. When consequently records are used to hold strings together with an indication of the actual encoding, then a dedicated standard string class would be a better solution.

DoDi
Re: [fpc-devel] ansistrings and widestrings
Peter Vreman wrote:

> > in windows terminology (which i presume is where the name ansistring comes from) the windows code page which is often referred to in documentation as the ansi code page CAN be multi byte.
> >
> > http://www.microsoft.com/globaldev/reference/WinCP.mspx
> >
> > more generally i believe an ansistring is usually intended to represent text in the platform's local encoding, whilst a widestring is meant to represent text in utf-16.
> >
> > The platform's local encoding may be a single byte encoding (iso-8859-? windows-125? etc), it may be a legacy mixed width encoding (EUC-?? SHIFT-JIS BIG5 etc) or it may be a unicode transformation format which is a superset of ascii (utf-8).
> >
> > now for dependency reasons i believe that the default conversion functions should remain a "dumb fallback" BUT i also believe that the function prototypes should be designed in such a way as to allow the conversion routines to be replaced with ones that can sensibly handle the local encoding.
> >
> > i've created a page on the wiki for this issue at http://www.freepascal.org/wiki/index.php/Widestrings
>
> You are welcome to supply patches that fix the prototypes and new units that support more encoding/decoding routines.

I think we should introduce a class widestringmanager :) Lower, upper, comparing etc. also need to take care of unicode encodings.
RE: [fpc-devel] ansistrings and widestrings
> in windows terminology (which i presume is where the name ansistring comes from) the windows code page which is often referred to in documentation as the ansi code page CAN be multi byte.
>
> http://www.microsoft.com/globaldev/reference/WinCP.mspx
>
> more generally i believe an ansistring is usually intended to represent text in the platform's local encoding, whilst a widestring is meant to represent text in utf-16.
>
> The platform's local encoding may be a single byte encoding (iso-8859-? windows-125? etc), it may be a legacy mixed width encoding (EUC-?? SHIFT-JIS BIG5 etc) or it may be a unicode transformation format which is a superset of ascii (utf-8).
>
> now for dependency reasons i believe that the default conversion functions should remain a "dumb fallback" BUT i also believe that the function prototypes should be designed in such a way as to allow the conversion routines to be replaced with ones that can sensibly handle the local encoding.
>
> i've created a page on the wiki for this issue at http://www.freepascal.org/wiki/index.php/Widestrings

You are welcome to supply patches that fix the prototypes and new units that support more encoding/decoding routines.
RE: [fpc-devel] ansistrings and widestrings
> PPS. AFAIK UTF-8 is not used internally in any OS - it's only used for storing UNICODE text in more compact form - web site authors really like it.

i believe a lot of linux distros are switching to it for the console, at least for less common languages. i don't know how gui stuff on linux handles text.

The windows routines for going from utf-16 to local codesets and back can also go from utf-16 to utf-7 and utf-8 and back, but i don't think windows itself actually makes any real use of those encodings.

UTF-8 is smaller than UTF-16 in some cases, larger in others and about the same in still others; it largely depends on what code points dominate the text. An appropriate national encoding will usually beat both of them if it can represent the needed code points.

  code points mainly in    utf-8     utf-16    utf-32
  $000000-$00007F          1 byte    2 bytes   4 bytes
  $000080-$0007FF          2 bytes   2 bytes   4 bytes
  $000800-$00FFFF          3 bytes   2 bytes   4 bytes
  $010000-$10FFFF          4 bytes   4 bytes   4 bytes

the net result is that utf-8 tends to win for largely latin languages, UTF-16 tends to win for largely ideographic languages, and they are about on a par for everything else. utf-32 nearly always loses to both (though it does have a large spare codespace which can be used for special meanings internal to the app).

the main advantages of utf-8 over utf-16 are

1: it is a superset of 7 bit ascii
2: it's not peppered with 0 bytes.
3: any character can ONLY be represented by 1 byte pattern and that byte pattern can ONLY represent that character (it can't be a part of another character)
4: it's easy to resync a badly cut/joined stream (if you cut a utf-16 stream in the middle of a character one of the pieces will be total garbage).

the net result is that most code designed to deal with "ascii with extensions" can be fed utf-8 and it will usually work fine or only require minimal changes.

i still believe that the best way to handle ansistring<-->widestring conversion is to use a fallback conversion (either 7 bit ascii or iso-8859-1) by default and then provide units that override the conversion with versions based on the local charset of the environment or a charset specified by the application coder.

Unfortunately, as i have said, whilst there is an interface in place for overriding the conversion, it is currently only usable where the local encoding is single byte rather than mixed width.
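The per-code-point sizes discussed in the message above can be checked with a short sketch. Utf8Bytes and Utf16Bytes are hypothetical helper names; the thresholds are the standard UTF-8 and UTF-16 boundaries, and surrogate values are not treated specially here.

```pascal
program EncodedSizeDemo;
{$mode objfpc}

// Bytes needed to encode one Unicode code point (cp <= $10FFFF) in UTF-8.
function Utf8Bytes(cp: LongWord): Integer;
begin
  if cp <= $7F then
    Result := 1
  else if cp <= $7FF then
    Result := 2
  else if cp <= $FFFF then
    Result := 3
  else
    Result := 4;
end;

// Bytes needed in UTF-16: one 16-bit unit inside the BMP, a surrogate pair above it.
function Utf16Bytes(cp: LongWord): Integer;
begin
  if cp <= $FFFF then
    Result := 2
  else
    Result := 4;
end;

begin
  // Columns: UTF-8 bytes, UTF-16 bytes (UTF-32 is always 4).
  WriteLn(Utf8Bytes($0041), ' ', Utf16Bytes($0041));    // 'A'      -> 1 2
  WriteLn(Utf8Bytes($00E4), ' ', Utf16Bytes($00E4));    // a-umlaut -> 2 2
  WriteLn(Utf8Bytes($4E2D), ' ', Utf16Bytes($4E2D));    // CJK char -> 3 2
  WriteLn(Utf8Bytes($10000), ' ', Utf16Bytes($10000));  // non-BMP  -> 4 4
end.
```

Summing these per-character costs over a sample text is a quick way to see which encoding is smaller for a given language mix.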
Re: [fpc-devel] ansistrings and widestrings
Firstly: I agree that Wide2AnsiMoveProc and Ansi2WideMoveProc should take the size of the resulting string.

Next: I was wrong about ansistrings. On Windows they (PCHARs) were used (until WinNT arrived) in Far East localized versions, coupled with multibyte encodings. So currently, for legacy applications, multibyte encoded character sets are supported on any WinNT box.

PS. I hope my patch (bug 3451) extending WideString support in the compiler will finally be applied to CVS and we can proceed with RTL modifications to support more extended ansi to wide string conversions. ;-)

PPS. AFAIK UTF-8 is not used internally in any OS - it's only used for storing UNICODE text in a more compact form - web site authors really like it.

----- Original Message -----
From: "peter green"
To: "FPC developers' list"
Sent: Thursday, January 06, 2005 12:19 AM
Subject: RE: [fpc-devel] ansistrings and widestrings

> in windows terminology (which i presume is where the name ansistring comes from) the windows code page which is often referred to in documentation as the ansi code page CAN be multi byte.
>
> http://www.microsoft.com/globaldev/reference/WinCP.mspx
>
> more generally i believe an ansistring is usually intended to represent text in the platform's local encoding, whilst a widestring is meant to represent text in utf-16.
>
> The platform's local encoding may be a single byte encoding (iso-8859-? windows-125? etc), it may be a legacy mixed width encoding (EUC-?? SHIFT-JIS BIG5 etc) or it may be a unicode transformation format which is a superset of ascii (utf-8).
>
> now for dependency reasons i believe that the default conversion functions should remain a "dumb fallback" BUT i also believe that the function prototypes should be designed in such a way as to allow the conversion routines to be replaced with ones that can sensibly handle the local encoding.
>
> i've created a page on the wiki for this issue at http://www.freepascal.org/wiki/index.php/Widestrings
RE: [fpc-devel] ansistrings and widestrings
in windows terminology (which i presume is where the name ansistring comes from) the windows code page which is often referred to in documentation as the ansi code page CAN be multi byte.

http://www.microsoft.com/globaldev/reference/WinCP.mspx

more generally i believe an ansistring is usually intended to represent text in the platform's local encoding, whilst a widestring is meant to represent text in utf-16.

The platform's local encoding may be a single byte encoding (iso-8859-? windows-125? etc), it may be a legacy mixed width encoding (EUC-?? SHIFT-JIS BIG5 etc) or it may be a unicode transformation format which is a superset of ascii (utf-8).

now for dependency reasons i believe that the default conversion functions should remain a "dumb fallback" BUT i also believe that the function prototypes should be designed in such a way as to allow the conversion routines to be replaced with ones that can sensibly handle the local encoding.

i've created a page on the wiki for this issue at http://www.freepascal.org/wiki/index.php/Widestrings

> -----Original Message-----
> From: Alexey Barkovoy
> Sent: 05 January 2005 19:55
> To: FPC developers' list
> Subject: Re: [fpc-devel] ansistrings and widestrings
>
> Well, the functions are called ANSI to unicode and vice versa. ANSI is always single byte; by unicode people usually refer to UTF-16, not a multibyte encoding, and both Delphi and FPC define WideString as double-byte strings. So semantically the functions do what is required. IMHO, when assigning a widestring to an ansistring no one should expect a multibyte encoded result. When you need UTF-8 you should call special functions.
>
> ----- Original Message -----
> From: "peter green"
> To: "FPC developers' list"
> Sent: Wednesday, January 05, 2005 8:32 PM
> Subject: RE: [fpc-devel] ansistrings and widestrings
>
> > ok i see a MAJOR problem with the semantics of those functions.
> >
> > they assume that one widechar is equivalent to one ansichar (that is the source count of widechars will equal the destination count of ansichars or the source count of ansichars will equal the destination count of widechars).
> >
> > this is simply not the case for many encodings. (utf-8 sjis euc to name just a few)
> >
> > > -----Original Message-----
> > > From: Michael Van Canneyt
> > > Sent: 05 January 2005 16:11
> > > To: FPC developers' list
> > > Subject: RE: [fpc-devel] ansistrings and widestrings
> > >
> > > On Wed, 5 Jan 2005, peter green wrote:
> > >
> > > > where are these default versions located in the code?
> > >
> > > In the inc directory of the rtl. wstrings.inc
> > >
> > >   procedure Wide2AnsiMove(source:pwidechar;dest:pchar;len:SizeInt);
> > >   procedure Ansi2WideMove(source:pchar;dest:pwidechar;len:SizeInt);
Re: [fpc-devel] ansistrings and widestrings
Well, the functions are called ANSI to Unicode and vice versa. ANSI is always single byte; by Unicode people usually refer to UTF-16, not a multibyte encoding, and both Delphi and FPC define WideString as double-byte strings. So semantically the functions do what is required. IMHO, when assigning a widestring to an ansistring no one should expect a multibyte encoded result. When you need UTF-8 you should call special functions.

----- Original Message -----
From: "peter green" <[EMAIL PROTECTED]>
To: "FPC developers' list"
Sent: Wednesday, January 05, 2005 8:32 PM
Subject: RE: [fpc-devel] ansistrings and widestrings

> OK, I see a MAJOR problem with the semantics of those functions.
>
> They assume that one widechar is equivalent to one ansichar (that is, the
> source count of widechars will equal the destination count of ansichars, or
> the source count of ansichars will equal the destination count of
> widechars).
>
> This is simply not the case for many encodings (UTF-8, SJIS and EUC, to
> name just a few).
RE: [fpc-devel] ansistrings and widestrings
OK, I see a MAJOR problem with the semantics of those functions.

They assume that one widechar is equivalent to one ansichar (that is, the source count of widechars will equal the destination count of ansichars, or the source count of ansichars will equal the destination count of widechars).

This is simply not the case for many encodings (UTF-8, SJIS and EUC, to name just a few).

> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of Michael Van Canneyt
> Sent: 05 January 2005 16:11
> To: FPC developers' list
> Subject: RE: [fpc-devel] ansistrings and widestrings
>
> On Wed, 5 Jan 2005, peter green wrote:
>
> > Where are these default versions located in the code?
>
> In the inc directory of the rtl. wstrings.inc
>
> procedure Wide2AnsiMove(source:pwidechar;dest:pchar;len:SizeInt);
> procedure Ansi2WideMove(source:pchar;dest:pwidechar;len:SizeInt);
>
> Michael.
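The count mismatch raised in this message can be checked with a few lines of Free Pascal. This is only a sketch: it assumes a reasonably recent FPC in which the System unit provides UTF8Encode and RawByteString, and it takes UTF-8 as one example of a multibyte target encoding. The sample text is built from code points so that the source file's own encoding does not matter.

```pascal
program CountMismatch;
{$mode objfpc}{$H+}

var
  w: WideString;
  u: RawByteString;
begin
  // Three UTF-16 code units: 'G', U+00FC (u with diaeresis), U+65E5 (a CJK character).
  SetLength(w, 3);
  w[1] := WideChar($0047);
  w[2] := WideChar($00FC);
  w[3] := WideChar($65E5);

  // Explicit UTF-16 -> UTF-8 conversion; Length(u) counts bytes.
  u := UTF8Encode(w);

  WriteLn('UTF-16 code units: ', Length(w));
  WriteLn('UTF-8 bytes:       ', Length(u));
end.
```

The three code units need 1 + 2 + 3 = 6 bytes in UTF-8, so a conversion routine that receives a single len parameter for both buffers has no way to express the size of the output.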
RE: [fpc-devel] ansistrings and widestrings
I found something slightly worrying in that code:

  @-8 : SizeInt for reference count;
  @-4 : SizeInt for size;
  @   : String + Terminating #0;

A SizeInt isn't always 4 bytes!!

> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of Michael Van Canneyt
> Sent: 05 January 2005 16:11
> To: FPC developers' list
> Subject: RE: [fpc-devel] ansistrings and widestrings
>
> On Wed, 5 Jan 2005, peter green wrote:
>
> > Where are these default versions located in the code?
>
> In the inc directory of the rtl. wstrings.inc
>
> procedure Wide2AnsiMove(source:pwidechar;dest:pchar;len:SizeInt);
> procedure Ansi2WideMove(source:pchar;dest:pwidechar;len:SizeInt);
>
> Michael.
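To see why the fixed @-8 / @-4 offsets in that comment are a concern, the offsets can be derived from SizeOf(SizeInt) rather than hard-coded. The record below is a hypothetical stand-in that mirrors only the layout described in the comment (reference count, then size, then the character data); it is not the RTL's actual declaration.

```pascal
program HeaderOffsets;
{$mode objfpc}{$H+}

type
  // Hypothetical header: two SizeInt fields placed directly before the data.
  TSketchStrHeader = packed record
    RefCount: SizeInt;
    Size: SizeInt;
  end;

begin
  WriteLn('SizeOf(SizeInt)  = ', SizeOf(SizeInt));
  // Offsets of the fields, counted back from the first character.
  WriteLn('ref count offset = -', SizeOf(TSketchStrHeader));
  WriteLn('size offset      = -', SizeOf(TSketchStrHeader) - SizeOf(SizeInt));
end.
```

On a target where SizeInt is 4 bytes this prints offsets of -8 and -4, matching the comment; where SizeInt is 8 bytes the same layout gives -16 and -8.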
RE: [fpc-devel] ansistrings and widestrings
On Wed, 5 Jan 2005, peter green wrote:

> Where are these default versions located in the code?

In the inc directory of the rtl. wstrings.inc

procedure Wide2AnsiMove(source:pwidechar;dest:pchar;len:SizeInt);
procedure Ansi2WideMove(source:pchar;dest:pwidechar;len:SizeInt);

Michael.
RE: [fpc-devel] ansistrings and widestrings
Where are these default versions located in the code?

> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of Alexey Barkovoy
> Sent: 05 January 2005 07:36
> To: FPC developers' list
> Subject: Re: [fpc-devel] ansistrings and widestrings
>
> > If I do ansistringvar := widestringvar or widestringvar := ansistringvar,
> > what does the compiler do?
> >
> > 1: use the system's default encoding (if so, obtained from where?)
> > 2: use utf-8
> > 3: use iso-8859-1
> > 4: use something else?
> >
> > Furthermore, if the encoding used is one not capable of representing all
> > Unicode code points, what are the reduction rules used in the conversion
> > from widestring to ansistring?
>
> The compiler internally uses the Wide2AnsiMoveProc and Ansi2WideMoveProc
> functions, which can be reassigned by the user. Currently they just map the
> lower 128 chars of one representation to the other. On Windows there are
> system functions that can be used to do the conversion: MultiByteToWideChar
> and WideCharToMultiByte. These functions can take into account a specified
> or globally set code page.
Re: [fpc-devel] ansistrings and widestrings
> If I do ansistringvar := widestringvar or widestringvar := ansistringvar,
> what does the compiler do?
>
> 1: use the system's default encoding (if so, obtained from where?)
> 2: use utf-8
> 3: use iso-8859-1
> 4: use something else?
>
> Furthermore, if the encoding used is one not capable of representing all
> Unicode code points, what are the reduction rules used in the conversion
> from widestring to ansistring?

The compiler internally uses the Wide2AnsiMoveProc and Ansi2WideMoveProc functions, which can be reassigned by the user. Currently they just map the lower 128 chars of one representation to the other. On Windows there are system functions that can be used to do the conversion: MultiByteToWideChar and WideCharToMultiByte. These functions can take into account a specified or globally set code page.
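As a rough illustration of the mechanism described in this reply, the sketch below models a reassignable conversion hook using local stand-ins. The names TSketchWide2AnsiMove, SketchWide2AnsiMoveProc, DumbWide2AnsiMove, Latin1Wide2AnsiMove and Convert are all hypothetical; only the three-parameter shape is taken from the wstrings.inc procedures quoted elsewhere in the thread. Substituting '?' for unmappable characters is also an assumption, since the thread does not say what the default does with them.

```pascal
program MoveProcSketch;
{$mode objfpc}{$H+}

type
  // Same parameter shape as the procedures quoted from wstrings.inc.
  TSketchWide2AnsiMove = procedure(source: PWideChar; dest: PAnsiChar; len: SizeInt);

// Fallback: copy code units below 128, replace everything else with '?'.
procedure DumbWide2AnsiMove(source: PWideChar; dest: PAnsiChar; len: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to len - 1 do
    if Ord(source[i]) < 128 then
      dest[i] := AnsiChar(Byte(Ord(source[i])))
    else
      dest[i] := '?';
end;

// Replacement: also maps U+0080..U+00FF, i.e. targets ISO-8859-1.
procedure Latin1Wide2AnsiMove(source: PWideChar; dest: PAnsiChar; len: SizeInt);
var
  i: SizeInt;
begin
  for i := 0 to len - 1 do
    if Ord(source[i]) < 256 then
      dest[i] := AnsiChar(Byte(Ord(source[i])))
    else
      dest[i] := '?';
end;

var
  // Stand-in for the reassignable hook variable.
  SketchWide2AnsiMoveProc: TSketchWide2AnsiMove;

// Converts through whichever routine is currently installed.
function Convert(const w: WideString): AnsiString;
begin
  Result := '';
  SetLength(Result, Length(w));
  if Length(w) > 0 then
    SketchWide2AnsiMoveProc(PWideChar(w), PAnsiChar(Result), Length(w));
end;

var
  w: WideString;
begin
  SetLength(w, 2);
  w[1] := WideChar($0041);  // 'A'
  w[2] := WideChar($00FC);  // u with diaeresis

  SketchWide2AnsiMoveProc := @DumbWide2AnsiMove;
  WriteLn('fallback, byte 2: ', Ord(Convert(w)[2]));

  SketchWide2AnsiMoveProc := @Latin1Wide2AnsiMove;
  WriteLn('latin-1,  byte 2: ', Ord(Convert(w)[2]));
end.
```

With the fallback installed the second byte is 63 ('?'); after the hook is reassigned it is 252, the ISO-8859-1 code for the same character. Both routines still write exactly len bytes, so this shape cannot serve encodings where the output length differs from the input length.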
[fpc-devel] ansistrings and widestrings
If I do ansistringvar := widestringvar or widestringvar := ansistringvar, what does the compiler do?

1: use the system's default encoding (if so, obtained from where?)
2: use utf-8
3: use iso-8859-1
4: use something else?

Furthermore, if the encoding used is one not capable of representing all Unicode code points, what are the reduction rules used in the conversion from widestring to ansistring?