Re: [fpc-devel] Unicode support - for the 20th time... ;-)
In our previous episode, Daniël Mantione said:
> Full Unicode support is for FPC 2.4. If you need it today, widestrings are your best option.

Is it? Because that might mean yet another 2.2 fixes branch release to make up for the delay that this will cause to 2.4.

___ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
In our previous episode, Florian Klaempfl said:
> They add it only because they insist on using utf-8 :)

That's perfectly normal on *nix.
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
On Fri, 21 Nov 2008, Marco van de Voort wrote:
> In our previous episode, Daniël Mantione said:
> > Full Unicode support is for FPC 2.4. If you need it today, widestrings are your best option.
>
> Is it? Because that might mean yet another 2.2 fixes branch release to fix up the delay that this will cause to 2.4

People were complaining about the current FPC, not being aware of the new UTF-16 string type in FPC 2.3. Perhaps it does need postponing to an even later release, but at any rate Unicode support should not be expected for 2.2.

Daniël
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
In our previous episode, Daniël Mantione said:
> If you want to help, we need to implement the Delphi 2009 encoding aware string type, both runtime support as well as the compiler support.
>
> A previous discussion showed that this also breaks a lot of old code and is not really nice.
>
> As I understand it, the incompatibility from Delphi 2009 comes from the fact that char and string by default have become 2 bytes, not from adding encoding information to the type.

Yes. The added strings are perfectly compatible (ansistring for 1-byte encodings, unicodestring for 2-byte encodings). The incompatible part comes from switching the default string/char type to the 2-byte variants.

> So a better concept seems to have a dedicated type for any possible Coding (ANSISTring of course locale-depending, UTF8String, UTF16String, maybe UCS2String, too) and let the user choose (e.g. by a {$ compiler option) which one he want to be used for String and WideString. This would allow for simple compiler magic to perform any necessary conversion (including assigning constants).
>
> Isn't this the same??

Typing-wise it is the same (except for the nonexistence of UCS2); implementation-wise it is not. But Michael obviously hasn't read the PDF.
Re: [fpc-devel] Unicode and Lazarus
On Thu, Nov 20, 2008 at 9:05 AM, Daniël Mantione wrote:
> On the other hand Lazarus may want to move to a string depending on platform too, to attract both Delphi < 2009 and Delphi >= 2009 users.

I don't see this change happening any time soon, because it would break too much existing code. If Delphi 2009 becomes really popular and there is a large need to migrate projects to Lazarus, we may think of a solution.

At the moment a fully working UTF8String is what we need.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support in RTL - Roadmap
http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support

This page does not talk about UTF8Strings being counted in code elements vs. in code points. I don't consider it settled that they are in any case counted in code elements.

IMHO this should be seriously discussed, and a solution should be found that lets the user select either way, so that they can either write fast code or avoid breaking old code.

-Michael
Re: [fpc-devel] Re: Unicode and Lazarus
On Thu, Nov 20, 2008 at 9:09 AM, Mattias Gärtner wrote:
> So the roadmap from LCL pov is:
> - a RTL using unicode strings
> - changing the string types in the lazarus code
> - a fpc release with the unicode RTL

From what I've heard about the Unicode RTL, the fpc developers recommend that we build our own set of routines/classes using UTF8String. Later they could be added to Free Pascal. This is more or less what we are doing at the moment, so we should just continue in the same direction.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode and Lazarus
On Thu, Nov 20, 2008 at 9:05 AM, Daniël Mantione wrote:
> There will be a real UTF8string, i.e. ansistring with UTF-8 encoding as part of type information, this will help Lazarus users to get rid of the utf8encode/utf8decode.

When? Is this planned for 2.4?

thanks,
--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode and Lazarus
On Fri, 21 Nov 2008, Felipe Monteiro de Carvalho wrote:
> On Thu, Nov 20, 2008 at 9:05 AM, Daniël Mantione wrote:
> > There will be a real UTF8string, i.e. ansistring with UTF-8 encoding as part of type information, this will help Lazarus users to get rid of the utf8encode/utf8decode.
>
> When? Is this planned for 2.4?

See Marco's e-mail [EMAIL PROTECTED] from 10:21.

Daniël
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 7:30 AM, Florian Klaempfl wrote:
> This is easily said, please create examples and descriptions how "fully working" is defined.

// Should actually convert from widestring to utf-8 when using encoding utf-8
program utf8test1;
{$encoding utf-8} // or is it utf8?
var
  Str: UTF8String;
begin
  Str := 'ção';
  if Length(Str) = 5 then
    Success
  else
    Fail;
end.

// Should work on all platforms. Passing the UTF8String to a routine that requires
// ansistring should do the proper conversion
program utf8test2;
{$encoding utf-8} // or is it utf8?
var
  Str: UTF8String;
begin
  Str := 'ção';
  WriteLn(Str);
end.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
On Fri, Nov 21, 2008 at 7:01 AM, Marco van de Voort wrote:
> Is it? Because that might mean yet another 2.2 fixes branch release to fix up the delay that this will cause to 2.4

Another 2.2 fixes branch release is a good idea, because it contains a fix for static methods which is necessary for Cocoa projects.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
Felipe Monteiro de Carvalho wrote:
> On Fri, Nov 21, 2008 at 7:01 AM, Marco van de Voort wrote:
> > Is it? Because that might mean yet another 2.2 fixes branch release to fix up the delay that this will cause to 2.4
>
> Another 2.2 fixes branch release is a good idea, because it contains a fix for static methods which is necessary for Cocoa projects.

When Marco said "yet another 2.2 fixes branch release", he meant 2.2.6.

Vincent
Re: [fpc-devel] Unicode support in RTL - Roadmap
On Fri, Nov 21, 2008 at 7:30 AM, Michael Schnell wrote:
> This page does not talk about UTF8Strings being counted in code elements vs in code points. I don't consider it understood that they in any case are counted in code elements. IMHO this should be seriously discussed and a solution should be found that the user can select either way to be able to do either fast code or not break old code.

I prefer it to be counted in bytes. If it is counted in bytes, then I can build a routine that counts in real chars. And we already have a lot of code that handles utf-8 inside ansistring and depends on that.

Counting the elements in real chars is very inefficient.

--
Felipe Monteiro de Carvalho
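The routine described above, counting "real chars" on top of a byte-counted string, can be sketched roughly as follows. This is only an illustration under stated assumptions: `UTF8CodePointCount` is a hypothetical name (not an RTL or LCL routine), the text is held in a plain `AnsiString` so that `Length` counts bytes, the input is assumed to be valid UTF-8, and the literal is written as explicit bytes so the source file encoding does not matter.

```pascal
program utf8count;
{$mode objfpc}{$H+}

{ Hypothetical helper, not part of the RTL: counts code points by
  skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).
  Assumes S holds valid UTF-8. }
function UTF8CodePointCount(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  S: AnsiString;
begin
  { 'ção' as explicit UTF-8 bytes: C3 A7, C3 A3, 6F }
  S := #$C3#$A7#$C3#$A3 + 'o';
  WriteLn('bytes:       ', Length(S));
  WriteLn('code points: ', UTF8CodePointCount(S));
  { Exit code is non-zero if either count is unexpected }
  if (Length(S) <> 5) or (UTF8CodePointCount(S) <> 3) then
    Halt(1);
end.
```

The count is a linear scan over the whole string, which is the inefficiency mentioned above, whereas `Length` on the byte-counted string is a constant-time lookup.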
Re: [fpc-devel] Unicode and Lazarus
> // Should actually convert from widestring to utf-8 when using encoding utf-8
> program utf8test1;

In fact it should automatically convert (as correctly as possible) between all available string types (ANSI, UTF8, UTF16).

It should provide appropriate char types for all available string types.

There should be a user-selectable way of element counting in UTF strings (code element counting or code point counting).

The RTL would need to provide the appropriate objects (e.g. StringList) for all available string types.

-Michael
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
On Fri, Nov 21, 2008 at 7:43 AM, Vincent Snijders wrote:
> When Marco said yet another 2.2 fixes branch release, he meant 2.2.6.

Ah, ok ... =)

So my comment would then be changed to: Unicode is what is most discussed and needed at the moment. What is the point in making a major release without any major change? For me it doesn't matter if it takes time.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode and Lazarus
Felipe Monteiro de Carvalho wrote:
> On Fri, Nov 21, 2008 at 7:30 AM, Florian Klaempfl wrote:
> > This is easily said, please create examples and descriptions how "fully working" is defined.
>
> // Should actually convert from widestring to utf-8 when using encoding utf-8
> program utf8test1;
> {$encoding utf-8} // or is it utf8?
> var
>   Str: UTF8String;
> begin
>   Str := 'ção';
>   if Length(Str) = 5 then
>     Success
>   else
>     Fail;
> end.
>
> // Should work on all platforms. Passing the UTF8String to a routine that requires
> // ansistring should do the proper conversion
> program utf8test2;
> {$encoding utf-8} // or is it utf8?
> var
>   Str: UTF8String;
> begin
>   Str := 'ção';
>   WriteLn(Str);
> end.

Big deal. I simply enable operator overloading for unique string types to get this working, then everybody is happy and we have unicode support?
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 7:30 AM, Florian Klaempfl wrote:
> This is easily said, please create examples and descriptions how "fully working" is defined.

It would be really good if there was a guide, preferably in the wiki, to explain how to add a new test case to Free Pascal.

I already have some test cases in mind, like making sure static methods compile (an error in 2.2.2) and then, after some discussion, the utf-8 test cases. At the moment I can't add the test cases because I don't know how to.

thanks,
--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support in RTL - Roadmap
> I prefer it to be counted in bytes. If it is counted in Bytes then I can build a routine that counts in real chars. And we already have a lot of code to handle utf-8 inside ansistring which depends on that. Counting the elements in real chars is very inefficient.

This is commonly agreed. But counting in code elements breaks old code, and counting in code points is sometimes more handy.

That is why I vote for making the default syntax (s[i], pos(), copy(), ...) user selectable, while of course providing dedicated functions for both flavors.

-Michael
Re: [fpc-devel] Unicode and Lazarus
Felipe Monteiro de Carvalho wrote:
> On Fri, Nov 21, 2008 at 7:30 AM, Florian Klaempfl wrote:
> > This is easily said, please create examples and descriptions how "fully working" is defined.
>
> It would be really good if there was a guide, preferably in the wiki, to explain how to add a new test case to Free Pascal. I have already some test cases in mind, like making sure static methods compile (an error in 2.2.2)

I'm quite sure I made a test case when I fixed it.

> and then after some discussion the utf-8 test cases. At the moment I can't add the test cases because I don't know how to.

Just create a program which returns 0 if everything was OK and another value if it fails, and attach it to a bug report. The program may only depend on FPC units, not LCL or anything else.
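A test program following that convention might look like the sketch below. The program name `tutf8len` and the function `RunTest` are made up for illustration, and the check itself (byte length of a UTF-8 encoded literal held in an `AnsiString`) is just a stand-in for whatever behaviour a real test would pin down. It uses only the system unit.

```pascal
program tutf8len;
{$mode objfpc}{$H+}

{ Hypothetical check: returns 0 when the test passes, 1 otherwise. }
function RunTest: Integer;
var
  S: AnsiString;
begin
  { 'ção' as explicit UTF-8 bytes, so the source encoding is irrelevant }
  S := #$C3#$A7#$C3#$A3 + 'o';
  if Length(S) = 5 then
    Result := 0
  else
    Result := 1;
end;

begin
  { The process exit code is what reports success or failure }
  Halt(RunTest);
end.
```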
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 7:49 AM, Florian Klaempfl wrote:
> Big deal, I simply enable operator overloading for unique string types to get this working, then everybody is happy and we've unicode support?

Indeed that could work. But the operator overloading would need to override the widestring manager. Actually it will be a bit confusing to have 2 methods to change the assignments: the widestring manager and the operator overloading.

We also need a {$ to set which string type string should be.

And then we need a set of RTL routines using utf8string, but Lazarus developers/users can write them after the other parts are working.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support in RTL - Roadmap
On Fri, Nov 21, 2008 at 11:30 AM, Michael Schnell wrote:
> http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support
> This page does not talk about UTF8Strings being counted in code elements vs in code points.

I only added the roadmap section, the rest of the content existed before. You are welcome to amend the content.

Regards,
- Graeme -

fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 11:51 AM, Felipe Monteiro de Carvalho wrote:
> It would be really good if there was a guide, preferably in the wiki, to explain how to add a new test case to Free Pascal. I have already some test cases in mind, like making sure static methods compile (an error in 2.2.2) and then after some discussion the utf-8 test cases. At the moment I can't add the test cases because I don't know how to.

For everything but deep compiler stuff, I would think fpcUnit should do perfectly. After all, that is what fpcUnit (or unit testing in general) is for. And unit tests can have a GUI or a text (console) test runner, the latter being handy for automated runs, daily or hourly. The tiOPF project does this, and we have around 1600 unit tests.

Regards,
- Graeme -

fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] Unicode support in RTL - Roadmap
> I only added the roadmap section, the rest of the content existed before. You are welcome to amend the content.

I'd rightfully be severely bashed by those who actually will be required to do the work ;) .

-Michael
Re: [fpc-devel] Unicode and Lazarus
Quoting Graeme Geldenhuys:
> On Fri, Nov 21, 2008 at 11:45 AM, Michael Schnell wrote:
> > In fact it should automatically convert (as correctly as possible) between all available string types (ANSI, UTF8, UTF16).
>
> And the compiler should produce a warning if you assign a UTF8 or UTF16 string to an ANSI string, mentioning that the conversion is not 100% possible and you stand a chance to lose data.

... and a possibility to tell the compiler 'Thanks, I know. Don't bark about this place any longer'.

Mattias
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 12:47 PM, Mattias Gärtner wrote:
> ... and a possibility to tell the compiler 'Thanks, I know. Don't bark about this place any longer'.

:-) Yes, definitely! Like the wish for "Parameter not being used" or "Sender not being used" etc... Those drive me nuts!

Regards,
- Graeme -

fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] wrong rtti default value in the fixes_2_2 branch (dont know about trunk)
Michael Van Canneyt wrote:
> I fixed the bug in trunk. Please do some tests in Lazarus with the 12114 revision of the compiler. If all works still OK and the testsuites don't give any regressions, I'll merge it to the fix branch.

Nothing bad happened here, at least nothing that I noticed. If you have no related tracker issues, then maybe you can merge your fix?

Best regards,
Paul Ishenin.
Re: [fpc-devel] wrong rtti default value in the fixes_2_2 branch (dont know about trunk)
On Fri, 21 Nov 2008, Paul Ishenin wrote:
> Michael Van Canneyt wrote:
> > I fixed the bug in trunk. Please do some tests in Lazarus with the 12114 revision of the compiler. If all works still OK and the testsuites don't give any regressions, I'll merge it to the fix branch.
>
> Here nothing bad happen - at least I had not note. If you have no related tracker issues then maybe you will merge your fix?

I will do so tonight.

Michael.
Re: [fpc-devel] Unicode support - for the 20th time... ;-)
In our previous episode, Felipe Monteiro de Carvalho said:
> > When Marco said yet another 2.2 fixes branch release, he meant 2.2.6.
>
> Ah, ok ... =) So my comment would then be changed to: Unicode is what is most discussed and needed at the moment. What is the point in making a major release without any major change? For me it doesn't matter if it will take time.

Both branches are diverging, and merging gets more difficult. Also, all changes (like a lot of alignment stuff for ARM) would be held up.

Note that I'm happy with either way. I just want a regular release schedule no matter what course (early 2.4 or late 2.4) is taken, and to paint the consequences of declaring the unicode stuff. I want to avoid the self-delusion of painting optimistic time schedules and saying that this time a major release preparation won't take as long as last time.

It is late November now, and 2.4 preparation hasn't started, so I have doubts we'll see 2.4 before summer, even _IF_ we don't wait for more unicode functionality (ansistring_with_UTF8, TEncoding).
Re: [fpc-devel] Unicode and Lazarus
On 21 Nov 2008, at 10:51, Felipe Monteiro de Carvalho wrote:
> It would be really good if there was a guide, preferably in the wiki, to explain how to add a new test case to Free Pascal.

It is documented here: http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/tests/readme.txt?view=markup

You can find tons of examples under http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/tests/

Description of the subdirectory names (stolen from Florian):
- test: systematic tests, usually developed by test driven development
- webtbs: tests derived from bug tracker bugs requiring successful compilation and run
- webtbf: tests derived from bug tracker bugs requiring failing of compilation
- tbs: tests derived from non tracker reports or ideas while fixing something requiring successful compilation and run
- tbf: tests derived from non tracker reports or ideas while fixing something requiring failing compilation

Jonas
Re: [fpc-devel] Unicode and Lazarus
On Fri, Nov 21, 2008 at 11:45 AM, Michael Schnell wrote:
> In fact it should automatically convert (as correctly as possible) between all available string types (ANSI, UTF8, UTF16).

And the compiler should produce a warning if you assign a UTF8 or UTF16 string to an ANSI string, mentioning that the conversion is not 100% possible and that you stand a chance to lose data.

Regards,
- Graeme -

fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
Re: [fpc-devel] Unicode support in RTL - Roadmap
Michael Schnell wrote:
> > I prefer it to be counted in bytes. If it is counted in Bytes then I can build a routine that counts in real chars. And we already have a lot of code to handle utf-8 inside ansistring which depends on that. Counting the elements in real chars is very inefficient.
>
> This is commonly agreed, But counting in code elements breaks old code counting in code points sometimes is more handy. That is why I vote for making the default syntax (s[i], pos(), copy(), ...) user selectable, while of course providing dedicated functions for both flavors.

If Length() returned its value in chars, what length in *bytes* would the following call set:

SetLength(utfstring_1, Length(utfstring_2));

??

Regards,
Sergei
Re: [fpc-devel] Unicode support in RTL - Roadmap
> If Length() would return its value in chars, what length in *bytes* would the following call set: SetLength(utfstring_1, Length(utfstring_2));

I don't really understand your question.

I think we would need to have two different functions, UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), the first giving the string length in code elements (bytes) and the second giving the length in code points (Unicode characters). So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.

I think we should have a third function, Length(UTF8String), that can be selected by the user (e.g. via a {$ option) to be mapped to either of the two.

The same would be necessary for the SetLength function, e.g.

(1) UTF8ElementSetLength(utfstring_1, UTF8ElementLength(utfstring_2));

or

(2) UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));

(2) would work as expected if the purpose is to delete all but the first n characters in a string.

I don't see a decent use for (1) other than creating a string long enough to use as a buffer for e.g. TStream.read.

I do see that there is in fact a compatibility problem when porting old code with the setting UTF8Count=Point. Here

SetLength(utfstring_1, Length(utfstring_2));

would be translated as

UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2));

which does not make sense if UTF8PointLength(utfstring_1) is smaller than UTF8PointLength(utfstring_2).

-Michael
Re: [fpc-devel] Unicode support in RTL - Roadmap
On 21 Nov 2008, at 14:50, Michael Schnell wrote:
> > If Length() would return its value in chars, what length in *bytes* would the following call set: SetLength(utfstring_1, Length(utfstring_2));
>
> I don't really understand your question. I think we would need to have two different functions UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), first giving the string length in code elements (byte) and second giving the length in code points (unicode characters). So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.

Or 2, depending on whether it's precomposed or decomposed.

> I think we should have a third function Length(UTF8String) that can be selected by the user (e.g. via a {$ option to be mapped to either of the two.

He's simply talking about the case where Length is mapped to your proposed UTF8PointLength.

> I do see that there in fact is a compatibility problem when porting old code with the setting of UTF8Count=Point. Here SetLength(utfstring_1, Length(utfstring_2)); would be translated as UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2)); which does not make sense if UTF8PointLength(utfstring_1) is smaller than UTF8PointLength(utfstring_2).

It does not make any sense under any circumstances, because there is no way for UTF8PointSetLength to know how many bytes it has to allocate when you pass a value (any value, regardless of where it comes from) to it.

Jonas
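The precomposed/decomposed point can be made concrete with a small sketch. It assumes the two standard Unicode spellings of 'Ü': the precomposed U+00DC, and the decomposed pair U+0055 followed by the combining diaeresis U+0308. `UTF8CodePointCount` is a hypothetical helper written for this example (it only skips continuation bytes and assumes valid UTF-8), and the strings are kept in `AnsiString` so that `Length` counts bytes.

```pascal
program utf8compose;
{$mode objfpc}{$H+}

{ Hypothetical helper: counts code points by skipping UTF-8
  continuation bytes (10xxxxxx). Assumes valid UTF-8 input. }
function UTF8CodePointCount(const S: AnsiString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
begin
  Precomposed := #$C3#$9C;        { U+00DC          -> bytes C3 9C    }
  Decomposed  := 'U' + #$CC#$88;  { U+0055 + U+0308 -> bytes 55 CC 88 }

  WriteLn('precomposed: ', Length(Precomposed), ' bytes, ',
          UTF8CodePointCount(Precomposed), ' code point(s)');
  WriteLn('decomposed:  ', Length(Decomposed), ' bytes, ',
          UTF8CodePointCount(Decomposed), ' code point(s)');

  { Same visible character, different counts in both units }
  if (Length(Precomposed) <> 2) or (UTF8CodePointCount(Precomposed) <> 1) then
    Halt(1);
  if (Length(Decomposed) <> 3) or (UTF8CodePointCount(Decomposed) <> 2) then
    Halt(1);
end.
```

So neither the byte count nor the code point count corresponds to "one character on screen" in general, which is why a code-point-based Length does not remove the need to understand the encoding.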
Re: [fpc-devel] Unicode support in RTL - Roadmap
Michael Schnell wrote:
> I don't really understand your question. I think we would need to have two different functions UTF8ElementLength(UTF8String) and UTF8PointLength(UTF8String), first giving the string length in code elements (byte) and second giving the length in code points (unicode characters). So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1. I think we should have a third function Length(UTF8String) that can be selected by the user (e.g. via a {$ option to be mapped to either of the two.
>
> The same would be necessary for the SetLength function e.g. (1) UTF8ElementSetLength(utfstring_1, UTF8ElementLength(utfstring_2)); or (2) UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2)); (2) would work as expected if the purpose is to delete all but the first n characters in a string. I don't see a decent use for (1) other than creating a string long enough to use as a buffer for e.g. TStream.read.
>
> I do see that there in fact is a compatibility problem when porting old code with the setting of UTF8Count=Point. Here SetLength(utfstring_1, Length(utfstring_2)); would be translated as UTF8PointSetLength(utfstring_1, UTF8PointLength(utfstring_2)); which does not make sense if UTF8PointLength(utfstring_1) is smaller than UTF8PointLength(utfstring_2).

The SetLength function is used mostly for allocating the storage for new strings. Yes, it can be used for truncating overlong strings, but truncating can be done perfectly well with Delete (or UTF8Delete). As you mentioned yourself, allocating utf-8 strings using a length in codepoints is senseless. This is exactly what I wanted to say initially.

What follows is that for calls like SetLength(str1, Pos('foo', str2)) you also cannot freely change the return value of Pos() from elements to codepoints. And so on, and so forth.

Regards,
Sergei
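To illustrate the distinction, truncating to a number of code points can be expressed on top of the byte-based SetLength by first converting the code point count into a byte offset. The sketch below is not the UTF8Delete mentioned above; `UTF8PrefixBytes` is a hypothetical helper invented for this example, and it assumes valid UTF-8 in an `AnsiString`.

```pascal
program utf8trunc;
{$mode objfpc}{$H+}

{ Hypothetical helper: number of bytes occupied by the first N code
  points of S (or Length(S) if S has fewer than N code points).
  Assumes valid UTF-8 input. }
function UTF8PrefixBytes(const S: AnsiString; N: SizeInt): SizeInt;
var
  I, Seen: SizeInt;
begin
  Seen := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then   { byte I starts a code point }
    begin
      if Seen = N then
        Exit(I - 1);                     { code point N+1 starts here }
      Inc(Seen);
    end;
  Result := Length(S);
end;

var
  S: AnsiString;
begin
  S := #$C3#$A7#$C3#$A3 + 'o';           { 'ção': 5 bytes, 3 code points }
  { Truncation works: the byte length of the prefix is computable }
  SetLength(S, UTF8PrefixBytes(S, 2));   { keeps 'çã' }
  if Length(S) <> 4 then
    Halt(1);
  { Growing does not: nothing says how many bytes the code points
    that are yet to be written will need. }
end.
```

The conversion only works in the shrinking direction because the bytes already exist and can be scanned; for allocation there is nothing to scan, which is the objection raised above.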
Re: [fpc-devel] Unicode support in RTL - Roadmap
> > So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.
>
> Or 2, depending on whether it's precomposed or decomposed.

I seem to remember that we discussed this some time ago, and the result was that the composed (MAC style ?) characters are in fact a single code point (Unicode character) that consists of two (maybe more ?) complete code points tied together by some special coding, so IMHO it can be considered a single Unicode character in both cases. If this would result in a huge table of possible composed characters, I think we would stick to the concept of providing decent functionality and restrict ourselves to those that are currently used by the customers we normally address (Mac in Europe and America). A method to provide an extended composition table should be provided so that those who really need it can help themselves.

> > which does not make sense if UTF8PointLength(utfstring_1) is smaller than UTF8PointLength(utfstring_2).
>
> It does not make any sense under any circumstances, because there is no way for UTF8PointSetLength to know how many bytes it has to allocate when you pass a value (any value, regardless of where it comes from) to it.

If UTF8PointLength(utfstring_1) is greater than UTF8PointLength(utfstring_2), no new bytes need to be allocated, and the function is just equivalent to

utfstring1 := UTF8PointCopy(utfstring1, 1, UTF8PointLength(utfstring_2));

To me this does not seem to pose any problem.

-Michael
Re: [fpc-devel] Unicode support in RTL - Roadmap
> you also cannot freely change the return value of Pos() from elements to codepoints.

Of course the counting needs to be consistent for all string functions. So changing it on the fly is dangerous (if you keep a count value in an integer variable). But this is up to the user.

-Michael
Re: [fpc-devel] Unicode support in RTL - Roadmap
On 21 Nov 2008, at 16:16, Michael Schnell wrote:
> > > So UTF8ElementLength('Ü') would be 2 and UTF8PointLength('Ü') would be 1.
> >
> > Or 2, depending on whether it's precomposed or decomposed.
>
> I seem to remember that we discussed this some time ago and the result was that the compose (MAC style ?)

Decomposed and precomposed have nothing to do with Windows vs Mac OS X vs Linux vs whatever. They are both equally valid ways to represent UTF strings and both have their uses (on all platforms). All programs should also be prepared to deal with them, since you never know what kind of input you will get.

> characters in fact are a single code point (Unicode character) that consists of two (maybe more ? ) complete code points that are tied together by some special coding, so IMHO it can be considered as a single Unicode character in both cases. If this would result in a huge table of possibly composed characters I think we would stick to the concept of providing a decent functionality and restrict on those that are currently used by the customers we normally address (Mac in Europe and America).

I think you are talking about a different "we". Further, inventing our own meanings of what a code point or unicode character means is an extremely bad idea (you'd also have to rename the UTF*Point* routines to UTF*FPCLikeChar* so they properly indicate the fact that they do not deal with code points). UTF by itself already has enough variations to deal with; we will not add our own.

> > > which does not make sense if UTF8PointLength(utfstring_1) is smaller than UTF8PointLength(utfstring_2).
> >
> > It does not make any sense under any circumstances, because there is no way for UTF8PointSetLength to know how many bytes it has to allocate when you pass a value (any value, regardless of where it comes from) to it.
>
> If UTF8PointLength(utfstring_1) is greater than UTF8PointLength(utfstring_2) no new bytes need to be allocated but the function is just equivalent to utfstring1 := UTF8PointCopy(utfstring1, 1, UTF8PointLength(utfstring_2)); To me this does not seem to impose any problem.

Except if the point is to reserve exactly enough space for utfstring1 and to overwrite its contents with something else afterwards (using move() or whatever). That's a very common use of setlength (at least in the FPC run time library, and I guess elsewhere as well). The fact that it also doesn't work if the string has to be made longer is basically the same problem.

Your system just does not work, and the more examples you give, the more it falls down, as far as I can see. Please first write a wiki page explaining how to deal with all cases, or at least noting which cases will not work. Only then is it possible to decide whether or not it is both feasible and worthwhile to go through the trouble of implementing all this. Without it, I feel I am mainly wasting my time writing these mails, because it seems you haven't thought it through yet at all.

Jonas
Re: [fpc-devel] Unicode support in RTL - Roadmap
If your point is that there is no way to allow legacy code to be used with a String type that holds UTF-8, and that it is not possible (or desirable) to allow code for simple cases that is understandable to someone who does not want to go into the complete depth of UTF-8, I can totally accept this.

But in that case the normal user just should not use UTF-8, but WideStrings, which in most European/American projects can be considered to be UCS-2 coded. (This is the way that D2009 seems to go.)

With that, of course, the UTF-8 API of the LCL is not at all desirable.

-Michael
Re: [fpc-devel] Unicode support in RTL - Roadmap
On Fri, 21 Nov 2008, Michael Schnell wrote:

> If your point is that there is no way to allow for legacy code to be used with a String type that holds UTF8 code and that it is not possible (or desirable) to allow for code used in simple occasions that is understandable to someone who does not want to go into the complete depth of the UTF8, I can totally accept this.

Legacy code that assumes ASCII can be used with UTF-8. Code that needs to deal with higher code points needs to be rewritten, and the user must understand the full UTF-8 spec. There is no way to hide this.

> But in that case the normal user just should not use UTF8 (but WideStrings that in most European/American Projects can be considered to be UCS2 coded). (This is the way that D2009 seems to go.)

I agree with your observation.

> With that of course the UTF8 API of LCL is not at all desirable.

LCL had its reasons to go UTF8.

Daniël
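The claim that ASCII-only legacy code keeps working on UTF-8 data, while anything touching higher code points does not, can be illustrated with a small sketch. It is written in Python for illustration only; split_on_ascii is a hypothetical helper, and the sample record is made up.

```python
def split_on_ascii(data, sep):
    """Byte-wise split on a single ASCII separator, as legacy code would do."""
    assert len(sep) == 1 and sep[0] < 0x80
    return data.split(sep)


# A made-up record with non-ASCII letters, stored as UTF-8 bytes.
line = "M\u00fcller;Cant\u00f9;\u00dcnal".encode("utf-8")

# Works: every byte of a multi-byte UTF-8 sequence has its high bit set,
# so an ASCII separator can never match inside an encoded character.
fields = [f.decode("utf-8") for f in split_on_ascii(line, b";")]
print(fields)

# The property the split relies on:
print(all(b >= 0x80 for b in "\u00fc\u00f9\u20ac".encode("utf-8")))  # True

# Does not work: a byte-wise, ASCII-only uppercase silently skips the
# non-ASCII letters instead of converting them.
shouted = line.upper().decode("utf-8")
print(shouted)
```

The split returns the three names intact, while the byte-wise uppercase leaves the lowercase ü and ù unchanged. This is the kind of code that compiles, seems to work, and still has to be rewritten.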
Re: [fpc-devel] Unicode support in RTL - Roadmap
Folks, before you waste your time again with endless discussions, have a look at Yury's work on a Unicode RTL, test it, and help with patches and suggestions. It's available in svn at http://svn.freepascal.org/svn/fpc/branches/unicodertl
Re: [fpc-devel] Unicode support in RTL - Roadmap
> Legacy code that assumes ASCII can be used in UTF-8. Code that needs to deal with higher code points needs to be rewritten

This is any program that formerly used (ANSI) String, is now automatically converted to use UTF8, and is to be released in Germany, France

> > With that of course the UTF8 API of LCL is not at all desirable.
>
> LCL had its reasons to go UTF8.

And thus forces all users to understand the full UTF-8 spec and to rewrite their programs, even though the old code compiles perfectly and, up to a certain extent, seems to work. This is what I think is not at all desirable :( .

-Michael
Re: [fpc-devel] Unicode support in RTL - Roadmap
From: Florian Klaempfl

> Folks, before you waste your time again with endless discussions, have a look at Yury's work on a Unicode RTL, test it, and help with patches and suggestions. It's available in svn at http://svn.freepascal.org/svn/fpc/branches/unicodertl

It works on win32 only for now. Only the system unit is finished. Work in progress...

Yury.
Re: [fpc-devel] Unicode support in RTL - Roadmap
On Fri, Nov 21, 2008 at 2:42 PM, Michael Schnell wrote:

> And thus forces all users to understand the full UTF-8 spec and to rewrite their programs, even though the old code perfectly compiles and up to a certain extent seems to work. This is what I think is not at all desirable :( .

Your comments are absolutely vague and meaningless. Not to mention they also don't propose an alternative. Sorry to be blunt, but so were your comments.

--
Felipe Monteiro de Carvalho
Re: [fpc-devel] Unicode support in RTL - Roadmap
Felipe Monteiro de Carvalho wrote:

> On Fri, Nov 21, 2008 at 2:42 PM, Michael Schnell wrote:
>
> > And thus forces all users to understand the full UTF-8 spec and to rewrite their programs, even though the old code perfectly compiles and up to a certain extent seems to work. This is what I think is not at all desirable :( .
>
> Your comments are absolutely vague and meaningless. Not to mention they also don't propose an alternative. Sorry to be blunt, but so were your comments.

I must agree with the "FPC cannot do it all automatically" line (as much as I regret it, and admit how beautiful it would be if FPC could).

What I mean is:

1) Any application/program that currently compiles and works (using non-UTF-8 strings, never mind whether ASCII or ANSI) will keep working if compiled in *non*-UTF-8 mode.

2) If such a program is to be extended to support UTF-8, decisions are needed that cannot be made without knowing what the program is doing, or even in which context within the same program an operation takes place. Such knowledge is only available to the programmer of the application, therefore the application must be changed to include these decisions. FPC simply cannot make them. (And even a {$SWITCH} would not solve the issue.)

An example is the composed and decomposed ü:

- If you edit a (human-readable) text, or search in a text, you certainly want to treat both representations as equal (a Find dialog must find both).

- If the same text editor saves the file, it must treat them as not equal. Assume the user has two files named wünsche.txt in the same folder. The file system allows this, because one name is decomposed and the other is composed. If the user opened the text from the composed version, it must be written back to the composed version. If the user opened it from the decomposed version, it must be written back to the decomposed version. Otherwise a completely unrelated file would simply be overwritten and its contents lost. (The same applies if the application iterates through the directory contents and compares file names. So here the same comparison that the Find dialog uses must behave differently.)

FPC simply cannot know whether a string contains a file name, which must be kept exactly as it is, or human-readable text, which would benefit from normalisation.

If you are going to put a compiler switch in front of each statement to indicate what is needed, you may as well change the statements. There is no single switch setting for the whole application, as both of the above examples occur within a single application.

You could use two different UTF8String types that behave differently on decomposed chars (I am *not* proposing this as a solution). But then you cannot just recompile your app by saying "string now means UTF8String" throughout the whole application. You again have to go through all of the source code and edit the app. So you may as well just go through the source code and add the appropriate UTF-8 clean-up calls to those parts of the code that need them.

In the end, switching an application to Unicode means that, within the same app, different parts are going to need different handling of Unicode (where no such difference existed for ASCII/ANSI). And no compiler can figure out which part will need which behaviour.
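The wünsche.txt example can be sketched as two comparison functions that must coexist in one program. This is an illustration in Python, not a proposed FPC API; same_text and same_file_name are hypothetical names, and the sketch assumes a file system that stores file names exactly as given, as in the scenario described above.

```python
import unicodedata

composed = "w\u00fcnsche.txt"      # ü as the single code point U+00FC
decomposed = "wu\u0308nsche.txt"   # u followed by U+0308 COMBINING DIAERESIS


def same_text(a, b):
    """Comparison for human-readable text (Find dialog): normalize first."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def same_file_name(a, b):
    """Comparison for file names: exact code point sequence, no normalization.

    Assumes the file system keeps names exactly as written.
    """
    return a == b


print(same_text(composed, decomposed))        # True: the search must find both
print(same_file_name(composed, decomposed))   # False: two distinct files
```

Which of the two functions applies depends on what the string means, and only the programmer knows that. No global switch can pick one for the whole application.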
Re: [fpc-devel] wrong rtti default value in the fixes_2_2 branch (dont know about trunk)
On Fri, 21 Nov 2008, Paul Ishenin wrote:

> Michael Van Canneyt wrote:
>
> > I fixed the bug in trunk. Please do some tests in Lazarus with the 12114 revision of the compiler. If all still works OK and the testsuites don't give any regressions, I'll merge it to the fixes branch.
>
> Nothing bad happened here, at least nothing that I noticed. If you have no related tracker issues, then maybe you will merge your fix?

Merged.

Michael.
[fpc-devel] new 27 page document describing Unicode support in D2009
Hello,

I thought you guys might find this interesting. It's a new 27 page document describing Unicode support in D2009.

http://dn.codegear.com/article/38980

--
Abstract: Learn more about the new Unicode support in Delphi 2009 and CodeGear RAD Studio 2009 in this white paper by Marco Cantù

Delphi and Unicode

One of the most relevant new features of Delphi 2009 is its complete support for the Unicode character set. While Delphi applications written exclusively for the English language and based on a 26-character alphabet were already working fine and will keep working fine in Delphi 2009, applications written for most other languages spoken around the world will have a distinct benefit by this change. Learn more about Unicode in Delphi 2009 and CodeGear RAD Studio 2009 in this white paper.
--

Regards,
 - Graeme -

fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/
[fpc-devel] Re: new 27 page document describing Unicode support in D2009
On Fri, Nov 21, 2008 at 11:08 PM, Graeme Geldenhuys wrote:

> I thought you guys might find this interesting. It's a new 27 page document describing Unicode support in D2009.
>
> http://dn.codegear.com/article/38980

Seeing that I don't own D2009 and have only read about its Unicode support, I found some of the information interesting, and it covers things we argued about on this mailing list. For example:

1... Length() returns the bytes for UTF8String, but Length() returns the elements (what we know as characters) for String or UTF16 strings. Length() also returns bytes for AnsiString.

-----
  var
    str8: Utf8String;
    str16: string;
  begin
    str8 := 'Cantù';
    Memo1.Lines.Add ('UTF-8');
    Memo1.Lines.Add('Length: ' + IntToStr (Length (str8)));
    Memo1.Lines.Add('5: ' + IntToStr (Ord (str8[5])));
    Memo1.Lines.Add('6: ' + IntToStr (Ord (str8[6])));
    str16 := str8;
    Memo1.Lines.Add ('UTF-16');
    Memo1.Lines.Add('Length: ' + IntToStr (Length (str16)));
    Memo1.Lines.Add('5: ' + IntToStr (Ord (str16[5])));
  end;

As you might expect, the str8 string has a length of 6 (meaning 6 bytes), while the str16 string has a length of 5 (meaning 10 bytes, though). Notice that Length invariably returns the number of string elements, which in case of variable-length representations don't match the number of Unicode code points represented by the string. This is the output of the program:

  UTF-8
  Length: 6
  5: 195
  6: 185
  UTF-16
  Length: 5
  5: 249
-----

2... TStrings can now take an encoding parameter to specify how it should load or save files.

-----
STREAMING TSTRINGS

The ReadFromFile and WriteToFile methods of the TStrings class can be called with an encoding. If you write a string list to text file without providing a specific encoding, the class will use TEncoding.Default, which uses the internal DefaultEncoding in turn extracted at the first occurrence by the current Windows code page. In other words, if you save a file you'll get the same ANSI file as before. Of course, you can also easily force the file to a different format, for example the UTF-16 format:

  Memo1.Lines.SaveToFile('test.txt', TEncoding.Unicode);
-----

Anyway, there are a lot more interesting facts in this document. Well worth reading to get a better understanding of Unicode.

Regards,
 - Graeme -
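The numbers in the white paper excerpt can be reproduced without Delphi. The following is a minimal sketch in Python, for illustration only: it encodes the same string by hand and counts code units, so it mirrors what Length and indexing report in the excerpt rather than calling any Delphi or FPC routine. Pascal string indices are 1-based, so str8[5] corresponds to byte offset 4 here.

```python
s = "Cant\u00f9"                    # 'Cantù', 5 code points

utf8 = s.encode("utf-8")            # 1 byte per code unit
utf16 = s.encode("utf-16-le")       # 2 bytes per code unit, no BOM

# UTF-8: Length counts bytes, and ù occupies two of them.
print("UTF-8")
print("Length:", len(utf8))         # 6
print("5:", utf8[4])                # 195 (0xC3)
print("6:", utf8[5])                # 185 (0xB9)

# UTF-16: Length counts 16-bit code units, not bytes.
utf16_units = len(utf16) // 2
fifth_unit = int.from_bytes(utf16[8:10], "little")
print("UTF-16")
print("Length:", utf16_units)       # 5 code units ...
print("Bytes:", len(utf16))         # ... stored in 10 bytes
print("5:", fifth_unit)             # 249 (0x00F9)
```

In both encodings the reported length is a count of elements of the storage type; it happens to equal the number of code points only for the UTF-16 string, because this string contains nothing outside the Basic Multilingual Plane.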
Re: [fpc-devel] Re: new 27 page document describing Unicode support in D2009
In our previous episode, Graeme Geldenhuys said:

> > I thought you guys might find this interesting. It's a new 27 page document describing Unicode support in D2009.
> >
> > http://dn.codegear.com/article/38980
>
> Seeing that I don't own D2009 and only read about its Unicode support I found some of the information interesting - and it was things we argued about in this mailing list.

This is all information that has already been on the blogs since July.

Note that TCharacter is a sealed class, something that FPC doesn't support yet.

The whole "TEncoding/TCharacter is a bastard class" business seems to stem from .NET compatibility (as noted in the document), but Borland changed the course of its .NET efforts after Tiburon. Sigh.
Re: [fpc-devel] Re: new 27 page document describing Unicode support in D2009
Graeme Geldenhuys wrote:

> On Fri, Nov 21, 2008 at 11:08 PM, Graeme Geldenhuys wrote:
>
> > I thought you guys might find this interesting. It's a new 27 page document describing Unicode support in D2009.
> >
> > http://dn.codegear.com/article/38980
>
> Seeing that I don't own D2009 and only read about its Unicode support I found some of the information interesting - and it was things we argued about in this mailing list. For example:
>
> 1... Length() returns the bytes for UTF8String but Length() returns the elements (what we know as characters) for String or UTF16 strings.

No. Length for String returns the number of code units (the number of WideChar elements in the UnicodeString case). When there are surrogate pairs, the number of code points (characters) and the number of code units differ. See the excerpt:

-----
A way to create a string with surrogate pairs is to use the ConvertFromUtf32 function that returns a string with the surrogate pair (two WideChar) in the proper circumstances, like the following:

  var
    str1: string;
  begin
    str1 := 'Surr. ' + ConvertFromUtf32($1D11E);

Now if you ask for the string length, you'll get 8, which is the number of WideChar, but not the number of logical Unicode code points in the string. If you print the string you get the proper effect (well, at least Windows will generally show one square block as placeholder of the surrogate pair, rather than two).
-----

Luiz
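The surrogate pair case from the excerpt can be checked the same way. This is a minimal sketch in Python, for illustration only: the string is encoded to UTF-16 by hand, so the count of 8 corresponds to the number of WideChar elements that Length reports in the excerpt.

```python
# 'Surr. ' (6 code points) followed by U+1D11E MUSICAL SYMBOL G CLEF,
# which lies outside the Basic Multilingual Plane.
s = "Surr. " + chr(0x1D11E)

utf16 = s.encode("utf-16-le")              # 2 bytes per code unit, no BOM
code_points = len(s)
code_units = len(utf16) // 2

print("code points:", code_points)         # 7
print("UTF-16 code units:", code_units)    # 8

# The last two code units are the surrogate pair for U+1D11E.
high = int.from_bytes(utf16[12:14], "little")
low = int.from_bytes(utf16[14:16], "little")
print(hex(high), hex(low))                 # 0xd834 0xdd1e
```

So even with UTF-16 strings, the length in code units is only an upper bound on the number of code points.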
Re: [fpc-devel] Re: new 27 page document describing Unicode support in D2009
Graeme Geldenhuys wrote:

> On Fri, Nov 21, 2008 at 11:08 PM, Graeme Geldenhuys wrote:
>
> > I thought you guys might find this interesting. It's a new 27 page document describing Unicode support in D2009.
> >
> > http://dn.codegear.com/article/38980
>
> Seeing that I don't own D2009 and only read about its Unicode support I found some of the information interesting - and it was things we argued about in this mailing list.

Well, with the exception of the class helper for TStrings (notably, they call it a hack themselves :), the design looks rather clean. Since each string stores its element size, both ANSI and Unicode strings are probably handled with a common set of procedures, avoiding RTL size bloat.

And they explain why there is no compiler option for switching back and forth.

Unfortunately, the article does not provide information about how things like Pos() and Copy() work with UTF-8 strings. However, one may read the words "utf-8 support is more limited than utf-16" as meaning that they continue to work with elements (bytes).

Regards,
Sergei
Re: [fpc-devel] Re: new 27 page document describing Unicode support in D2009
Sergei Gorelkin wrote:

> Well, with exclusion of the class helper for TStrings (notable is that they call it a hack themselves :) the design looks rather clean. Since each string stores its element size, both ansi and unicode strings are probably handled with common set of procedures, avoiding RTL size bloat.

I also like the design, since it is flexible enough to allow the programmer to work with different encodings.

> And they explain why there is no compiler option for switching back and forth. Unfortunately, the article does not provide information about how things like Pos() and Copy() work with utf8 strings.

Here ( http://www.jacobthurman.com/?p=30 , see the comments) there's an explanation of those functions. Basically, they handle code units and not code points (characters).

> However, one may understand words utf-8 support is more limited than utf-16 as they continue to work with elements (bytes).

Yes. This is a good decision also, IMO.

Luiz
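What "Pos() and Copy() handle code units" means for a UTF-8 string can be sketched as follows. This is Python for illustration only; pos_units and copy_units are hypothetical byte-wise imitations of Pascal's 1-based Pos and Copy, written under the assumption that the functions behave as described in the linked comments.

```python
def pos_units(sub, data):
    """Pascal-style Pos on code units: 1-based byte index, 0 if not found."""
    return data.find(sub) + 1


def copy_units(data, index, count):
    """Pascal-style Copy on code units: 1-based start index, count in bytes."""
    return data[index - 1:index - 1 + count]


s = "Cant\u00f9!".encode("utf-8")       # 'Cantù!', 6 code points, 7 bytes

# Pos reports a byte position; '!' is the 6th character but the 7th byte.
bang = pos_units(b"!", s)
print(bang)                             # 7

# Copy with an arbitrary count can cut a multi-byte sequence in half.
head = copy_units(s, 1, 5)              # 'Cant' plus the first byte of ù
try:
    head.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                            # False

# Indices obtained from Pos are still safe to feed back into Copy,
# because both count in the same unit.
before = copy_units(s, 1, pos_units("\u00f9".encode("utf-8"), s) - 1)
print(before.decode("utf-8"))           # Cant
```

The functions stay cheap and mutually consistent, and the burden of never cutting inside a sequence is left to the caller.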
Re: [fpc-devel] Unicode support in RTL - Roadmap
Graeme Geldenhuys wrote:

> Hi,
>
> I have added a Roadmap section in the following wiki page. If you find anything missing or not 100% implemented, please add it to the wiki page.
>
> http://wiki.freepascal.org/FPC_Unicode_support#Roadmap_of_RTL_Unicode_support

I started a wiki page to list the use cases where developers (FPC users) are facing problems when dealing with Unicode. This can be useful to define what programmers expect from FPC's Unicode support. Optionally, suggestions can be made as to how FPC could handle each case.

http://wiki.freepascal.org/unicode_use_cases

Luiz