subject:"[fpc-devel] String handling in trunk (was utf8 in 2.6.0)"


On 01/05/2013 12:28 PM, Jonas Maebe wrote:
Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8 
encoding of that character.
Sorry, I can't follow. Does #xx not just define a numerical 
representation of an 8 bit entity ?


The interpretation in any code might be done later by any code that 
digests the string.


Am I wrong ?

-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 01/05/2013 01:35 PM, Jy V wrote:

I do vote for UTF-8

-1

Given that conversions in the RTL (or LCL) are a rather rare runtime 
task, GUI performance issues do not really need to be considered.


Viable issues seem to be Delphi compatibility, backward compatibility, 
usability, runtime performance with time-consuming complex string tasks 
(these seem to vote against UTF-8, but for either static UTF-16 or 
(quasi-)dynamic (CE-alike) encoding), and memory usage and 
runtime performance with time-consuming simple string tasks (which vote 
for locale-based ANSI or UTF-8).


-Michael

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Ewald

Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell
said:
 On 01/05/2013 12:28 PM, Jonas Maebe wrote:
 Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
 encoding of that character.
 Sorry, I can't follow. Does #xx not just define a numerical
 representation of an 8 bit entity ?

 The interpretation in any code might be done later by any code that
 digests the string.

 Am I wrong ?
I *think* Jonas is trying to say that if you want the character `Ǿ` in a
string you would either type
- 'Ǿ' or
- #$C7#$BE if you want to keep the source free of encoding specific
characters

You as a programmer make up what you do with it afterwards, if you
decide to write it to an UTF-8 terminal, you would get `Ǿ`, and if you
write it to some other terminal you might see a character that matches
$C7, followed by a character that matches $BE in the lookuptable of the
encoding of the terminal. Look at it this way: the byte sequence ($C7,
$BE) has got no meaning to the compiler whatsoever, it is a byte
sequence. That's what matters to the compiler, what is in this sequence
is for you to decide.
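
A minimal sketch of this point in Free Pascal, assuming no {$codepage}
directive is active and that string literals default to 8-bit characters.
The helper name HexBytes is made up for illustration. It only shows that
the literal holds two byte values; nothing in the string records how they
are meant to be interpreted.

```pascal
program ByteSeq;
{$mode objfpc}{$H+}

{ Returns the bytes of s as space-separated hex pairs. }
function HexBytes(const s: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + HexStr(Ord(s[i]), 2);
  end;
end;

var
  s: AnsiString;
begin
  s := #$C7#$BE;          { two 8-bit values, no encoding attached }
  WriteLn(Length(s));     { number of bytes in the string }
  WriteLn(HexBytes(s));   { the raw byte values }
end.
```

Whether a terminal then shows one character or two depends only on the
encoding the terminal assumes, not on anything stored in s.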

Correct me if I'm wrong.

-- 
Ewald


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Tomas Hajny

On Mon, January 7, 2013 13:28, Ewald wrote:
 Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell
 said:
 On 01/05/2013 12:28 PM, Jonas Maebe wrote:
 Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
 encoding of that character.
 Sorry, I can't follow. Does #xx not just define a numerical
 representation of an 8 bit entity ?

 The interpretation in any code might be done later by any code that
 digests the string.

 Am I wrong ?
 I *think* Jonas is trying to say that if you want the character `Ǿ` in a
 string you would either type
 - 'Ǿ' or
 - #$C7#$BE if you want to keep the source free of encoding specific
 characters
 .
 .

...or
- #$01FE and then the whole string becomes a Unicode string which is
either kept that way (if it is assigned to a UnicodeString constant), or
it is converted to some 8-bit encoding at compile time (if it is assigned
to an 8-bit constant/variable like ansistring)

(also just my understanding of what Jonas wrote)
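
A small sketch of the third spelling, assuming a compiler that has the
UnicodeString type and the System routine UTF8Encode. The helper name
Utf8Hex is hypothetical. It shows the #$01FE constant as a single 16-bit
unit, and the explicit conversion that yields the two UTF-8 bytes
mentioned earlier.

```pascal
program WideConst;
{$mode objfpc}{$H+}

{ Hex dump of the UTF-8 form of u, produced by an explicit UTF8Encode. }
function Utf8Hex(const u: UnicodeString): AnsiString;
var
  s: UTF8String;
  i: Integer;
begin
  s := UTF8Encode(u);
  Result := '';
  for i := 1 to Length(s) do
    Result := Result + HexStr(Ord(s[i]), 2);
end;

var
  u: UnicodeString;
begin
  u := #$01FE;                     { value above 255: a widechar constant }
  WriteLn(Length(u));              { one UTF-16 code unit }
  WriteLn(HexStr(Ord(u[1]), 4));   { its numeric value }
  WriteLn(Utf8Hex(u));             { the same character as UTF-8 bytes }
end.
```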

Tomas



Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

So the ambiguity  with _filling_ a string with data in fact arises when 
_not_ using the #nn notation :-) . With #nn the effect (i.e. the 
resulting binary) is obvious.


-Michael

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 01/07/2013 02:01 PM, Tomas Hajny wrote:

(also just my understanding of what Jonas wrote)


I feel you are wrong. The string does not know about the code its 
content is to be interpreted in (other than with Delphi XE).


-Michael

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Ewald

Once upon a time, on 01/07/2013 02:17 PM to be precise, Michael Schnell
said:
 So the ambiguity  with _filling_ a string with data in fact arises
 when _not_ using the #nn notation :-) . With #nn the effect (i.e. the
 resulting binary) is obvious.
Well, if there is literally the sequence $C7, $BE in your source code
(that is, open up a hex editor and actually see the values there, as one
byte each) that would also do the same, as the compiler will default to
one-byte strings, I think. The only issue with this is that you also need
to set your code editor to the encoding you want, because otherwise it
will screw up the display and possibly the binary value of the character.

So, yes I would say the #nn notation is probably the safest to use, also
handy if your character contains (or is) something that `cannot be
there`, like a newline: #10 (or #13#10 under windows)

Also, if you use a literal utf-16 char in the code (so no #, but the
actual character) I think the {$codepage utf16} directive might come in
handy, as otherwise the compiler will interpret this series of bytes as
separate single-byte characters. This is however not an issue with the
# notation, as there is no ambiguity with this interpretation.

-- 
Ewald


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Tomas Hajny

On Mon, January 7, 2013 14:19, Michael Schnell wrote:
 On 01/07/2013 02:01 PM, Tomas Hajny wrote:
 (also just my understanding of what Jonas wrote)

 I feel you are wrong. The string does not know about the code its
 content is to be interpreted in (other than with Delphi XE).

Sorry, your way of quoting makes it difficult for others to react.

I freely admit that I may be wrong, but I don't understand what you meant
with your comment and thus I don't understand in what way I am wrong
in your view. The compiler obviously knows how the constant is used within
the source code and thus it may proceed accordingly (i.e. either convert
it to some 8-bit encoding at compile time if UTF-16 code constant appears
in the source, or keep it in UTF-16 if assigned to a UnicodeString
constant).

Tomas



Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Ewald

Once upon a time, on 01/07/2013 05:05 PM to be precise, Tomas Hajny said:
 On Mon, January 7, 2013 14:19, Michael Schnell wrote:
 On 01/07/2013 02:01 PM, Tomas Hajny wrote:
 (also just my understanding of what Jonas wrote)
 I feel you are wrong. The string does not know about the code its
 content is to be interpreted in (other than with Delphi XE).
 Sorry, your way of quoting makes it difficult for others to react.

 I freely admit that I may be wrong, but I don't understand what you meant
 with your comment and thus I don't understand in what way I am wrong
 in your view. The compiler obviously knows how the constant is used within
 the source code and thus it may proceed accordingly (i.e. either convert
 it to some 8-bit encoding at compile time if UTF-16 code constant appears
 in the source, or keep it in UTF-16 if assigned to a UnicodeString
 constant).
Yep, the compiler does know how the constant is used and how it is
defined (how else could it generate working code?), but I don't see how
it could do something with it if it is assigned to another type of
string (by type I mean `one-byte versus two-byte`). The compiler can't
know for sure what you mean, it can do at least these things:
  - Copy data without translating, so a one-char two-byte string becomes
a two-char one-byte string; a three-char one-byte string would become a
three-char two-byte string; and then there is a paradox: should a
three-char two-byte string become a six-char one-byte string? ==> this
is probably not how it is done
  - Translate the meanings of the characters of the string, but here the
compiler needs to know in what encoding they are and in what encoding
the string is wanted (which it doesn't, I believe; the $codepage
directive is only used for the encoding of the characters in the unit
itself) ==> I think this also isn't a possibility
  - Copy the data byte per byte, but then a one-byte string containing
an odd number of chars needs padding, and there are issues with
endianness here ==> not really an option, no?
  - Truncate every value of a two-byte string to convert it to a
one-byte string; the other way around would put each character of the
one-byte string as one in the two-byte string ==> solves the first
paradox, but introduces loss of data

==> All the above options (except the translation, that is) ignore the
escape character(s) of the string, so you won't get the data you want.

IMO I don't think it (typecasting a one-byte string to a two-byte
string) can be done without human intervention. Look at it this way:
typecasting a thread handle to an integer makes no sense either:
  - They are both related (a thread handle is definitely a number, even
if it is a pointer)
  - But putting one in the other makes no sense at all: what does
`comparing whether a thread id is less than zero` mean? on the other
hand `comparing whether an integer is less than zero` has a distinct
meaning.
  - The sizes may be different (say an integer of 16 bit long and a
thread handle of 64 bit long), how do you put one in the other? Sum the
bytes together? Multiply them? Take the 16 bit CRC of the handle?

This is IMO the same with a one-byte char and a two byte char:
 - They both represent letters/words/...
 - But they are not the same and cannot be typecast without extra
knowledge.

This last point is also valid for my example above: you could put all
thread ids you know of in a lookup-table and put the index in that
lookup-table in the 16-bit integer. Fixed. Same goes for our strings: if
you know one is UTF-8 and you want to convert it to UTF-16 it can be
done without error, but without this extra knowledge it can't give you
decisive results.
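
A rough sketch of the contrast, assuming the System routine UTF8Decode is
available. WidenBytes is a hypothetical helper written for this example.
It sets a blind byte-to-unit widening, which uses no knowledge of the
encoding, next to an explicit UTF-8 decode, which is only correct because
the caller asserts that the bytes are UTF-8.

```pascal
program WidenVsDecode;
{$mode objfpc}{$H+}

{ Blind widening: every byte becomes one 16-bit unit. }
function WidenBytes(const s: AnsiString): UnicodeString;
var
  i: Integer;
begin
  SetLength(Result, Length(s));
  for i := 1 to Length(s) do
    Result[i] := WideChar(Ord(s[i]));
end;

var
  s: AnsiString;
  a, b: UnicodeString;
begin
  s := #$C7#$BE;
  a := WidenBytes(s);   { two units with the values $00C7 and $00BE }
  b := UTF8Decode(s);   { one unit, $01FE, since we declared s to be UTF-8 }
  WriteLn(Length(a), ' ', Length(b));
end.
```

Both results are "valid" strings; which one is meaningful depends entirely
on the extra knowledge about what the bytes in s were supposed to be.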

Just a few points I think bear some potential to contemplate over a cup
of $c0ffee ;-)

-- 
Ewald


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Mark Morgan Lloyd


Tomas Hajny wrote:

On Mon, January 7, 2013 13:28, Ewald wrote:

Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell
said:

On 01/05/2013 12:28 PM, Jonas Maebe wrote:

Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
encoding of that character.

Sorry, I can't follow. Does #xx not just define a numerical
representation of an 8 bit entity ?

The interpretation in any code might be done later by any code that
digests the string.

Am I wrong ?

I *think* Jonas is trying to say that if you want the character `Ǿ` in a
string you would either type
- 'Ǿ' or
- #$C7#$BE if you want to keep the source free of encoding specific
characters

 .
 .

...or
- #$01FE and then the whole string becomes a Unicode string which is
either kept that way (if it is assigned to a UnicodeString constant), or
it is converted to some 8-bit encoding at compile time (if it is assigned
to an 8-bit constant/variable like ansistring)

(also just my understanding of what Jonas wrote)


That's how I read it as well. In which case, is #A3 16-bit Unicode 
(representing the UK £ Sterling) or malformed UTF-8 (should be #c2#a3)?
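
The "malformed UTF-8" half of the question can be checked from the bit
pattern alone. This is a sketch that assumes #A3 is meant as the byte value
$A3; IsContinuation is a hypothetical helper. In UTF-8, a byte of the form
10xxxxxx may only continue a sequence, so $A3 on its own cannot be a
complete character, while $C2 $A3 is the two-byte form of U+00A3.

```pascal
program LeadOrCont;
{$mode objfpc}{$H+}

{ True if b has the bit pattern 10xxxxxx (a UTF-8 continuation byte). }
function IsContinuation(b: Byte): Boolean;
begin
  Result := (b and $C0) = $80;
end;

begin
  WriteLn(IsContinuation($A3));   { TRUE: can only continue a sequence }
  WriteLn(IsContinuation($C2));   { FALSE: this is a two-byte lead byte }
end.
```

This says nothing about how the compiler treats the constant; it only
shows why the single byte is not well-formed UTF-8.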


--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-07 Thread Aleksa Todorovic

On Mon, Jan 7, 2013 at 6:05 PM, Mark Morgan Lloyd 
markmll.fpc-de...@telemetry.co.uk wrote:

 Tomas Hajny wrote:

 On Mon, January 7, 2013 13:28, Ewald wrote:

 Once upon a time, on 01/07/2013 12:39 PM to be precise, Michael Schnell
 said:

 On 01/05/2013 12:28 PM, Jonas Maebe wrote:

 Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8
 encoding of that character.

 Sorry, I can't follow. Does #xx not just define a numerical
 representation of an 8 bit entity ?

 The interpretation in any code might be done later by any code that
 digests the string.

 Am I wrong ?

 I *think* Jonas is trying to say that if you want the character `Ǿ` in a
 string you would either type
 - 'Ǿ' or
 - #$C7#$BE if you want to keep the source free of encoding specific
 characters

  .
  .

 ...or
 - #$01FE and then the whole string becomes a Unicode string which is
 either kept that way (if it is assigned to a UnicodeString constant), or
 it is converted to some 8-bit encoding at compile time (if it is assigned
 to an 8-bit constant/variable like ansistring)

 (also just my understanding of what Jonas wrote)


 That's how I read it as well. In which case, is #A3 16-bit Unicode
 (representing the UK £ Sterling) or malformed UTF-8 (should be #c2#a3)?


The way I understand it is that #A3 will be affected by the $codepage directive
of the source file. So, if the programmer correctly sets $codepage to match the
encoding used in the editor (be it utf8 or some other encoding), the compiler will
also 'understand' that string correctly.

If the programmer never uses UnicodeString, and always uses the codepage which was
used to write the source code, everything will work fine - #A3 will stay
whatever it is in that specific encoding.

On the other hand, if a situation arises in which a string containing #A3
needs to be converted to UnicodeString, the compiler will either: a) convert it
correctly to UnicodeString if the encoding used is utf8, or b) call a
system-specific function to convert the string to an array of WideChars (in which
case, the correctness of the program depends on support for the specific encoding
on the target system).

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-06 Thread Hans-Peter Diettrich


Michael Van Canneyt schrieb:

IMO resource strings are for display purposes, so that UTF-8/16 
encoding is expected by an OS API.  AFAIR Win32 string resources are 
stored in UTF-16,


You are very much wrong.


Not really. I was talking about Win32 resources, not about what FPC 
makes from resourcestring.


To start with, resource strings are not stored as Win32 resources. 
Secondly, they are stored in the code as an ansistring.


The resource string of the above example is stored as:

.globl  _$PROGRAM$_Ld2
_$PROGRAM$_Ld2:
.ascii  "Something\000"
.balign 8
.short  0,1
.long   0
.quad   -1,15
.globl  _$PROGRAM$_Ld3

Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. If it 
were, I would have used widestrings instead of ansistring to begin with, 
and in that case I would not have made any remark...


I don't know which OS you're using, but the WinAPI uses UTF-16 
throughout. I suppose that other OS also use some Unicode string 
representation, for lossless representation of texts of all languages.


The dual W/A interface of Win32 is due to the stripped-down Win9x 
versions, which require Unicode extensions for supporting more than 
CP_ACP. But now we are in 2013, with Unicode being present everywhere.



So the conversion really would be 100% totally redundant.


It may look so to you...

Why then do you use resourcestring instead of ordinary string constants?


Another note and question, about multi-lingual resources. Windows 
resource scripts (.RC) allow for multi-lingual stringtables. In my 
recent research I learned that the resource compiler extracts the 
requested language from the script, and stores only these strings in the 
resource file (.RES) and application (.EXE, .DLL). That's why 
resourcestring was added to Delphi.


How does FPC support the same? (.PO files?)

DoDi


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-06 Thread Hans-Peter Diettrich


Paul Ishenin schrieb:

05.01.13, 23:54, Michael Van Canneyt пишет:


You are very much wrong.

To start with, resource strings are not stored as Win32 resources.


I personally think that resources should be stored in their native 
formats where possible. This would allow changing them using software 
designed for that environment. For example, for Windows there are many 
resource editors which can replace icons, bitmaps and string resources 
too. It would be nice to have this ability also for binaries which FPC 
produces. On OS X resources are also stored differently from what FPC does 
currently - they are stored in application bundles, as far as I know, so they 
can be edited by external programs too.


Point taken :-)

But I'm not sure about nowadays use of native resources. Even on Windows 
most programs nowadays don't use Windows resources for their menus, 
dialog boxes etc. any more. I've used the Delphi ResourceWorkshop for 
some time, to tweak some third party programs and even Windows itself.


This will be almost impossible with current software. Try e.g. to set 
the Windows menu color to yellow, what I did for a long time, and you'll 
find out that the Explorer and many other Windows tools don't honor that 
setting. Or you'll find that these system settings have been removed 
altogether, replaced perhaps by themes?


So I'm not sure about the use of native resources, nowadays. How should 
a multi-platform application handle a string or graphical (icon...) 
resource, so that it can be designed on one platform, and be shown on 
all other platforms without modifications?


With graphical resources I'd use a single internal (FPC) format, which 
is converted by the widgetset for use on the target platform. String 
resources may require more adjustments than only a translation, to match 
the different semantics of other languages - independently from the 
target platform.


That's why I'd suggest UTF-8 encoding for resource strings, what doesn't 
affect program logic because AnsiString still can be used. The *encoded* 
AnsiStrings require that the coder knows about the best encoding of 
every string, when he wants to reduce the number of implicit string 
conversions. Using AnsiString(CP_ACP) may be a reasonable decision for 
use in a program with *very* limited usage (one country, one language, 
one target platform...), but FPC should support programs with a broader 
audience as well.


DoDi


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-06 Thread Michael Van Canneyt




On Sun, 6 Jan 2013, Hans-Peter Diettrich wrote:


Michael Van Canneyt schrieb:

IMO resource strings are for display purposes, so that UTF-8/16 encoding 
is expected by an OS API.  AFAIR Win32 string resources are stored in 
UTF-16,


You are very much wrong.


Not really. I was talking about Win32 resources, not about what FPC makes 
from resourcestring.


The discussion is about unnecessary conversions in *FPC resourcestrings*, 
not about win32 resources.


Why you brought up the Windows resourcestrings was (and is) a mystery to me.
From your statement, I assumed that you probably thought FPC stores its 
resourcestrings as win32 resources. It does not.

To start with, resource strings are not stored as Win32 resources. 
Secondly, they are stored in the code as an ansistring.


The resource string of the above example is stored as:

.globl  _$PROGRAM$_Ld2
_$PROGRAM$_Ld2:
.ascii  "Something\000"
.balign 8
.short  0,1
.long   0
.quad   -1,15
.globl  _$PROGRAM$_Ld3

Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. If it 
were, I would have used widestrings instead of ansistring to begin with, 
and in that case I would not have made any remark...


I don't know which OS you're using, but the WinAPI uses UTF-16 throughout.


I use both windows and Linux.

You are mistakenly assuming that I am using Windows GUI calls or so. 
There is no GUI.


Probably the only call that cares about codepage is FileCreate(), and that is 
not done using resource strings.
For the rest, all is done using FileWrite() and sendto()/recvfrom().
Both do not care about encoding. They transfer bytes, that's it.

So I use ansistrings throughout.

And hence, resourcestrings being stored in unicode format would cause totally 
unnecessary conversions.

Michael.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Martin Schreiber

On Saturday 05 January 2013 11:30:42 Jonas Maebe wrote:
[...]

 For example, I said that basically nothing changed in 2.7.x compared to
 2.6.x, except that certain string constants are no longer automatically
 converted to utf-16 at compile time, and then you ask "Or should we not
 touch the theme strings and FPC anymore?". Since basically nothing changed
 except for a few less blind auto-conversions at compile time, why should
 you no longer be able to touch anything anymore?

 Let me repeat: your string constants will be parsed by the compiler into
 character sequences with exactly the same content in both 2.6.x and 2.7.x
 (and with content I mean that if they would be converted to the same code
 page in 2.6.x and in 2.7.x, you would end up with exactly the same binary
 data). Whether or not they contain character literals whose value is > #127
 in the source code's code page, or explicit #xx, #xxx etc expressions
 has no influence, nothing changed in the compiler in that account.

 The *only* difference is that the compiler can now internally represent
 ansistrings with arbitrary code pages, and as a result the aforementioned
 character sequences may now be stored internally in the compiler in a
 different format, and also stored in the program in a different format if
 that can avoid conversions at run time. In particular, character sequences
 are no longer all converted immediately/by default/under all circumstances
 to UTF-16 in case characters > #127 need to be interpreted according to a
 particular code page (i.e., if a {$codepage xxx} directive is present).

 The compiler will now only convert such character sequences to UTF-16,
 still at compile time (just like it was able to do in 2.6.x), if it is
 actually assigned to an UTF-16-encoded string, passed to an UTF-16
 parameter etc. And the compiler will also convert it to another ansistring
 code page in case the character sequence appeared in e.g. a file with
 {$codepage utf-8} and is then assigned to a variable whose type is declared
 as type ansistring(850).

Thank you very much for the detailed explanation. What I could not find in 
all the answers (probably it is my ignorance of the English language) is: 
does #n mean a utf16 code unit as in Delphi XE3, or does it denote something 
else? You write:

 Whether or not they contain character literals whose value is > #127
 in the source code's code page, or explicit #xx, #xxx etc expressions
 has no influence, nothing changed in the compiler in that account.

Assume {$codepage utf-8} how should we enter Russian character constants in #n 
form? How should we enter Russian character constants in #n form if 
{$codepage 8859-5} is defined?
And again, sorry for the impertinence, how do resource strings fit in the 
string handling scenario of Free Pascal trunk?

Martin

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 12:12, Martin Schreiber wrote:

 Thank you very much for the detailed explanation. What I could not found in 
 all the answers (probably it is my ignorance of the English language), is, 
 does #n mean a utf16 code unit as in Delphi XE3 or does it denote something 
 other?

It was not in the explanation, because it is something that did not change 
between 2.6.x and 2.7.x. Whatever you use in 2.6.x will still work in exactly 
the same way in 2.7.x. The Delphi XE3 behaviour may be added to the {$mode 
delphiunicode} syntax mode, but has not yet been implemented and will never be 
applied to existing syntax modes.

 Assume {$codepage utf-8} how should we enter Russian character constants in 
 #n 
 form?

Using whatever #xx#xx or #xx#xx#xx sequence represents the UTF-8 encoding of 
that character.

 How should we enter Russian character constants in #n form if 
 {$codepage 8859-5} is defined?

Using whatever #xx represents that character in code page 8859-5.

Alternatively, in both cases you can instead define a unicodestring/widestring 
constant instead of an ansistring/shortstring constant by embedding widechar 
constants in the character sequence. Such widechar constants are of the form 
#number with number a valid Pascal representation of an integer constant 
between 255 and 65535. Then you can use those widechars to represent the 
desired characters as UTF-16 code points. In that case, the entire string will 
however be parsed as a sequence of UTF-16 code points (because a string is 
either a sequence of ansichars, or a sequence of widechars; it can never be a 
mixture of the two), and hence also #1 or #128 appearing in a widestring will 
be parsed as widechar(#1) and widechar(#128) as opposed to being interpreted 
according to the current codepage setting. 
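
To make the three spellings concrete, here is a sketch for one Russian
letter, the capital Zhe (U+0416). The constant names are invented for this
example, and it assumes no {$codepage} directive, so the #xx values are
taken as plain bytes. The byte values are those of UTF-8 ($D0 $96) and of
ISO 8859-5 ($B6).

```pascal
program ZheThreeWays;
{$mode objfpc}{$H+}

const
  ZheUtf8: AnsiString    = #$D0#$96;   { UTF-8 bytes of U+0416 }
  Zhe8859: AnsiString    = #$B6;       { ISO 8859-5 byte for the same letter }
  ZheWide: UnicodeString = #$0416;     { widechar constant, one UTF-16 unit }

begin
  WriteLn(Length(ZheUtf8), ' ', Length(Zhe8859), ' ', Length(ZheWide));
  { Only an explicit, encoding-aware step relates the forms to each other. }
  WriteLn(UTF8Decode(ZheUtf8) = ZheWide);
end.
```

The first two constants are ordinary 8-bit strings whose meaning depends on
the code page they are read with; only the third is a widestring constant
in the sense described above.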

 And again, sorry for the impertinence, how do resource strings fit in the 
 string handling scenario of Free Pascal trunk?

Unicode support for resourcestrings is still not available in FPC trunk. They 
can currently still only be used safely for ASCII content.


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Sven Barth


On 05.01.2013 12:28, Jonas Maebe wrote:

And again, sorry for the impertinence, how do resource strings fit in the
string handling scenario of Free Pascal trunk?


Unicode support for resourcestrings is still not available in FPC trunk. They 
can currently still only be used safely for ASCII content.


What about UTF8 content?

Regards,
Sven

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 12:36, Sven Barth wrote:

 On 05.01.2013 12:28, Jonas Maebe wrote:
 And again, sorry for the impertinence, how do resource strings fit in the
 string handling scenario of Free Pascal trunk?
 
 Unicode support for resourcestrings is still not available in FPC trunk. 
 They can currently still only be used safely for ASCII content.
 
 What about UTF8 content?

You can put anything in it and it may or may not work depending on the current 
system code page, but afaik the only thing that is guaranteed to work at this 
time is plain ASCII.


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Martin Schreiber

On Saturday 05 January 2013 12:28:03 Jonas Maebe wrote:

 Alternatively, in both cases you can instead define a
 unicodestring/widestring constant instead of an ansistring/shortstring
 constant by embedding widechar constants in the character sequence. Such
 widechar constants are of the form #number with number a valid Pascal
 representation of an integer constant between 255 and 65535. Then you can
 use those widechars to represent the desired characters as UTF-16 code
 points. In that case, the entire string will however be parsed as a
 sequence of UTF-16 code points (because a string is either a sequence of
 ansichars, or a sequence of widechars; it can never be a mixture of the
 two), and hence also #1 or #128 appearing in a widestring will be parsed as
 widechar(#1) and widechar(#128) as opposed to being interpreted according
 to the current codepage setting.

So compiled with -Fcutf8

unicodestringvar:= 'Best'#228'tigung';

produces a different result on fixes_2_6 and trunk? I assume in trunk there 
will be a compile error? We use this form of character constants in MSEgui to 
have the sources in pure ASCII.

Martin

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Paul Ishenin


05.01.13, 19:40, Jonas Maebe пишет:


You can put anything in it and it may or may not work depending on the current 
system code page, but afaik the only thing that is guaranteed to work at this 
time is plain ASCII.


ResourceStrings are stored as AnsiString type with 0 codepage (as I 
remember). Delphi now stores ResourceStrings as UnicodeString type. I 
think FPC will follow this in m_default_unicodestring modeswitch.


Best regards,
Paul Ishenin

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 12:53, Martin Schreiber wrote:

 So compiled with -Fcutf8
 
 unicodestringvar:= 'Best'#228'tigung';
 
 produces a different result on fixes_2_6 and trunk? I assume in trunk there 
 will be a compile error?

No. In both cases it results in a widestring with this content:

.short  66,101,115,116,228,116,105,103,117,110,103,0

I guess invalid utf-8 values are just copied through by the compiler. As 
mentioned: absolutely nothing whatsoever changed in how character sequences are 
interpreted by the compiler in 2.7.x. The explanation you quoted above (and 
which I deleted) applies to both 2.6.x and 2.7.x. I really don't know how I can 
say this in another way, and repeating it clearly doesn't help.
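
For anyone who wants to run this kind of check themselves, a small test
program along the following lines could be compiled with both compiler
versions and with or without -Fcutf8. CodeUnits is a hypothetical helper.
The output is deliberately not stated here, since it is exactly the thing
under test; the .short line above is the result reported for -Fcutf8.

```pascal
program DumpUnits;
{$mode objfpc}{$H+}

{ Comma-separated decimal values of the 16-bit units in u. }
function CodeUnits(const u: UnicodeString): AnsiString;
var
  i: Integer;
  num: ShortString;
begin
  Result := '';
  for i := 1 to Length(u) do
  begin
    Str(Ord(u[i]), num);
    if i > 1 then
      Result := Result + ',';
    Result := Result + num;
  end;
end;

var
  u: UnicodeString;
begin
  u := 'Best'#228'tigung';
  WriteLn(CodeUnits(u));
end.
```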

I think it's best if you compile trunk for yourself and test as many scenarios 
as you can, because I feel I cannot add anything further to the discussion, and 
I'm not interested in playing compile bot.


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 12:53, Paul Ishenin wrote:

 ResourceStrings are stored as AnsiString type with 0 codepage (as I 
 remember). Delphi now stores ResourceStrings as UnicodeString type. I think 
 FPC will follow this in m_default_unicodestring modeswitch.

It would probably even be better to always do that. At least I don't see a 
downside, other than slightly larger binaries (and that's not an issue in this 
case as far as I'm concerned; maintaining two separate resourcestring 
systems/handlers is just not worth the trouble).


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)




On Sat, 5 Jan 2013, Jonas Maebe wrote:

> It would probably even be better to always do that. At least I don't see a
> downside, other than slightly larger binaries (and that's not an issue in
> this case as far as I'm concerned; maintaining two separate resourcestring
> systems/handlers is just not worth the trouble).


But it means that for

Resourcestring
  AString = 'Something';

Var
  S : Ansistring;

begin
  S:=AString;
end.

Always a conversion will happen.

I do not think this is a good idea given that currently, String = Ansistring.

Michael.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 13:10, Michael Van Canneyt wrote:

> But it means that for
>
> Resourcestring
>   AString = 'Something';
>
> Var
>   S : Ansistring;
>
> begin
>   S:=AString;
> end.
>
> Always a conversion will happen.
>
> I do not think this is a good idea given that currently, String = Ansistring.

String will always be shortstring or ansistring in the syntax modes in which 
that is currently the case. And yes, it will involve a conversion in that case. 
Just like every single constant string assignment to an ansistring in 2.6.x in 
case the constant string contains non-ASCII characters and is part of a 
{$codepage xxx} file (because those strings are all stored as unicodestring in 
the program there).

Then again, it will also involve a conversion if the implementation using 
ansistrings is fixed to support non-ASCII resourcestrings and the system 
codepage is different from the code page in which the resource string has been 
stored by the compiler. In fact, it will then cause two conversions on most 
systems (few systems can directly transcode from arbitrary code page X to 
arbitrary code page Y; most use UTF-16 as intermediate format, although some 
can probably also use UTF-8).
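
The double conversion described above can be written out by hand in a small sketch. It assumes a compiler with code-page-aware ansistrings (trunk/2.7.x or later) and a source file saved as UTF-8. The type names TCp1252String and TCp1250String and the choice of code pages are hypothetical, picked only for illustration. Normally the RTL hides the intermediate step.

```pascal
program TwoStepTranscode;
{$mode objfpc}{$H+}
{$codepage utf8}  // assumption: this source file is saved as UTF-8

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for conversions on Unix
  SysUtils;

type
  // Hypothetical names; any two single-byte code pages would do.
  TCp1252String = type AnsiString(1252);
  TCp1250String = type AnsiString(1250);

var
  Source: TCp1252String;
  Intermediate: UnicodeString;
  Target: TCp1250String;
begin
  Source := 'Bestätigung';                // held in code page 1252
  Intermediate := UnicodeString(Source);  // conversion 1: 1252 -> UTF-16
  Target := TCp1250String(Intermediate);  // conversion 2: UTF-16 -> 1250

  // Show the length of each representation (bytes, code units, bytes).
  WriteLn(Length(Source), ' ', Length(Intermediate), ' ', Length(Target));
end.
```

A resourcestring stored in one ansi code page and read on a system with another would go through both steps on every access, which is the cost being weighed here.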

Yes, the exception is probably UTF-8 on Unix systems, but is that really worth 
it to complicate the compiler and RTL? Resourcestrings are generally not used in 
performance-critical code, I'd assume. Always using UTF-8 is however also fine 
for me, btw. I just don't believe it is worth the trouble to support both 
unicodestring and ansistring resourcestrings.


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Martin Schreiber

On Saturday 05 January 2013 12:57:44 Jonas Maebe wrote:
> On 05 Jan 2013, at 12:53, Martin Schreiber wrote:
>> So compiled with -Fcutf8
>>
>> unicodestringvar:= 'Best'#228'tigung';
>>
>> produces a different result on fixes_2_6 and trunk? I assume in trunk
>> there will be a compile error?
>
> No. In both cases it results in a widestring with this content:
>
> .short  66,101,115,116,228,116,105,103,117,110,103,0
>
> I guess invalid utf-8 values are just copied through by the compiler. As
> mentioned: absolutely nothing whatsoever changed in how character sequences
> are interpreted by the compiler in 2.7.x. The explanation you quoted above
> (and which I deleted) applies to both 2.6.x and 2.7.x. I really don't know
> how I can say this in another way, and repeating it clearly doesn't help.
>
> I think it's best if you compile trunk for yourself and test as many
> scenarios as you can, because I feel I cannot add anything further to the
> discussion, and I'm not interested in playing compile bot.

Then it was a misunderstanding again because I read

"Alternatively, in both cases you can instead define a unicodestring/widestring 
constant instead of an ansistring/shortstring constant by embedding widechar 
constants in the character sequence. Such widechar constants are of the form 
#number with number a valid Pascal representation of an integer constant 
between 255 and 65535."

and

"Whether or not they contain character literals whose value is > #127 in the 
source code's code page, or explicit #xx, #xxx etc expressions has no 
influence, nothing changed in the compiler in that account."

and

"I have no idea how anything I wrote suggests that it wouldn't. As mentioned, 
the only difference is that string constants containing characters > #127 are 
no longer always converted to unicodestring constants at compile time."

--> #255 <> #127, and the question arose how one can define widechar 
constants for strings without a character value > 255.

Martin


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)


On 05 Jan 2013, at 13:33, Martin Schreiber wrote:

> Then it was a misunderstanding again

No, it was simply an omission in my explanation. As mentioned above: I guess 
invalid utf-8 values are just copied through by the compiler. It's a special 
case, but the special case is the same in 2.6.x and 2.7.x (2.6.x converts the 
UTF-8 string to UTF-16 immediately in the scanner, while 2.7.x does it while 
processing the assignment; the actual conversion code that's used is however 
exactly the same). The fact that everything remains 100% the same in all cases 
everywhere always between 2.6.x and 2.7.x has been mentioned at least 10 times 
in this thread, and that's what I keep trying to make clear. But I give up.


Jonas

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Jy V

> Yes, the exception is probably UTF-8 on Unix systems, but is that really
> worth it to complicate the compiler and RTL? Resourcestrings are generally
> not used in performance-critical code, I'd assume. Always using UTF-8 is
> however also fine for me,

I do vote for UTF-8

> btw. I just don't believe it is worth the trouble to support both
> unicodestring and ansistring resourcestrings.

I agree.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

On Sat, 5 Jan 2013, Jonas Maebe wrote:

> String will always be shortstring or ansistring in the syntax modes in which
> that is currently the case. And yes, it will involve a conversion in that case.
> Just like every single constant string assignment to an ansistring in 2.6.x in
> case the constant string contains non-ASCII characters and is part of a
> {$codepage xxx} file (because those strings are all stored as unicodestring in
> the program there).

Judging by all the code that I have written during 14 years, there would never
be a single conversion necessary.
This system would force them on me for every single use.

I do not think that the support of both ansi/unicode string resources is such a
burden that it justifies that.

I admittedly have limited knowledge of compiler internals, but I cannot imagine that being able to store them
in 2 formats (ansi and some form of unicode) is more than a matter of maintaining 1 flag per string, and writing
a word instead of a byte.

All the other code, needed for conversions depending on codepage and whatnot
settings, is necessary anyway.

Michael.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Sven Barth


On 05.01.2013 14:16, Michael Van Canneyt wrote:

> Judging by all the code that I have written during 14 years, there would
> never be a single conversion necessary.
> This system would force them on me for every single use.
>
> I do not think that the support of both ansi/unicode string resources is
> such a burden that it justifies that.
>
> I admittedly have limited knowledge of compiler internals, but I cannot
> imagine that being able to store them in 2 formats (ansi and some form
> of unicode) is more than a matter of maintaining 1 flag per string, and
> writing a word instead of a byte.
>
> All the other code, needed for conversions depending on codepage and
> whatnot settings, is necessary anyway.


You forget also the code necessary to translate resourcestrings (at 
runtime). Currently the ResourceString related code inside 
rtl/objpas/objpas.pp only handles AnsiString and then this would need to 
be adjusted so that UnicodeString can also be handled. For example there 
will be the need for a SetResourceStrings overload with a 
UnicodeString based TResourceIterator.
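
To show where the AnsiString dependency sits, here is a minimal sketch of the existing iterator-based interface. The callback signature is written from memory of the objpas unit and should be checked against rtl/objpas/objpas.pp for the compiler version in use. The program and function names are hypothetical.

```pascal
program ResStrIterate;
{$mode objfpc}{$H+}

uses
  SysUtils;

resourcestring
  AString = 'Something';

// Callback in the AnsiString-based TResourceIterator shape: it receives the
// name, current value and hash of each resourcestring and returns the value
// that should replace it.
function ToUpper(Name, Value: AnsiString; Hash: LongInt;
  Arg: Pointer): AnsiString;
begin
  Result := UpperCase(Value);
end;

begin
  // Run the callback over every resourcestring in the program.
  SetResourceStrings(@ToUpper, nil);
  WriteLn(AString);
end.
```

A UnicodeString-based variant would need a second callback type whose parameters and result are UnicodeString, which is the overload mentioned above.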


Regards,
Sven

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)




On Sat, 5 Jan 2013, Sven Barth wrote:

> You forget also the code necessary to translate resourcestrings (at runtime).
> Currently the ResourceString related code inside rtl/objpas/objpas.pp only
> handles AnsiString and then this would need to be adjusted so that
> UnicodeString can also be handled. For example there will be the need for a
> SetResourceStrings overload with a UnicodeString based TResourceIterator.


No, I had thought of that. 
It will need to be changed anyhow, and falls under "is necessary anyway", 
since we'll need some kind of backwards-compatibility mechanism.


Michael.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Hans-Peter Diettrich


Michael Van Canneyt wrote:

> But it means that for
>
> Resourcestring
>   AString = 'Something';
>
> Var
>   S : Ansistring;
>
> begin
>   S:=AString;
> end.
>
> Always a conversion will happen.
>
> I do not think this is a good idea given that currently, String =
> Ansistring.


IMO resource strings are for display purposes, so that UTF-8/16 encoding 
is expected by an OS API. AFAIR Win32 string resources are stored in 
UTF-16, so that assignments to an AnsiString already require a 
conversion. So IMO UTF-8 would be better, for now and in future.


DoDi


Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)




On Sat, 5 Jan 2013, Hans-Peter Diettrich wrote:

> IMO resource strings are for display purposes, so that UTF-8/16 encoding is
> expected by an OS API.  AFAIR Win32 string resources are stored in UTF-16,


You are very much wrong.

To start with, resource strings are not stored as Win32 resources. 
Secondly, they are stored in the code as an ansistring.


The resource string of the above example is stored as:

.globl  _$PROGRAM$_Ld2
_$PROGRAM$_Ld2:
.ascii  "Something\000"
.balign 8
.short  0,1
.long   0
.quad   -1,15
.globl  _$PROGRAM$_Ld3

Thirdly: in my apps, no UTF-8/16 encoding is expected by the OS. 
If it were, I would have used widestrings instead of ansistring 
to begin with, and in that case I would not have made any remark...


So the conversion really would be 100% totally redundant.

Michael.

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)

2013-01-05 Thread Paul Ishenin


05.01.13, 23:54, Michael Van Canneyt wrote:

> You are very much wrong.
>
> To start with, resource strings are not stored as Win32 resources.


I personally think that resources should be stored in their native 
formats where possible. This would allow changing them with software 
designed for that environment. For example, for Windows there are many 
resource editors which can replace icons, bitmaps and string resources 
too. It would be nice to have this ability for the binaries that FPC 
produces as well. On OS X resources are also stored differently from what 
FPC currently does: as far as I know they are stored in application 
bundles, so they can be edited by external programs too.


Best regards,
Paul Ishenin

Re: [fpc-devel] String handling in trunk (was utf8 in 2.6.0)