Re: [fpc-devel] assign constant text to widestring

2008-10-24 Thread Michael Schnell


UpperCase, LowerCase, CapitalCase, WordBreak, ParagraphBreak, ...
almost all have some language exceptions.
  
I don't doubt that you are right here, but I don't think that there is 
any support for this in the RTL. So it seems to be a lot less relevant 
than general Unicode handling.


So I think we should first have decent Unicode support: assigning
string constants to WideStrings correctly, independent of the encoding
the source file is stored in (maybe this is a Lazarus bug and FPC can't
help it, as it gets called erroneously), and correct automatic
conversions when assigning a UTF8String to a WideString and vice versa.


-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] assign constant text to widestring

2008-10-24 Thread Jonas Maebe


On 24 Oct 2008, at 01:46, Felipe Monteiro de Carvalho wrote:


I agree with Daniël on this one. Simplify. ë -> Ë always

If you need something which takes into consideration the language then
build another routine with more parameters.


UpperCase and LowerCase are mapped to OS routines which do take into
account the current locale (at least on *nix). So unless those
routines always do ë -> Ë regardless of the locale, you will not get
this behaviour.



Jonas


Re: [fpc-devel] assign constant text to widestring

2008-10-24 Thread Michael Schnell



Last time I checked it on Windows no UpperCase is performed for
WideString for codepoints > 127, maybe it has been changed recently (1
month).

With which software ?

In my tests, if I create the WideString correctly (which has to be
done with an explicit call to a conversion from UTF8String),
uppercase works for the German umlauts äöü.
With Turbo Delphi it works easily, as WideStrings are correctly
converted from AnsiStrings automatically.


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-24 Thread Jonas Maebe


On 24 Oct 2008, at 13:59, Michael Schnell wrote:


Last time I checked it on Windows no UpperCase is performed for
WideString for codepoints > 127, maybe it has been changed recently
(1 month).

With which software ?

In my tests, if I create the WideString correctly (which has to be
done with an explicit call to a conversion from UTF8String),
uppercase works for the German umlauts äöü.
With Turbo Delphi it works easily, as WideStrings are correctly
converted from AnsiStrings automatically.


They are in FPC as well, at least if your ansistrings contain
ansi-encoded strings. I think you are mixing Lazarus and FPC in the
above (because Lazarus puts utf-8 encoded stuff in ansistrings).



Jonas


Re: [fpc-devel] assign constant text to widestring

2008-10-24 Thread Michael Schnell




They are in FPC as well, at least if your ansistrings contain
ansi-encoded strings. I think you are mixing Lazarus and FPC in the
above (because Lazarus puts utf-8 encoded stuff in ansistrings).

Sorry, I indeed forgot to write that I am using Lazarus. Here Unicode
constants seem not to be WideStrings (as they seemingly are with pure
FPC) but UTF8Strings, and this causes the problem.


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Florian Klaempfl
Daniël Mantione schrieb:
 The issue might be the UCS-2 encoding of your source, perhaps try to
 feed the compiler UTF-8, I didn't even know the compiler accepts UCS-2,
 it may not work correctly.


The compiler definitively eats no ucs-2 encoded sources.


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Florian Klaempfl
Michael Schnell schrieb:
 
 A decent system should be able to do the necessary conversions
 automatically:

This is a simplified view which ignores the waste of resources of this
approach, which is not visible in the academic example below. The
conversion between UTF-8 and UTF-16 is a very expensive operation, the
compiler would have to insert it all over the place, and people would
cry about the performance of their programs.


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell




The compiler definitively eats no ucs-2 encoded sources.
  
I did check several times: my source file looks like this when I open it
with UltraEdit and tell it to show the file in hex:

FF FE 75 00 6E 00 69 00 74 00 20 00 55 00 6E 00 ..u.n.i.t. .U.n.

Now I created a Delphi program and read the file with TFileStream.

There I found UTF-8 encoded data without a BOM.

So Windows seems to play some nasty tricks on us. Supposedly FPC reads 
the file in a similar way as TFileStream and thus the compiler in fact 
sees utf8.


All this is really nasty stuff !!!

-Michael



Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Daniël Mantione



Op Thu, 23 Oct 2008, schreef Michael Schnell:





The compiler definitively eats no ucs-2 encoded sources.

I did check several times: My source file looks like this when I open it with 
Ultra-Edit and tell to show it in Hex:

FF FE 75 00 6E 00 69 00 74 00 20 00 55 00 6E 00 ..u.n.i.t. .U.n.

Now I created a Delphi program and read the file with TFileStream.

Now I found a utf-8 coded information without a BOM.

So Windows seems to play some nasty tricks on us. Supposedly FPC reads the 
file in a similar way as TFileStream and thus the compiler in fact sees utf8.


The compiler uses blockread and performs no such conversions.

Daniël


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell




As has been said before: the compiler itself simply does not support
UCS-2. Regardless of any BOM, compiler setting or Lazarus setting, it
will not understand it.

See my other post in this thread: Windows XP seems to play some tricks
on us here so that UltraEdit sees the UCS2 coded file while the compiler
gets fed with utf8.


Don't think that I understand why/how this happens.

-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



The conversion
utf-8-utf-16 is a very expensive operation and the compiler has to
insert it all over the place and people would cry about the performance
of their programs.

Of course I do agree.

If you want to care about performance you need to know what to do: 
Either use WideString all over the place and beware of the LCL API, or 
use UTF8String  all over the place.


But if you use UTF8String you need to be aware that you can't do simple
and totally normal things like s := copy(s, 1, 3); to get the first three
characters of a string. Really finding the first three characters of a
string is an interesting and time-consuming task with UTF-8 ;) .
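
The point about byte-indexed copying can be illustrated with a small sketch. It is written in Python rather than Pascal, with a `bytes` value standing in for a byte-indexed UTF8String; the helper name `first_chars_utf8` is made up for this illustration and is not an RTL or LCL routine. The sample text is written with escapes so the result does not depend on the encoding of the source file.

```python
# Illustration only: Python bytes stand in for a byte-indexed UTF8String.
text = "\u00e4\u00f6\u00fcxyz"        # "äöüxyz"
utf8 = text.encode("utf-8")           # each umlaut takes 2 bytes, each ASCII letter 1

# A naive byte-based copy of "3 characters" cuts the second umlaut in half.
naive = utf8[:3]

def first_chars_utf8(data: bytes, count: int) -> bytes:
    """Return the bytes of the first `count` code points of valid UTF-8 data."""
    end = 0
    seen = 0
    while end < len(data) and seen < count:
        end += 1  # lead byte of the next code point
        # Skip continuation bytes (binary 10xxxxxx) belonging to this code point.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        seen += 1
    return data[:end]

first_three = first_chars_utf8(utf8, 3)
print(len(naive), len(first_three))   # byte lengths differ
print(first_three.decode("utf-8"))
```

The scan is linear in the number of bytes, which is the cost referred to above. It counts code points only, so a decomposed character (base letter plus combining mark) would still count as two.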


That is why I feel that it would be a lot better if  the LCL would use a 
WideString API.


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Vincent Snijders

Michael Schnell schreef:



The conversion
utf-8-utf-16 is a very expensive operation and the compiler has to
insert it all over the place and people would cry about the performance
of their programs.

Of course I do agree.

If you want to care about performance you need to know what to do: 
Either use WideString all over the place and beware of the LCL API, or 
use UTF8String  all over the place.


But if you use UTF8String you need to be aware that you can't do simple 
and totally normal things like s := copy(s, 1, 3); to get the first three 
characters of a string. Really finding the first three characters of a 
string is an interesting and time consuming task with utf8 ;) .


That is why I feel that it would be a lot better if  the LCL would use a 
WideString API.


If you want widestring, then maybe mseide is a better option for you.

Vincent


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Florian Klaempfl
Michael Schnell schrieb:
 
 The conversion
 utf-8-utf-16 is a very expensive operation and the compiler has to
 insert it all over the place and people would cry about the performance
 of their programs.
 Of course I do agree.
 
 If you want to care about performance you need to know what to do:
 Either use WideString all over the place and beware of the LCL API, or
 use UTF8String  all over the place.
 
 But if you use UTF8String you need to be aware that you can't do simple
 and totally normal things like s := copy(s, 1, 3); to get the first three
 characters of a string. Really finding the first three characters of a
 string is an interesting and time consuming task with utf8 ;) .

This is also a simplified view.
- Firstly, which real-world (!) task really needs an operation like
this? Mostly it's something like copy(s,pos(...),...);
- Secondly, a properly coded UTF-16 application shouldn't do this
either: it doesn't handle surrogates properly, and e.g. umlauts can be
encoded in all UTF flavours as two chars: base letter plus the umlaut
(the two dots).
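
Both cases can be made concrete with a short sketch. It is in Python, used here only as a neutral way to count UTF-16 code units; `utf16_units` is a helper invented for this illustration.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2

composed = "\u00e4"       # a with diaeresis as one precomposed code point
decomposed = "a\u0308"    # base letter a + COMBINING DIAERESIS
clef = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_units(composed))    # one 16-bit unit
print(utf16_units(decomposed))  # two units for one visible character
print(utf16_units(clef))        # two units: a surrogate pair

# Taking only the first 16-bit unit of the clef leaves a lone high surrogate.
first_unit = int.from_bytes(clef.encode("utf-16-le")[:2], "little")
print(hex(first_unit))
```

In both the decomposed and the surrogate case, copying a fixed number of 16-bit units can split what the user sees as a single character.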


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



utf-16 application shouldn't do this
either: it doesn't handle surrogates properly

Right you are. For me WideString is UCS2 and not UTF16, as I regard it
as a sequence of WideChar so that the Unicode user code can be done
using WideChar and WideString. WideChar only has 16 bits. So this
restricts us to Unicode characters up to $FFFF.


I doubt that I will ever need to use Unicode characters above $FFFF, but
of course there _are_ other projects.


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Marco van de Voort
In our previous episode, Florian Klaempfl said:
  But if you use UTF8String you need to be aware that you can't do simple
  and totally normal things like s := copy(s, 1, 3); to get the first three
  characters of a string. Really finding the first three characters of a
  string is an interesting and time consuming task with utf8 ;) .
 
 This is also a simplified view.
 - firstly, which real world (!) task really requires to execute an
 operation like this, mostly it's something like copy(s,pos(...),...);
 - secondly, a properly coded utf-16 application shouldn't do this
 either: it doesn't handle surrogates properly and e.g. umlauts can be
 encoded in all utf flavours as two chars: base letter plus the umlaut
 (the two dots).

More importantly, most such routines will be implicitly tied to a
certain language or language group already.

The idea that UCS2 simply expands the character range, and the rest stays
the same, is naive.


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell


Ultraedit might fool you here. It edits either ansi or ucs2. If you
have a utf8 encoded file, it will show the contents in hex as being ucs2.

That might be. But would it even virtually insert a BOM ?!?!?!? Why
should it do this when using the hex editor ?


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



More importantly, most such routines will be implicitly tied to a
certain language or language group already.
  
Which kind of UCS2 based function do you think are tied to a 
language(group) ?


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Jonas Maebe


On 23 Oct 2008, at 13:41, Michael Schnell wrote:


utf-16 application shouldn't do this
either: it doesn't handle surrogates properly

Right you are. For me WideString is UCS2 and not UTF16, as I regard
it as a sequence of WideChar so that the Unicode user code can be
done using WideChar and WideString. WideChar only has 16 bits. So
this restricts us to Unicode characters up to $FFFF.


I doubt that I will ever need to use Unicode characters above $FFFF, but
of course there _are_ other projects.


I doubt that you will never need to support decomposed characters  
(such as ä being encoded as basically a¨). It's not that uncommon.



Jonas


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Florian Klaempfl
Michael Schnell schrieb:
 
 More importantly, most such routines will be implicitly tied to a
 certain language or language group already.
   
 Which kind of UCS2 based function do you think are tied to a
 language(group) ?

Bidi stuff? You are aware of the fact that unicode strings can contain
e.g. bidi markers?


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Martin Schreiber
On Thursday 23 October 2008 13.31:30 Florian Klaempfl wrote:

 This is also a simplified view.
 - firstly, which real world (!) task really requires to execute an
 operation like this, mostly it's something like copy(s,pos(...),...);
 - secondly, a properly coded utf-16 application shouldn't do this
 either: it doesn't handle surrogates properly and e.g. umlauts can be
 encoded in all utf flavours as two chars: base letter plus the umlaut
 (the two dots).

One should normalize Unicode text before processing. If it is normalized
to fully composed form there will be no problems with UCS2 single character
processing in Western Europe. The GUI kit should return fully composed
characters whenever possible to simplify the user's life.
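
As a rough sketch of what normalizing to fully composed form means, the following uses Python's standard `unicodedata` module (not an FPC or MSEgui routine) to convert between the decomposed and the composed spelling of ä.

```python
import unicodedata

decomposed = "a\u0308"   # base letter a + COMBINING DIAERESIS

# NFC composes base letter and combining mark into one code point where possible.
nfc = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits the precomposed letter again.
nfd = unicodedata.normalize("NFD", nfc)

print(len(decomposed), len(nfc), len(nfd))
print(nfc == "\u00e4")
```

After NFC, code that treats one 16-bit unit as one character sees a single unit for ä. This only helps for characters that have a precomposed form in Unicode.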

Martin


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Marc Weustink

Michael Schnell wrote:


Ultraedit might fool you here. It edits either ansi or ucs2. If you
have a utf8 encoded file, it will show the contents in hex as being ucs2.

That might be. But would it even virtually insert a BOM ?!?!?!? Why
should it do this when using the hex editor ?


Since it converts the UTF8 file internally to UCS2 on read before editing.

Marc



Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



Bidi stuff? You are aware of the fact that unicode strings can contain
e.g. bidi markers?

Sorry, never heard of bidi :(

-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Florian Klaempfl
Michael Schnell schrieb:
 
 Bidi stuff? You are aware of the fact that unicode strings can contain
 e.g. bidi markers?
 Sorry, never heard of bidi :(
 

http://www.unicode.org/reports/tr9/


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



If you want widestring, then maybe mseide is a better option for you.

Again, I do know this, and in fact I don't have a project that needs
Unicode. But the reason why I started this thread is to help make
Lazarus / FPC even more useful.


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell


Since it converts the UTF8 file internally to UCS2 on read before 
editing.

Seems really silly to me.

But the file length really indicated that it's utf8 coded and when 
looking at the file with WinCommander's hex viewer it's utf-8. So I 
suppose that you are right and the nasty trick is Ultraedit's.
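
One way to take the editor out of the picture is to look at the raw bytes directly. The sketch below is in Python and uses the BOM constants from the standard `codecs` module; the function name `sniff_bom` is made up for this illustration, and UTF-32 BOMs are deliberately ignored. The sample bytes are the ones shown in the UltraEdit hex view earlier in the thread.

```python
import codecs

def sniff_bom(raw: bytes) -> str:
    """Classify the start of a file by its byte order mark (UTF-32 ignored)."""
    if raw.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8 with BOM"
    if raw.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    return "no BOM"

# What the hex view displayed: FF FE followed by 16-bit little-endian units.
shown = bytes.fromhex("FF FE 75 00 6E 00 69 00 74 00 20 00 55 00 6E 00")
print(sniff_bom(shown), repr(shown.decode("utf-16")))

# What a BOM-less UTF-8 file with the same text contains on disk.
on_disk = "unit Un".encode("utf-8")
print(sniff_bom(on_disk), len(on_disk))
```

For a real file, the same check could be applied to the first few bytes read in binary mode, for example `open(path, "rb").read(4)`, where `path` is whatever source file is under suspicion.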


-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Martin Schreiber
On Thursday 23 October 2008 13.58:04 Michael Schnell wrote:
  Bidi stuff? You are aware of the fact that unicode strings can contain
  e.g. bidi markers?

 Sorry, never heard of bidi :(

Bidirectional text. Much more important than the hypothetical codepoints above 
the BMP. MSEgui does not support bidi BTW, too difficult. ;-)
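
For readers who, like Michael, have not met bidi before: every Unicode character carries a bidirectional class, and some characters exist only to steer text direction. A small sketch in Python, using the standard `unicodedata` module, lists the class of a few sample characters (the selection is arbitrary).

```python
import unicodedata

samples = {
    "LATIN CAPITAL A": "A",
    "HEBREW ALEF": "\u05d0",
    "ARABIC ALEF": "\u0627",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "RIGHT-TO-LEFT OVERRIDE": "\u202e",
}

# Map each sample to its bidirectional class as defined by the Unicode database.
classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, bidi_class in classes.items():
    print(f"{name}: {bidi_class}")
```

The last two entries are invisible control characters, which is why a string routine that only thinks in terms of letters can be surprised by them.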

Martin


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell




I doubt that you will never need to support decomposed characters 
(such as ä being encoded as basically a¨). It's not that uncommon.

This is the nasty old stuff Unicode should be useful to get rid of

-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Marc Weustink

Michael Schnell wrote:


Since it converts the UTF8 file internally to UCS2 on read before 
editing.

Seems really silly to me.


No, it's not. This way you only have to support two editors internally:
one with bytechars and one with wordchars (ignoring surrogates and other
stuff).


But the file length really indicated that it's utf8 coded and when 
looking at the file with WinCommander's hex viewer it's utf-8. So I 
suppose that you are right and the nasty trick is Ultraedit's.


Yes, since I find auto conversion by the OS very unlikely.

(yes I once tripped over this with ultraedit too)

Marc


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Michael Schnell



http://www.unicode.org/reports/tr9/
  
Thanks. I see. (In fact I even did do embedded software for a display 
that can show Hebrew text. But this was with ANSI code.)


-Michael


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread listmember

 DM  Example: In Dutch uppercase characters generally do not get
 tremas: Daniël becomes DANIEL. Should an uppercase routine worry?
 No, this is a spelling convention, the correct uppercase of ë is
 Ë, we should not confuse spelling with uppercasing.

No. This is not a spelling convention. It is a rule dictated by the 
language the word is written in.


If the word Daniël is Dutch, then its uppercase is:

UpperCase(Daniël, langDutch) -> DANIEL

Fine.

Yet, if we don't know what language it is written in, then the uppercase is:

UpperCase(Daniël, langUndefined) -> DANIËL

Now.. as I don't know Dutch at all, I wonder what the LowerCase 
transforms would be for the same uppercased word, DANIEL


LowerCase(DANIEL, langDutch) -> daniel

or,

LowerCase(DANIEL, langDutch) -> daniël

or both?

If both, how do you pick the correct one?

 Example also: in Spanish sólo is different from SOLO and the meaning
is different (alone vs. only).


Yes, it is imperative that we know the language the word is in, so that

UpperCase(sólo, langSpanish) -> SÓLO
UpperCase(solo, langSpanish) -> SOLO

Otherwise, we may end up altering the meaning of the text.

UpperCase(), LowerCase() should not alter the meaning of the text.

This would be a crime in any other context.


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread Felipe Monteiro de Carvalho
I agree with Daniël on this one. Simplify. ë -> Ë always

If you need something which takes into consideration the language then
build another routine with more parameters.
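
A rough sketch of that split, written in Python rather than Pascal: a default mapping that ignores language, plus an optional language parameter for the exceptions. The function `upper_case` and its `lang` codes are hypothetical and only mirror the idea discussed here; the Turkish dotted I is used as the exception because it is a well-known case. Python's `str.upper()` applies the default Unicode case mapping and does not consult the locale.

```python
def upper_case(text: str, lang: str = "") -> str:
    """Hypothetical helper: default case mapping plus optional language rules."""
    if lang == "tr":
        # Turkish: lowercase dotted i maps to capital I WITH DOT ABOVE (U+0130).
        # Dotless i (U+0131) already maps to plain I under the default rules.
        text = text.replace("i", "\u0130")
    return text.upper()

print(upper_case("Dani\u00ebl"))        # default rule keeps the trema
print(upper_case("istanbul"))           # default rule
print(upper_case("istanbul", "tr"))     # language-aware rule
```

The simple call stays simple, and callers who know the language of their text can opt in to the extra rules.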

-- 
Felipe Monteiro de Carvalho


Re: [fpc-devel] assign constant text to widestring

2008-10-23 Thread listmember

On 2008-10-24 02:46, Felipe Monteiro de Carvalho wrote:

I agree with Daniël on this one. Simplify. ë -> Ë always

If you need something which takes into consideration the language then
build another routine with more parameters.


It's not that simple.

How would you uppercase this piece of string:

In Dutch uppercase characters generally do not get tremas: Daniël
becomes DANIEL.


correctly, unless you knew that the substring Daniël is in Dutch while
the rest is in English?
