Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Michael Schnell



I don't think full UTF-16 would really be desirable over UCS-2.

Imagine you have a string of some million characters (e.g. a book). All 
functions that need to find the n-th character (like x[n], Copy, ...) 
would take forever, as they need to scan the complete string (unless 
WideString is stored in some rather complex tree-like format).
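
The scanning cost described above can be sketched as follows. This is only an illustration in Python, not FPC or Delphi code; the helper names utf16_units and unit_index_of_code_point are made up for this example, and indexes are 0-based, unlike Pascal strings.

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as a list of 16-bit integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def unit_index_of_code_point(units: list[int], n: int) -> int:
    """Return the code-unit index where the n-th (0-based) code point starts.

    The units are walked from the start, so the cost grows with n.
    """
    index = 0
    for _ in range(n):
        if index >= len(units):
            raise IndexError("code point index out of range")
        # A high surrogate (0xD800..0xDBFF) starts a two-unit pair.
        is_high = 0xD800 <= units[index] <= 0xDBFF
        index += 2 if is_high else 1
    if index >= len(units):
        raise IndexError("code point index out of range")
    return index


units = utf16_units("a\U0001D11Eb")  # 'a', MUSICAL SYMBOL G CLEF, 'b'
print(len(units))                          # 4 code units for 3 code points
print(unit_index_of_code_point(units, 2))  # 3: 'b' starts at unit index 3
```

With plain UCS-2 the answer would simply be n, with no loop at all, which is the trade-off under discussion.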



That is a solution to isolate such code and treat it different from the
rest, not to mutilate the unicode standard.
  
I just checked Turbo Delphi on that point (it does have WideString 
operations, but e.g. TMemo works only on normal strings, so WideStrings are 
converted to plain old ANSI strings when used with TMemo.Lines):


SizeOf(WideChar) is in fact 2 (16 bits).

Dumping a WideString shows that its storage area is in fact 
an array of 16-bit WideChars.


So WideString just uses UCS-2 (and not a variable-length encoding such as UTF-16).

-Michael
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Michael Schnell



The encoding can be important for speed:
For example the widestring xml parser is up to 10 times slower than
the ansistring xml parser.
  
That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit) 
instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16-bit 
(UCS-2) value).


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Michael Schnell



s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
  
It would work, but it would need an implementation that moves the tail 
of the string around and thus would be really slow.
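
A minimal sketch of what such an implementation would have to do, written in Python purely for illustration (the helpers utf8_offset, utf8_width and set_code_point are hypothetical names, valid UTF-8 input is assumed, and indexes are 0-based here):

```python
def utf8_offset(buf: bytearray, n: int) -> int:
    """Byte offset of the n-th (0-based) code point, found by scanning."""
    seen = -1
    for offset, byte in enumerate(buf):
        # Bytes of the form 10xxxxxx are continuation bytes, not starts.
        if (byte & 0xC0) != 0x80:
            seen += 1
            if seen == n:
                return offset
    raise IndexError("code point index out of range")


def utf8_width(lead: int) -> int:
    """Length in bytes of a sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def set_code_point(buf: bytearray, n: int, ch: str) -> None:
    """Emulate s[n] := ch on a UTF-8 buffer, in place."""
    start = utf8_offset(buf, n)
    end = start + utf8_width(buf[start])
    # Slice assignment shifts the tail whenever the encoded sizes differ.
    buf[start:end] = ch.encode("utf-8")


s = bytearray("für".encode("utf-8"))  # 4 bytes: 'ü' takes two
set_code_point(s, 1, "x")             # 2-byte 'ü' replaced by 1-byte 'x'
print(s.decode("utf-8"), len(s))      # fxr 3
```

The slice assignment is where the tail gets moved: the buffer shrinks from 4 to 3 bytes, and everything after the replaced character shifts left.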


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Burkhard Carstens
Am Montag, 29. September 2008 09:25 schrieb Michael Schnell:
  The encoding can be important for speed:
  For example the widestring xml parser is up to 10 times slower than
  the ansistring xml parser.

 That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
 instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16-bit
 (UCS-2) value).

You didn't read  http://www.jacobthurman.com/?p=30 , did you?

regards
 Burkhard



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Michael Schnell



That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16-bit
(UCS-2) value).



You didn't read  http://www.jacobthurman.com/?p=30 , did you?
  
They are talking about Delphi 2009, about which I don't have any 
information at all (and don't intend to bother with until there is a 
free Turbo version of it).


I just talked about the current free Turbo Delphi version which 
obviously uses UCS-2 (plain 16 bits) and not any UTF (variable size) 
coding.


As discussed in the messages here, any UTF coding would result in a huge 
overhead e.g. when doing something like s[4] := c; (s: WideString; c: 
WideChar) as the (potentially huge) tail of the string would need to be 
moved around according to the different sizes of the codes of the 
previous s[4] and c.


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Ivo Steinmann
Michael Schnell schrieb:

 The encoding can be important for speed:
 For example the widestring xml parser is up to 10 times slower than
 the ansistring xml parser.
   
 That obviously is the reason why Turbo Delphi uses UCS-2 (16 bit)
 instead of UTF-8 or UTF-16 for WideStrings (and WideChar is a 16-bit
 (UCS-2) value).

 -Michael

Are you sure they are using UCS-2 and not some 16-bit code page? Those
exist too ;)


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-29 Thread Michael Schnell



are you sure they are using UCS2 and not some 16bit codepages? That
exists also ;)


Not really.

I checked the Unicode code points 0x0100 and 0x0101 (capital and lower-case A 
with a macron). Both are displayed correctly in the debugger when pointing 
at the WideString variable. So I guess it indeed is Unicode.


-Michael




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-28 Thread Martin Schreiber
On Sunday 28 September 2008 00.10:43 Graeme Geldenhuys wrote:
 On Fri, Sep 26, 2008 at 5:02 PM, Mattias Gaertner

 [EMAIL PROTECTED] wrote:
  s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
 
  In short:
  A single character for all purposes can not be defined. Unicode can not
  be handled as array of character.

 This is what I thought, but everybody seems to side step the answer.
 Thanks Mattias for confirming this. Like I told Martin in one of my
 replies. In the last four years I have not needed indexing into a
 character array, and if I have to parse a string, it's normally
 sequential anyway, which makes it easy to track each character in UTF-8,
 even if multi-byte characters are used.


Note that UTF8CharAtByte() won't work in Mattias' example either.
It seems that Apple decided to use two characters from the BMP to denote 
umlauts.
Example for ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS):
a (U+0061 LATIN SMALL LETTER A) followed by ¨ (U+0308 COMBINING DIAERESIS). 
Mattias, please correct me if I am wrong.
So the problem is not that the characters don't fit in the UCS-2 range; the 
problem is that Apple uses the decomposed forms of umlauts.
If you work with OS X HFS and fpGUI uses the composed form internally, you must 
convert the filenames to the composed normal form before processing them in fpGUI.
This is independent of using UTF-8, UTF-16, UTF-32 or UCS-2. You need conversion 
tables to do so and, again, it is easier to handle with widestrings than with 
UTF-8 strings if you don't need characters which don't fit into the BMP.
And even if you want to support the full Unicode code point range, it is simpler 
with UTF-16 because there are only surrogate *pairs*.
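
The composed and decomposed forms of ä can be compared with a short sketch. It uses Python's standard unicodedata module as a stand-in for the conversion tables mentioned above; it shows the standard NFC and NFD forms only, which is an assumption about what a toolkit would want and may not match exactly what a given file system stores.

```python
import unicodedata

composed = "\u00E4"     # ä as one code point
decomposed = "a\u0308"  # 'a' followed by COMBINING DIAERESIS

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(decomposed.encode("utf-16-le")) // 2)              # 2 code units, both in the BMP
```

Both code points of the decomposed form fit in a single 16-bit unit each, so the mismatch appears regardless of the encoding chosen.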

In MSEgui I would implement the normalization into the MSEgui filename 
routines, MSEgui uses a normalized cross platform filename scheme anyway.
Win32 'c:\\bbb.ext' will be normalized to MSEgui form '/c://bbb.ext', 
Unicode composed normalization can be done in the same step.

An article about Unicode normalization:

http://en.wikipedia.org/wiki/Unicode_normalization

Martin


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-28 Thread Mattias Gaertner
On Sun, 28 Sep 2008 09:23:14 +0200
Martin Schreiber [EMAIL PROTECTED] wrote:

 On Sunday 28 September 2008 00.10:43 Graeme Geldenhuys wrote:
  On Fri, Sep 26, 2008 at 5:02 PM, Mattias Gaertner
 
  [EMAIL PROTECTED] wrote:
   s[i]:='x' doesn't work in UTF-8, nor UTF-16, nor UTF-32.
  
   In short:
   A single character for all purposes can not be defined. Unicode
   can not be handled as array of character.
 
  This is what I thought, but everybody seems to side step the answer.
  Thanks Mattias for confirming this. Like I told Martin in one of my
  replies. In the last four years I have not needed indexing into a
  character array, and if I have to parse a string, it's normally
  sequential anyway, which makes it easy to track each character in
  UTF-8, even if multi-byte characters are used.
 
 
 Note that UTF8CharAtByte() won't work in Mattias' example either.
 It seems that Apple decided to use two characters from the BMP to
 denote umlauts. Example for ä (U+00E4 LATIN SMALL LETTER A WITH
 DIAERESIS): a (U+0061 LATIN SMALL LETTER A) followed by ¨ (U+0308
 COMBINING DIAERESIS). Mattias, please correct me if I am wrong.

You are right. (I didn't check the exact values.)

 So the problem is not that the characters don't fit in the UCS2
 range, the problem is that Apple use the decomposed forms of umlauts.

Well, in case of a-umlaut you are right. But not in general. It
only means, that you can not use UCS2 or whatever directly. You must
convert. And the conversion can not be done trivially with some
s[i]:='x'.
Do you think Apple is so stupid to use the decomposed form, if the
composed form is equivalent?


 If you work with OS X HFS you must convert to the composed normal
 form if fpGUI uses the composed form internally before processing the
 filenames in fpGUI. This is independent of using utf-8, utf-16,
 utf-32 or UCS2. You need conversion tables to do so and again, it is
 easier to handle with widestrings instead of utf-8 strings if you
 don't need characters which don't fit into BMP. And even if you want
 to support the full Unicode code point range it is simpler with
 utf-16 because there are surrogate *pairs* only.

HFS+ uses something similar to NFD, with some differences for
historical reasons. It is recommended to *not* convert on your own and
use the apple functions. They support UTF-8, the various UTF-16
encodings and some more.

 
 In MSEgui I would implement the normalization into the MSEgui
 filename routines, MSEgui uses a normalized cross platform filename
 scheme anyway. 

You can not normalize the composed and decomposed state platform
independently. For example, Linux ext3 does not normalize in any
way and therefore distinguishes between composed a-umlaut and decomposed
a-umlaut. You can even use invalid UTF-8 sequences.


 Win32 'c:\\bbb.ext' will be normalized to MSEgui
 form '/c://bbb.ext', Unicode composed normalization can be done
 in the same step.

Is this normalized form used only internally in msegui or must the user
use them too?

 
 An article about Unicode normalization:
 
 http://en.wikipedia.org/wiki/Unicode_normalization

Thanks.
Unicode is really a zoo. The page shows that the encoding is the least
problem of unicode.


Mattias


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-28 Thread Graeme Geldenhuys
On Sun, Sep 28, 2008 at 12:22 PM, Mattias Gaertner
[EMAIL PROTECTED] wrote:

 You can not normalize the composed and decomposed state platform
 independently. For example, Linux ext3 does not normalize in any
 way and therefore distinguishes between composed a-umlaut and decomposed
 a-umlaut. You can even use invalid UTF-8 sequences.

And the plot thickens...  :-)

 Win32 'c:\\bbb.ext' will be normalized to MSEgui
 form '/c://bbb.ext', Unicode composed normalization can be done
 in the same step.

 Is this normalized form used only internally in msegui or must the user
 use them too?

I remember when I tried a MSEgui version some time back, that the IDE
itself used those normalized-form filenames. I think any file select
dialogs etc. use them. I first thought it was a bug and reported it,
and was told it's normal.

I don't know if more recent versions of MSEgui have changed or not.
All I can say is that from a user perspective, those filenames are
weird. ;-)


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-28 Thread Martin Schreiber
On Sunday 28 September 2008 20.16:36 Graeme Geldenhuys wrote:
 On Sun, Sep 28, 2008 at 12:22 PM, Mattias Gaertner

  Is this normalized form used only internally in msegui or must the user
  use them too?

 I remember when I tried a MSEgui version some time back, that the IDE
 itself used that normalized form filenames. I think any file select
 dialogs etc uses that. I first thought it was a bug and reported it,
 and was told it's normal.

 I don't know if more recent versions of MSEgui has changed or not.
 All I can say is that from a user perspective, those filenames are
 weird. ;-)

It has not changed. You can either enter the system specific or the MSEgui 
form of filenames into filename widgets, they eat both. :-)
On Linux the system and the MSEgui filenames are the same.

Martin


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-27 Thread Luiz Americo Pereira Camara

Graeme Geldenhuys wrote:

(AFAI understand, a Widechar is just 16 bit, it would need to
be 32 bit if surrogates were allowed in Widestrings).



Good question and I have been wondering about this myself.  In D2009
SizeOf(Char) = 2, so I have no idea how that works with surrogate
pairs. Can anybody explain this please?
  

In http://www.jacobthurman.com/?p=30 you can find some explanation.

http://blogs.codegear.com/abauer/2008/01/09/38845 tries to explain 
why UTF-16 was chosen.


Luiz




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-27 Thread Graeme Geldenhuys
On Sat, Sep 27, 2008 at 2:35 PM, Luiz Americo Pereira Camara
[EMAIL PROTECTED] wrote:
 Good question and I have been wondering about this myself.  In D2009
 SizeOf(Char) = 2, so I have no idea how that works with surrogate
 pairs. Can anybody explain this please?


 In http://www.jacobthurman.com/?p=30 you can find some explanation.

Thank you!  This link was very informative and it explains it in
plain, easy to understand terms.  :-)


Regards,
  - Graeme -




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:

 Who says that? UTF-16 is simply chosen because it has features (supporting
 all characters basically) ANSI doesn't?

Sorry, my message was unclear and I got somewhat mixed up between ANSI
and UTF-8. I meant the encoding type of String or UnicodeString being
UTF-16 instead of UTF-8.  The CodeGear newsgroups are full of people
saying that UTF-16 was chosen because they could call the 'W' api's
without needing a conversion.

My question is, has anybody actually seen the speed difference (actual
timing results) between a UTF-16 string calling the 'W' APIs and a UTF-8
string converted UTF-8 -> UTF-16 before calling the 'W' APIs?  With today's
computers, I can't imagine that there would be a significant speed loss using
such conversions. The speed difference might be milliseconds, but
that's not really a significant speed loss, is it?

So has anybody actually done a timing comparison? Do you have your
test code available? Do you have your results published? I'm
interested to see the timing results using different hardware.

I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
then save. And the opposite when reading, plus checking for the byte
order marker.  If you used UTF-8 for the String encoding no
conversions are required and no byte order marker checks needed.
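
A timing harness of the kind asked for could look roughly like the sketch below. It is only an assumption of how such a test might be structured: it measures Python's own codecs, not the Windows conversion routines or the FPC RTL, and the sample text and run count are arbitrary, so the numbers say nothing definitive about either compiler.

```python
import timeit

text = "Grüße, мир, 世界! " * 1000  # hypothetical sample text
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")


def from_utf8() -> bytes:
    """Decode UTF-8, then re-encode as UTF-16 (the extra step before a 'W' call)."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def already_utf16() -> bytearray:
    """Baseline: the data is already UTF-16, so it is only copied."""
    return bytearray(utf16_bytes)


runs = 200
t_conv = timeit.timeit(from_utf8, number=runs)
t_copy = timeit.timeit(already_utf16, number=runs)
print(f"convert: {t_conv / runs * 1e6:.1f} us per call")
print(f"copy:    {t_copy / runs * 1e6:.1f} us per call")
```

The same shape (fixed input, many repetitions, a no-conversion baseline) carries over to a Pascal test program; only the conversion call under test would change.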

Regards,
  - Graeme -




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Graeme Geldenhuys:


On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
[EMAIL PROTECTED] wrote:


Who says that? UTF-16 is simply chosen because it has features (supporting
all characters basically) ANSI doesn't?


Sorry, my message was unclear and I got somewhat mixed up between ANSI
and UTF-8. I meant the encoding type of String or UnicodeString being
UTF-16 instead of UTF-8.  The CodeGear newsgroups are full of people
saying that UTF-16 was chosen because they could call the 'W' api's
without needing a conversion.

My question is, has anybody actually seen the speed difference (actual
timing results) showing UTF-16 string calling 'W' api's compared to
UTF-8 -> UTF-16 and then calling the 'W' api's.  With today's computers,
I can't imagine that there would be a significant speed loss using
such conversions. The speed difference might be milliseconds, but
that's not really significant speed loss is it?


I think the main speed issue with UTF-8 is the speed of procedures like 
val. A val which accepts both western and Arabic digits would be 
significantly more complex and therefore slower in UTF-8 than in UTF-16.
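
To make the digit example concrete, here is a small sketch of a Val-like parser that accepts any Unicode decimal digits. It is written in Python and works on decoded code points, so it sidesteps the encoding question; the function name val is borrowed from the thread, and the implementation is an illustration, not the FPC routine.

```python
import unicodedata


def val(text: str) -> int:
    """Parse a non-negative decimal integer written with any Unicode decimal digits."""
    if not text:
        raise ValueError("empty string")
    result = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-digit characters.
        result = result * 10 + unicodedata.decimal(ch)
    return result


print(val("2008"))                      # 2008
print(val("\u0662\u0660\u0660\u0668"))  # 2008, written in Arabic-Indic digits

# Bytes per digit: UTF-8 varies (1 vs 2), UTF-16 is 2 for both.
print(len("2".encode("utf-8")), len("\u0662".encode("utf-8")))          # 1 2
print(len("2".encode("utf-16-le")), len("\u0662".encode("utf-16-le")))  # 2 2
```

The last two lines show the difference being argued about: in UTF-16 both kinds of digit occupy one code unit, while in UTF-8 the parser has to decode sequences of different lengths.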



I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
then save. And the opposite when reading, plus checking for the byte
order marker.  If you used UTF-8 for the String encoding no
conversions are required and no byte order marker checks needed.


For me the speed of input/output is less relevant, this is limited by disk 
speed anyway. It's the speed of processing that should be decisive.


Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Vincent Snijders

Graeme Geldenhuys schreef:

On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
I suppose it would be viable doing timing results for saving text
files as well. After all, 99% of the time, text files are stored in
UTF-8. 


Where did you get that number (99%) from? I don't think that is true, 
except maybe, if you count all ASCII files as UTF8 too.


Vincent


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 9:04 AM, Graeme Geldenhuys
[EMAIL PROTECTED] wrote:

 So has anybody actually done a timing comparison? Do you have your
 test code available? Do you have your results published? I'm
 interested to see the timing results using different hardware.


What I'm getting at, is that if FPC implements a UnicodeString based
on UTF-16, compared to UTF-8. It would be nice to base that decision
on educated research and not just a hunch that UTF-16 will be faster
on 1 of the 11 officially supported platforms.


Regards,
  - Graeme -




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Florian Klaempfl
Graeme Geldenhuys schrieb:
 On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
 [EMAIL PROTECTED] wrote:
 Who says that? UTF-16 is simply chosen because it has features (supporting
 all characters basically) ANSI doesn't?
 
 Sorry, my message was unclear and I got somewhat mixed up between ANSI
 and UTF-8. I meant the encoding type of String or UnicodeString being
 UTF-16 instead of UTF-8.  The CodeGear newsgroups are full of people
 saying that UTF-16 was chosen because they could call the 'W' api's
 without needing a conversion.
 
 My question is, has anybody actually seen the speed difference (actual
 timing results) showing UTF-16 string calling 'W' api's compared to
 UTF-8 -> UTF-16 and then calling the 'W' api's.  With today's computers,
 I can't imagine that there would be a significant speed loss using
 such conversions. The speed difference might be milliseconds, but
 that's not really significant speed loss is it?

Windows has no UTF-8 string processing routines, so any case conversion,
comparison or whatever needs a UTF-8 -> UTF-16 conversion.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Aleksa Todorovic
On Fri, Sep 26, 2008 at 09:04, Graeme Geldenhuys
[EMAIL PROTECTED] wrote:
 On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
 [EMAIL PROTECTED] wrote:


 I suppose it would be viable doing timing results for saving text
 files as well. After all, 99% of the time, text files are stored in
 UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
 then save. And the opposite when reading, plus checking for the byte
 order marker.  If you used UTF-8 for the String encoding no
 conversions are required and no byte order marker checks needed.


That is true. But, on the other hand, 99% of the time your
application will work with strings in memory, and only 1% of the time will
be spent on I/O. (OK, this is for a normal application; special cases
like databases are special cases anyway.) I don't really think that
file encoding is a strong argument regarding the internal string
representation. When you read a text file, it's inevitable that you'll
parse it in some way. And parsing is a lot slower than simple
character conversions.

I support the decision to use UTF-16 over UTF-8. String processing is
far simpler; it's actually as simple as it should be. Have you
ever done any serious processing using UTF-8? It's not a nightmare, but
it's surely a real pain. No such problems with UTF-16. You don't need to
think about encodings & conversions all the time.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 9:12 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:

 For me the speed of input/output is less relevant, this is limited by disk
 speed anyway. It's the speed of processing that should be decisive.

That's highly dependent on what your application does!  If your
application primarily parses text files, it's relevant. :-)


Regards,
  - Graeme -




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Florian Klaempfl
Graeme Geldenhuys schrieb:
 On Fri, Sep 26, 2008 at 9:04 AM, Graeme Geldenhuys
 [EMAIL PROTECTED] wrote:
 So has anybody actually done a timing comparison? Do you have your
 test code available? Do you have your results published? I'm
 interested to see the timing results using different hardware.
 
 
 What I'm getting at, is that if FPC implements a UnicodeString based
 on UTF-16, compared to UTF-8. It would be nice to base that decision
 on educated research and not just a hunch that UTF-16 will be faster
 on 1 of the 11 officially supported platforms.

To be honest, IMO UTF-8 is only a hack to get Unicode on platforms like
Unix. Further, processing UTF-16 is much easier, for a lot of
applications faster, and for important scripts like Chinese more memory
efficient. If UTF-8 were easy to handle, we wouldn't have to convert
everything to UTF-32 on Unix to do case conversions, comparisons etc.
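
The memory point can be checked with a few lines. This is an illustrative Python sketch with arbitrarily chosen sample words; the sizes depend entirely on the text, so it supports the claim for Chinese and contradicts it for mostly-ASCII text.

```python
samples = {
    "English": "Unicode",
    "German": "Grüße",
    "Chinese": "统一码",
}

for name, text in samples.items():
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))  # no BOM
    print(f"{name}: {utf8_size} bytes UTF-8, {utf16_size} bytes UTF-16")

# English: 7 bytes UTF-8, 14 bytes UTF-16
# German: 7 bytes UTF-8, 10 bytes UTF-16
# Chinese: 9 bytes UTF-8, 6 bytes UTF-16
```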


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 9:19 AM, Aleksa Todorovic [EMAIL PROTECTED] wrote:
 I support the decision to use UTF-16 over UTF-8. String processing is
 far simpler; it's actually as simple as it should be.

And that's guaranteed to work with surrogate pairs as well? The
problem is, most people assume UTF-16 = UCS-2 and never bother to check
if surrogate pairs are well supported, irrespective of whether most languages
incidentally fall in the BMP.

 Have you
 ever done any serious processing using UTF-8? It's not a nightmare, but
 it's surely a real pain. No such problems with UTF-16. You don't need to
 think about encodings & conversions all the time.

Well if you have Utf-8 versions of all basic string processing
functions like Pos, Length, Copy, Insert etc you don't have to think
of encoding or anything. fpGUI uses UTF-8 internally, and I never have
to think about what encoding I'm working with. I assume Lazarus LCL is
the same.
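
What UTF-8 versions of Length and Copy have to do can be sketched as follows. This is a Python illustration with hypothetical helper names (utf8_length, utf8_copy), not the fpGUI or LCL routines, and it assumes valid UTF-8 input; the index is 1-based to mirror Pascal's Copy.

```python
def utf8_length(data: bytes) -> int:
    """Count code points by counting bytes that are not continuation bytes."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def utf8_copy(data: bytes, index: int, count: int) -> bytes:
    """Return count code points starting at the 1-based code point index."""
    starts = [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]
    starts.append(len(data))  # sentinel: end of the buffer
    first = min(max(index - 1, 0), len(starts) - 1)
    last = min(first + count, len(starts) - 1)
    return data[starts[first]:starts[last]]


raw = "Größe".encode("utf-8")
print(len(raw), utf8_length(raw))            # 7 5
print(utf8_copy(raw, 3, 2).decode("utf-8"))  # öß
```

Once such helpers exist, calling code indeed does not need to think about the encoding, but each call scans the buffer instead of indexing directly.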

Regards,
  - Graeme -




Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Florian Klaempfl
Graeme Geldenhuys schrieb:
 On Fri, Sep 26, 2008 at 9:27 AM, Florian Klaempfl
 [EMAIL PROTECTED] wrote:
 Being honest, imo UTF-8 is only a hack to get unicode on platforms like
 unix.
 
 I don't know where you get that information, 

Rather simple: initially in unicode 1.0 there was only a 16 bit encoding.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Graeme Geldenhuys said:
 Yes I know we have had lengthy discussions about this before.
 Everybody (whoever they might be) keeps saying that UTF-16 was chosen
 for Tiburon's UnicodeString because it makes significant speed gains
 when calling the Windows API based on UTF-16 - compared to the ANSI
 API's. The whole debate goes that you wouldn't need constant
 conversions between ANSI -> UTF-16 -> ANSI.  Now it seems Free Pascal
 developers want to base their design on those results as well (yes,
 plus the whole compatibility thing)

Well, I discussed with Florian and Michael (and Felipe also a bit) what to
do with unicode, and I'm pretty sure speed on Windows API calls wasn't even
mentioned.
 
 Marco Cantu, as far as I can see, is the only one that shows a
 comparison and numbers. Surprisingly, the ANSI calls were faster!

Most of this is based on the old NT books from Butler c.s., where they say
that all calls of NT are Unicode, and the ASCII ones are wrappers. However, this
is aeons old, and memory constraints are less tight now, so maybe there are
two sets now. Nobody knows.

As far as Cantu goes, be very, very careful with benchmarking:

Cantu himself says this is due to repainting. Maybe with ansistrings using
CP_UTF8 the repainting is also slower. IOW it is Unicode widget painting
(any encoding) vs ANSI widget painting.



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Aleksa Todorovic said:
  I suppose it would be viable doing timing results for saving text
  files as well. After all, 99% of the time, text files are stored in
  UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
  then save. And the opposite when reading, plus checking for the byte
  order marker.  If you used UTF-8 for the String encoding no
  conversions are required and no byte order marker checks needed.
 
 That is true. But, on the other hand, 99% of the time your
 application will work with strings in memory, and only 1% of the time will
 be spent on I/O. 

This is not true. Working with Database exports (simple transformations,
pump functionality) is a quite normal task for a programmer.

 I support the decision to use UTF-16 over UTF-8. String processing is
 far simpler; it's actually as simple as it should be. Have you
 ever done any serious processing using UTF-8? It's not a nightmare, but
 it's surely a real pain. No such problems with UTF-16. 

It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.

Also note that there hasn't been a final decision about UTF-16 only. The
original idea was to have a multi-encoding string, but that got struck
because Tiburon reality crashed in.

Tiburon actually also does this; it has an automated way of dealing with
UTF-8 too.

IMHO any system should allow working with strings in the native
encoding in general. Which means UTF-8 on *nix.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Florian Klaempfl said:
  On Fri, Sep 26, 2008 at 9:27 AM, Florian Klaempfl
  [EMAIL PROTECTED] wrote:
  Being honest, imo UTF-8 is only a hack to get unicode on platforms like
  unix.
  
  I don't know where you get that information, 
 
 Rather simple: initially in unicode 1.0 there was only a 16 bit encoding.

Problem is that UTF-16 is just the same hack. And they couldn't move to
UTF-32 since it is so memory hungry.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell



Well if you have Utf-8 versions of all basic string processing
functions like Pos, Length, Copy, Insert etc 

s[i] := 'x'; will be especially funny :).

-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell



It's no different than UTF-16 if you want to do it properly. In both you
have to look out for surrogates.
  
Isn't the UTF-16 WideString in FPC (and Delphi 200x?) implemented by just 
ignoring the surrogates? (As far as I understand, a WideChar is just 16 bits; 
it would need to be 32 bits if surrogates were allowed in WideStrings.)


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 10:43 AM, Michael Schnell [EMAIL PROTECTED] wrote:

 It's no different than UTF-16 if you want to do it properly. In both you
 have to look out for surrogates.


 Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
 surrogates ?

Let's hope not, because then it would be UCS-2 and NOT UTF-16! As far
as I know D2009 (I think) handles this correctly, but I have no idea
how.

 (AFAI understand, a Widechar is just 16 bit, it would need to
 be 32 bit if surrogates were allowed in Widestrings).

Good question and I have been wondering about this myself.  In D2009
SizeOf(Char) = 2, so I have no idea how that works with surrogate
pairs. Can anybody explain this please?


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Graeme Geldenhuys:


On Fri, Sep 26, 2008 at 9:12 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:


For me the speed of input/output is less relevant, this is limited by disk
speed anyway. It's the speed of processing that should be decisive.


That's highly dependent on what your application does!  If your
application primarily parses text files, it's relevant. :-)


Shortstrings & ansistrings won't go away. You'll still be able to code 
fast text file parsers. Note that in such cases your application won't 
process Unicode. Taking the numbers example again: as soon as your 
application accepts Arabic numbers everywhere Western numbers are allowed, 
you want the parsing to happen in UTF-16.
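
As a rough illustration of the numbers example: in UTF-16 every Arabic-Indic digit (U+0660..U+0669) is a single 16-bit code unit, so a digit parser can stay a simple per-element loop. DigitValue and ParseUInt are hypothetical names made up for this sketch, and overflow is deliberately not checked:

```pascal
program MultiScriptDigits;
{$mode objfpc}{$H+}

{ Value of a decimal digit in either script, or -1 if c is not a digit.
  Hypothetical helper; only ASCII and Arabic-Indic digits are handled. }
function DigitValue(c: WideChar): Integer;
begin
  case Ord(c) of
    $0030..$0039: Result := Ord(c) - $0030;  { '0'..'9' }
    $0660..$0669: Result := Ord(c) - $0660;  { Arabic-Indic digits }
  else
    Result := -1;
  end;
end;

{ Parse an unsigned decimal number; returns False on empty input or on
  any non-digit. No overflow check in this sketch. }
function ParseUInt(const s: WideString; out v: LongWord): Boolean;
var
  i, d: Integer;
begin
  v := 0;
  Result := Length(s) > 0;
  for i := 1 to Length(s) do
  begin
    d := DigitValue(s[i]);
    if d < 0 then
    begin
      Result := False;
      Exit;
    end;
    v := v * 10 + LongWord(d);
  end;
end;

var
  s: WideString;
  v: LongWord;
begin
  { Arabic-Indic digits one, two, three }
  SetLength(s, 3);
  s[1] := WideChar($0661);
  s[2] := WideChar($0662);
  s[3] := WideChar($0663);
  if ParseUInt(s, v) then WriteLn(v);

  s := '100';
  if ParseUInt(s, v) then WriteLn(v);
end.
```

The same loop over a UTF-8 string would have to decode two-byte sequences first, which is the difference being argued about here.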


Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Michael Schnell said:
  It's no different then UTF-16 if you want to do it properly. In both you
  have to look out for surrogates.

 Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring 
 the surrogates ? 

No different from UTF-8 in principle. Base routines keep surrogate pairs
intact if you don't use them wrongly.

(AFAI understand, a Widechar is just 16 bit, it would 

And a char is 8-bit, the granularity of UTF-8, without surrogates. IOW it is
orthogonal.

 need to be 32 bit if surrogates were allowed in Widestrings).

No, it doesn't. Windows supports surrogates, and so does Tiburon afaik. They
just chose the granularity of [] to be the granularity of the encoding (the
16-bit code unit) rather than the character.
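
A small sketch of how that works with a 16-bit WideChar: a code point above $FFFF is stored as two code units (a surrogate pair), Length() and [] count code units, and a character count needs a scan. EncodeUtf16 and CodePointCount are hypothetical names used only for this illustration:

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

{ Encode code point cp (0..$10FFFF, assumed not to be a surrogate itself)
  as one or two UTF-16 code units. Hypothetical helper. }
function EncodeUtf16(cp: LongWord): WideString;
begin
  if cp < $10000 then
  begin
    SetLength(Result, 1);
    Result[1] := WideChar(Word(cp));
  end
  else
  begin
    Dec(cp, $10000);
    SetLength(Result, 2);
    Result[1] := WideChar(Word($D800 + (cp shr 10)));   { high surrogate }
    Result[2] := WideChar(Word($DC00 + (cp and $3FF))); { low surrogate }
  end;
end;

{ Count characters (code points) rather than 16-bit code units. }
function CodePointCount(const s: WideString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(s) do
  begin
    if (Ord(s[i]) >= $D800) and (Ord(s[i]) <= $DBFF) and (i < Length(s))
       and (Ord(s[i + 1]) >= $DC00) and (Ord(s[i + 1]) <= $DFFF) then
      Inc(i, 2)   { a surrogate pair is one character }
    else
      Inc(i);
    Inc(Result);
  end;
end;

var
  s: WideString;
begin
  { 'A', U+13000 (an Egyptian hieroglyph, outside the BMP), 'B' }
  s := EncodeUtf16(Ord('A')) + EncodeUtf16($13000) + EncodeUtf16(Ord('B'));
  WriteLn('Length(s)      = ', Length(s));          { code units }
  WriteLn('CodePointCount = ', CodePointCount(s));  { characters }
  WriteLn('s[2] = $', HexStr(LongInt(Ord(s[2])), 4),
          ', s[3] = $', HexStr(LongInt(Ord(s[3])), 4));
end.
```

Here Length(s) is 4 while the string holds 3 characters, and s[2] and s[3] are the two halves ($D80C, $DC00) of one character.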


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Ivo Steinmann
Graeme Geldenhuys schrieb:
 On Thu, Sep 25, 2008 at 10:33 PM, Florian Klaempfl
 [EMAIL PROTECTED] wrote:
   
 Who says that? UTF-16 is simply chosen because it has features (supporting
 all characters basically) ANSI doesn't?
 

 Sorry, my message was unclear and I got somewhat mixed up between ANSI
 and UTF-8. I meant the encoding type of String or UnicodeString being
 UTF-16 instead of UTF-8.  The CodeGear newsgroups are full of people
 saying that UTF-16 was chosen because they could call the 'W' api's
 without needing a conversion.

 My question is, has anybody actually seen the speed difference (actual
 timing results) showing UTF-16 string calling 'W' api's compared to
 UTF-8->UTF-16 and then calling the 'W' api's.  With today's computers,
 I can't imagine that there would be a significant speed loss using
 such conversions. The speed difference might be milliseconds, but
 that's not really significant speed loss is it?

 So has anybody actually done a timing comparison? Do you have your
 test code available? Do you have your results published? I'm
 interested to see the timing results using different hardware.

 I suppose it would be viable doing timing results for saving text
 files as well. After all, 99% of the time, text files are stored in
 UTF-8. So in D2009 you would first have to convert UTF-16 to UTF-8 and
 then save. And the opposite when reading, plus checking for the byte
 order marker.  If you used UTF-8 for the String encoding no
 conversions are required and no byte order marker checks needed.

 Regards,
   - Graeme -
   

In the core of all Windows NT systems, there's the NT API. The normal
WinAPI sits on top of the NT API. The NT API itself uses UTF-16 as its
string type!

type
  UNICODE_STRING = record
    Length: USHORT;         { length of Buffer in bytes, not characters }
    MaximumLength: USHORT;  { allocated size of Buffer in bytes }
    Buffer: PWSTR;          { pointer to the UTF-16 data }
  end;

const
  FileShareMode = FILE_SHARE_READ or FILE_SHARE_WRITE or FILE_SHARE_DELETE;
var
  str: UNICODE_STRING;  { utf16 type from ntapi }
  attr: OBJECT_ATTRIBUTES;
  io: IO_STATUS_BLOCK;
  ntmode: Integer;
  Handle: longword;
begin
  { str (the file name) and ntmode (the access mask) must be set up
    before the call; that part is left out here }
  attr.Length := sizeof(attr);
  attr.RootDirectory := 0;
  attr.Attributes := 0;
  attr.ObjectName := @str;
  attr.SecurityDescriptor := nil;
  attr.SecurityQualityOfService := nil;

  NtOpenFile(@Handle, ntmode, @attr, @io, FileShareMode,
    FILE_NON_DIRECTORY_FILE or FILE_SYNCHRONOUS_IO_NONALERT);
end;



So at its core, Windows NT works with UTF-16. All ANSI WinAPI functions map
to these NT calls.

-Ivo Steinmann



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Daniël Mantione said:
  That's highly dependant on what you application does!  If your
  application primarily parses text files, it's relevant. :-)
 
 Shortstrings & ansistrings won't go away. You'll still be able to code 
 fast text file parsers. Note that in such cases your application won't 
 process unicode, taking the numbers example again: As soon as your 
 application accepts arabic numbers everywhere western numbers are allowed, 
 you want the parsing to happen in UTF-16.

Accepting both Arabic and Westernized Arabic numerals would possibly break a
lot of code anyway, since converting to string and back wouldn't be
reversible. (It actually already isn't with Delphi, I know, due to hex and
padding handling, but this would be an order of magnitude worse.)

You can't separate val from str, and what would str(100,s) do?


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Graeme Geldenhuys:


On Fri, Sep 26, 2008 at 10:43 AM, Michael Schnell [EMAIL PROTECTED] wrote:



It's no different then UTF-16 if you want to do it properly. In both you
have to look out for surrogates.



Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
surrogates ?


Lets hope not, because then it would be UCS-2 and NOT UTF-16! As far
as I know D2009 (I think) handles this correctly, but I have no idea
how.


Let me put it like this: Someone writing a Russian/Arabic/Japanese spell 
checker does not have to handle surrogates with UTF-16, but he does with 
UTF-8, i.e. UTF-16 is much better for them than UTF-8.


Someone writing a spell checker for old-Egyptian Hieroglyphs will have to 
deal with surrogates. For those people UTF-16 has few advantages over 
UTF-8 (although in practice it's still a bit easier to handle than UTF-8).


Russian, Arabic, Japanese are languages in daily use on computers, 
countless electronic documents in these languages exist. There is a 
huge interest in software handling them, and therefore they are worth 
spending our valuable time on. Egyptian Hieroglyphs are not worth spending 
our valuable time on.


Some UTF-16 support should come by default, like UTF-8 <-> UTF-16 
conversion. In many situations it will not be necessary to bother with 
surrogates at all. In some situations we may just accept patches if 
someone is interested.


Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Ivo Steinmann
Ivo Steinmann schrieb:

 In the core of all windows nt systems, there's the NT API. The normal
 WinAPI is on the top of the NTAPI. the NT API itself uses UTF-16 as
 stringtype!

 type
   UNICODE_STRING = record
 Length: USHORT;
 MaximumLength: USHORT;
 Buffer: PWSTR;
   end;

 const
   FileShareMode = FILE_SHARE_READ or FILE_SHARE_WRITE or FILE_SHARE_DELETE;
 var
   str: UNICODE_STRING;  { utf16 type from ntapi }
   attr: OBJECT_ATTRIBUTES;
   io: IO_STATUS_BLOCK;
   ntmode: Integer;
   Handle: longword;
 begin
   attr.Length := sizeof(attr);
   attr.RootDirectory := 0;
   attr.Attributes := 0;
   attr.ObjectName := @str;
   attr.SecurityDescriptor := nil;
   attr.SecurityQualityOfService := nil;

   NtOpenFile(@Handle, ntmode, @attr, @io, FileShareMode,
 FILE_NON_DIRECTORY_FILE or FILE_SYNCHRONOUS_IO_NONALERT)
 end;



 So in core, winnt is working with UTF16. All ANSI Winapi functions map
 to these winnt calls.

 -Ivo Steinmann
   

That's the OBJECT_ATTRIBUTES type:

  OBJECT_ATTRIBUTES = record
    Length: ULONG;
    RootDirectory: HANDLE;
    ObjectName: PUNICODE_STRING;
    Attributes: ULONG;
    SecurityDescriptor: PVOID;        // Points to type SECURITY_DESCRIPTOR
    SecurityQualityOfService: PVOID;  // Points to type SECURITY_QUALITY_OF_SERVICE
  end;


If FPC used the NT API instead of the WinAPI (maybe it does, no idea), it
would be faster, because there's no conversion overhead at all :)  at least
with the new UnicodeString type. The NT API is also quite close to the
functions you know as syscalls from Unix.

-Ivo


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 11:11 AM, Ivo Steinmann [EMAIL PROTECTED] wrote:

 So in core, winnt is working with UTF16. All ANSI Winapi functions map
 to these winnt calls.

So then there is already a conversion going on, from the ANSI API to the
UTF-16 API.  I still think (and will try to put together some
benchmark app over the weekend) that the cost of the
UTF8->UTF16->API call conversion is going to be so small that it's hardly
something to talk about. Especially with today's CPUs.
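
A benchmark of that kind could start out roughly like the sketch below. It only times the UTF-8 to UTF-16 conversion itself, not an actual 'W' API call, and it assumes a reasonably recent FPC (3.x) where UTF8Decode returns a UnicodeString and SysUtils provides GetTickCount64. The path string and iteration count are arbitrary choices for illustration, so no numbers are claimed here:

```pascal
program ConvBench;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  Iterations = 1000000;  { arbitrary; raise it if the timer reads 0 ms }

var
  src: UTF8String;
  dst: UnicodeString;
  i: Integer;
  t0: QWord;
begin
  { Something about the size of a typical file name passed to an API. }
  src := 'C:\some\typical\path\file.txt';

  t0 := GetTickCount64;
  for i := 1 to Iterations do
    dst := UTF8Decode(src);   { the UTF-8 -> UTF-16 step under test }
  WriteLn(Iterations, ' conversions took ', GetTickCount64 - t0, ' ms');

  { Use the result so the conversion is not trivially dead code. }
  WriteLn('UTF-16 length: ', Length(dst));
end.
```

To answer the original question properly, the loop body would also have to include the API call on both the converting and the non-converting path, on several machines.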


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 11:17 AM, Daniël Mantione
[EMAIL PROTECTED] wrote:

 Russian, Arabic, Japanese are languages in daily use on computers, countless
 electronic documents in these languages exist.

And most documents that exist in the world are in UTF-8 format: Save
to file, HTML documents etc... :-)

 In many situations it will not be necessary to bother with
 surrogates at all. In some situations we may just accept patches if someone
 is interrested.

Oh? So what now: is FPC only going to implement UCS-2 support (like
MSEgui did)?


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Jonas Maebe


On 26 Sep 2008, at 10:43, Michael Schnell wrote:

Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just  
ignoring the surrogates ?


At least the Unix widestring manager fully supports surrogates (except  
if you use the MSEide-patched version, where that support has been removed  
because it is considered unnecessary overhead).



Jonas



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Marco van de Voort:


In our previous episode, Daniël Mantione said:

That's highly dependant on what you application does!  If your
application primarily parses text files, it's relevant. :-)


Shortstrings & ansistrings won't go away. You'll still be able to code
fast text file parsers. Note that in such cases your application won't
process unicode, taking the numbers example again: As soon as your
application accepts arabic numbers everywhere western numbers are allowed,
you want the parsing to happen in UTF-16.


Accepting both Arabic and Westernized Arabic numerals would possibly break a
lot of code anyway, since to string and back wouldn't be reversible.


It has never been reversible. Think about val('$100',v);


actually already isn't with Delphi I know, due to hex and padding handling,
but this would be a magnitude worse)


You want to handle it transparently. Otherwise you get a mess where 
people need all kinds of ugly case constructs, having to call a 
different val routine depending on the language the program is shown in. 
That way you will never get good multi-lingual support.


For many people Unicode is just "let's go UTF-8". It's far more than that, 
and supporting Unicode 100% is next to impossible.



You can't seperate val from str, and what would str(100,s) do?


It could accept an extra optional parameter for the desired script or 
something like that.


Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 11:31 AM, Marco van de Voort [EMAIL PROTECTED] wrote:
 Someone writing a spell checker for old-Egyptian Hieroglyphs will have to
 deal with surrogates. For those people UTF-16 has few advantages over
 UTF-8, (allthough in practice it's still a bit easier to handle than UTF-8).

 IMHO such assumptions can be made for end user business code (and only if the
 CJK pages above $FFFF are ancient and not in modern use), however the RTL and
 other libraries should simply be Unicode compliant. Period.

I fully agree. In fact, application developers shouldn't even have to
bother with encoding types etc. All string functions and string
handling in the RTL should take care of that. [I can say that, because
I have no clue how the FPC RTL internals work. ;-) ]


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Marco van de Voort:


In our previous episode, Daniël Mantione said:

as I know D2009 (I think) handles this correctly, but I have no idea
how.


Let me put it like this: Someone writing a Russian/Arabic/Japanese spell
checker does not have to handle surrogates with UTF-16, but he does with
UTF-8, i.e. UTF-16 is much better for them than UTF-8.


Are you sure? There is a CJK plane above $FFFF.


Chinese yes, Japanese is fully BMP.

 Afaik these are non

simplified glyphs used for titles etc. Less than normal script, but not that
rare.


Someone writing a spell checker for old-Egyptian Hieroglyphs will have to
deal with surrogates. For those people UTF-16 has few advantages over
UTF-8, (allthough in practice it's still a bit easier to handle than UTF-8).


IMHO such assumptions can be made for end user business code (and only if the
CJK pages above $FFFF are ancient and not in modern use), however the RTL and
other libraries should simply be Unicode compliant. Period.


Yes.

Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Daniël Mantione said:
 
  Accepting both Arabic and Westernized Arabic numerals would possibly break a
  lot of code anyway, since to string and back wouldn't be reversible.
 
 It has never been reversible. Think about val('$100',v);

See one line further down.
 
  actually already isn't with Delphi I know, due to hex and padding handling,
  but this would be a magnitude worse)
 
 You want to handle it transparently. Otherwise you get a mess like 
 that people need all kind of ugly case constructs, having to call a 
 different val routine depending on the language the program is shown in. 
 That way you never will get good multi-lingual support.

IMHO one should separate GUI val from system val. IMHO it is a presentation
layer problem and should be dealt with there.
 
 For many people Unicode is just let's go UTF-8. It's far more than that 
 and 100% supporting Unicode is even next to impossible.

Correct, but that is what I'm suggesting. UTF-16 is not a cure all either,
only at a first superficial glance. I'm btw not for UTF-8, but for working
in the native encoding per platform.

  You can't seperate val from str, and what would str(100,s) do?
 
 It could accept an extra optional parameter for the desired script or 
 something like that.

If you think that is acceptable, you can also do it for val.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Martin Schreiber
On Friday 26 September 2008 09.34:44 Graeme Geldenhuys wrote:

 Well if you have Utf-8 versions of all basic string processing
 functions like Pos, Length, Copy, Insert etc you don't have to think
 of encoding or anything. fpGUI uses UTF-8 internally, and I never have
 to think about what encoding I'm working with. I assume Lazarus LCL is
 the same.

It seems you prefer utf-8 over utf-16 for internal string encoding in a GUI 
framework. Why?
I prefer utf-16 over utf-8 for MSEide+MSEgui because *all* current users 
(including the Chinese) can use a simple string index to access the characters 
of the languages they use, whereas almost nobody can use a string index to 
access characters in utf-8.

Martin


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Martin Schreiber said:
  Well if you have Utf-8 versions of all basic string processing
  functions like Pos, Length, Copy, Insert etc you don't have to think
  of encoding or anything. fpGUI uses UTF-8 internally, and I never have
  to think about what encoding I'm working with. I assume Lazarus LCL is
  the same.
 
 It seems you prefer utf-8 over utf-16 for internal string encoding in a GUI 
 framework. Why?

 I prefer utf-16 over utf-8 for MSEide+MSEgui because *all* current users 
 (including the Chinese) can use simple string index to access the
 characters

See my previous discussion with Daniel. There is a CJK block over $FFFF
(afaik containing non-simplified Chinese). Moreover, with Vista there are no
special fonts or East Asia versions needed anymore to use these.

 of their used languages and almost nobody can use string index to access 
 characters in utf-8.

If you do it right, you can't with UTF-16 either. Moreover, you get a split
between the encoding used for the GUI (utf-16, as forced by you), and a
system using UTF-8 on e.g. the free unices.

This was originally the reason for FPC to at least support both encodings:
UTF-8 users can, for those few routines in their business code where they
must hack something character-based together, simply declare those routines
with a forced UTF-16 string type, and the system will autoconvert, without
the entire system having to be UTF-16.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 11:46 AM, Martin Schreiber [EMAIL PROTECTED] wrote:
 It seems you prefer utf-8 over utf-16 for internal string encoding in a GUI
 framework. Why?
 I prefer utf-16 over utf-8 for MSEide+MSEgui because *all* current users
 (including the Chinese) can use simple string index to access the characters
 of their used languages and almost nobody can use string index to access
 characters in utf-8.

In my years of experience, string index access is not a requirement.
In the last four years working on our current project I have still not
had a need for string index access. It's an overrated statement as far
as I'm concerned.

UTF8CharAtByte() or UTF8Copy() if needed is fine for me.  And if you
are parsing a string, it happens sequentially anyway, so it's very
easy to track characters in a utf8 string.
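
For illustration, sequential parsing of a UTF-8 string can look like the sketch below: the parser keeps a byte position and advances it one character at a time, so no indexed access is needed. NextCodePoint is a hypothetical helper written for this example (it is not the fpGUI or LCL routine), and it assumes well-formed UTF-8 without validating it:

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

{ Decode the code point that starts at byte index p (1-based) of s and
  advance p past it. Assumes s is well-formed UTF-8; no validation. }
function NextCodePoint(const s: AnsiString; var p: Integer): LongWord;
var
  b: Byte;
  extra: Integer;
begin
  b := Ord(s[p]);
  Inc(p);
  if b < $80 then
  begin
    Result := b;
    extra := 0;
  end
  else if (b and $E0) = $C0 then
  begin
    Result := b and $1F;
    extra := 1;
  end
  else if (b and $F0) = $E0 then
  begin
    Result := b and $0F;
    extra := 2;
  end
  else
  begin
    Result := b and $07;
    extra := 3;
  end;
  { Each continuation byte contributes its low 6 bits. }
  while extra > 0 do
  begin
    Result := (Result shl 6) or LongWord(Ord(s[p]) and $3F);
    Inc(p);
    Dec(extra);
  end;
end;

var
  s: AnsiString;
  p, count: Integer;
  cp: LongWord;
begin
  s := 'a'#$D0#$B6#$E2#$82#$AC;   { 'a', U+0436, U+20AC }
  p := 1;
  count := 0;
  while p <= Length(s) do
  begin
    cp := NextCodePoint(s, p);
    Inc(count);
    WriteLn('char ', count, ': U+', HexStr(LongInt(cp), 4));
  end;
  WriteLn(count, ' characters in ', Length(s), ' bytes');
end.
```

The walk is linear in the byte length, which is the same cost a byte-indexed parser pays anyway.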


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Ivo Steinmann
Marco van de Voort schrieb:
  
   
 For many people Unicode is just let's go UTF-8. It's far more than that 
 and 100% supporting Unicode is even next to impossible.
 

 Correct, but that is what I'm suggesting. UTF-16 is not a cure all either,
 only at a first superficial glance. I'm btw not for UTF-8, but for working
 in the native encoding per platform.

   
I guess that would be one of the best solutions. Having a system unicode
string type and then some specialized string types.

SysString
UTF8String
UTF16String
UTF32String



Anyway, I still think something like this would be nice ;) I already have
an implementation of such a system, and I think it's not the best
solution (there's no best solution), but it's not a bad one. Let's see
what the next Delphi version brings, but my code works like this:

type
  TMapFunction = function(const Dest: Pointer; const Source: Pointer): Integer;

  PEncodings = ^TEncodings;
  TEncodings = record
    signsize: Integer;     // 1, 2 or 4
    encode: TMapFunction;  // encode some ucs32 string to this encoding
    decode: TMapFunction;  // decode this encoding to ucs32 buffer
  end;

const
  MyOwnEncodings: TEncodings = (
    Foo: 
    Bar: 
  );

type
  SysString = UnicodeString[SystemEncoding]
  UTF8String = UnicodeString[UTF-8]
  UTF16String = UnicodeString[UTF-16]
  MyOwnString = UnicodeString[MyOwnEncodings]


Then you can assign all specialized string types to UnicodeString, but
you can't change the encoding of UnicodeString (either it's not
changeable at all or it's locked):

TUnicodeStringRec = record
  Encoding: PEncodings;
  Locked: Boolean;      // locked encoding: SetEncoding(S, someEncoding) is not possible
  CodeCount: Integer;   // number of signs
  RefCount: Integer;    // refcounter
  Length: Integer;      // number of chars
  FirstChar: Byte/Word/Longword;
end;


The encoding is always locked after you have assigned a specialized string
to a UnicodeString, e.g.

S1: UTF8String;
S2: UnicodeString;

S1 := 'foobar';
S2 := S1;
SetEncoding(S2, UTF16);   // --> exception

For fast string processing, it's easy to convert a string to UCS32:

S1: UTF8String;
S2: UCS32String;
P: PUCS32Char;

S2 := S1;
for i := 1 to Length(S2) do
  S2[i] := 'X';
S1 := S2;

or

P := PUCS32Char(S2);
while P^ <> 0 do
begin
  P^ := 'X';
  Inc(P);
end;


-Ivo Steinmann


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Martin Schreiber said:
 
 Hmm, you should ask the Russian users for example if they prefer MSEgui 
 utf-16 
 internal encoding or Lazarus utf-8.

Users always look short term, and want to change as little as possible. 

This goes both for UTF-16 (with the "it is UCS-2" approximation, keeping the
old ways of string indexing) and for UTF-8 (as a superset of ansi, avoidance
of multiple file types (no endianness)).

Note that e.g. source code seems to go en masse in the direction of UTF-8
(even Tiburon, which works exclusively on Windows, a UTF-16 platform, saves
source as UTF-8 by default afaik).

Anyway, I think a mix of UTF-8 and UTF-16 is here to stay, so better deal
with it. UTF-8 won't go away as legacy anytime soon.

It's the developer's responsibility to keep an eye out for the long term
direction of a toolchain.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Ivo Steinmann said:
  in the native encoding per platform.
 

 I guess that would be one of the best solutions. Having a system unicode
 string type and then some specialized string types.
 
 SysString
 UTF8String
 UTF16String
 UTF32String
 Anyway, I still think something like this would be nice ;)

This originally was the plan. The implementation differed however between
different solutions due to problems with automated conversions.

However it turned out that Tiburon made a different choice, and chose to
keep tunicodestring UTF16 only, and map UTF-8 on ansistring (and add
codepages support to ansistring too)

Since the Tiburon system has the most important required properties, I think
it is useless to invent a different solution.

Btw, IMHO working with mixed encodings should be possible without using
procedures like SetEncoding. That is hidden manual string handling, which
has no place in an automated system.

IOW the system should be declarative.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Paul Ishenin

Martin Schreiber wrote:
Hmm, you should ask the Russian users for example if they prefer MSEgui utf-16 
internal encoding or Lazarus utf-8.
  
You are mixing things up a bit. People from the Russian forum prefer fewer 
bugs. And the utf8 implementation of Lazarus brought them a lot. This is the 
difference.


And btw, I've never heard from them that they dislike utf8. Although they 
do have problems, especially with retrieving ansi data from their databases.


Best regards,
Paul Ishenin.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Graeme Geldenhuys
On Fri, Sep 26, 2008 at 12:34 PM, Marco van de Voort [EMAIL PROTECTED] wrote:
 I guess that would be one of the best solutions. Having a system unicode
 string type and then some specialized string types.

 SysString
 UTF8String
 UTF16String
 UTF32String
 Anyway, I still think something like this would be nice ;)

 This originally was the plan. The implementation differed however between
 different solutions due to problems with automated conversions.

Taking a step back from Free Pascal and Tiburon... How do other
frameworks handle string encodings etc.? Frameworks like Java, Qt
etc. Can't we learn something from them as well?  Both Java and Qt
run on multiple platforms, read/write to files, do string manipulation
etc. I don't know those frameworks well, but they have a huge
developer base and are backed by huge companies (with plenty of developers
working on those frameworks). Plus, they have been supporting Unicode
for ages already! I'm sure we can learn something from their
experience.


Regards,
  - Graeme -


___
fpGUI - a cross-platform Free Pascal GUI toolkit
http://opensoft.homeip.net/fpgui/


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell



Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
surrogates ?



Lets hope not, 

I don't think full UTF-16 would really be desirable over UCS-2.

Imagine you have a string of some million characters (e.g. a book). All 
functions that need to find the n-th character (like x[n], copy, ...) 
would take forever, as they need to scan the complete string (unless 
WideString is a rather complex tree-like format).


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell


  

need to be 32 bit if surrogates were allowed in Widestrings).


How do you squeeze a value > $FFFF into a 16-bit value?

Can you magically store two bits in a single hardware cell?

-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Daniël Mantione



Op Fri, 26 Sep 2008, schreef Graeme Geldenhuys:



Taking a step back from Free Pascal and Tiburon How do other
frameworks handle string encodings etc... Frameworks like Java, Qt
etc... Can't we learn something from them as well?  Both Java and Qt
run on multiple platforms, read/write to files, do string manipulation
etc  I don't know those frameworks well, but they have huge
developer base and backed by huge companies (with plenty of developers
working on those frameworks). Plus, they have been supporting Unicode
for ages already! I'm sure we can learn something from their
experience.


Both Java & Qt use UTF-16 internally.

Daniël


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Martin Schreiber
On Friday 26 September 2008 12.30:27 Marco van de Voort wrote:
 In our previous episode, Martin Schreiber said:
  Hmm, you should ask the Russian users for example if they prefer MSEgui
  utf-16 internal encoding or Lazarus utf-8.

 Users always look short term, and want to change as little as possible.

 This goes both for UTF-16 (with the is UCS2 approximation and keep the
 old ways of string indexing) as for UTF-8 (as superset of ansi, avoidance
 of multiple file types (no endianess)).

 Note that e.g. source ocde seems to go en masse in the direction of UTF-8
 (Even Tiburon, which works exclusively on Windows, an UTF-16 platform,
 saves source default to UTF8 afaik).

As does MSEide, source code is stored in the current locale encoding or utf-8, 
the latter is preferred. MSEgui stores ini files and the like in utf-8, the 
form definition files (*.mfm, the MSEgui equivalent of Delphi *.dfm) are pure 
ASCII.
For DB access MSEgui converts from utf-8 or the current locale encoding to 
utf-16 while fetching the data from server and converts back to utf-8 or the 
locale encoding before writing data to the server. There is a switch in the 
connection and dataset components to select either utf-8 or the locale 
encoding. Strings in the dataset buffer are stored as variable length utf-16 
strings.
All this can be done with the standard widestring facilities currently 
available in FPC 2.2.2. I have no problem if the FPC RTL supports the system 
encoding only; MSEgui has the commonly used interfaces to the filesystem and 
other services with widestring parameters. If something is missing, it can be 
added to the MSEgui library.
But for the internal character encoding that users must work with, utf-16 is 
better suited than utf-8. I am happy that FPC will support a reference-counted 
widestring type on Windows in future releases.

Martin


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell



How do other
frameworks handle string encodings etc
With .NET/Mono I suppose you don't need to bother. But I suppose this is 
one of the reasons that strings are constant once they have been assigned 
a value, and you can't do things like s[n] := 'x'.


-Michael



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Michael Schnell said:
  need to be 32 bit if surrogates were allowed in Widestrings).
  
 How to squeeze a value  $ in a 16 Bit value ?
 
 Can you magically store two bits in a single hardware cell ?

As said before, unicode is more than just expanding the range of characters.
The whole concept of character based parsing must be limited as much as
possible, since aside from encoding related conditions, there are also a lot
of language related issues.

IOW making an app support multiple languages is more than mapping in the
characters.


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Daniël Mantione said:
  Taking a step back from Free Pascal and Tiburon How do other
  frameworks handle string encodings etc... Frameworks like Java, Qt
  etc... Can't we learn something from them as well?  Both Java and Qt
  run on multiple platforms, read/write to files, do string manipulation
  etc  I don't know those frameworks well, but they have huge
  developer base and backed by huge companies (with plenty of developers
  working on those frameworks). Plus, they have been supporting Unicode
  for ages already! I'm sure we can learn something from their
  experience.
 
 Both Java and QT use UTF-16 internally.

Afaik Java and .NET (C#) also have the feature that for character-based
access you need to use a different type (a builder type such as
StringBuilder).

This means they can have separate internal encodings for the base string
type and for the character-based editing string types.
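
The same split can be sketched in Python, whose str type is also immutable.
This is only an analogue, not Java or .NET code, and replace_char is a
made-up helper: editing goes through a separate mutable container and the
result is converted back to a string.

```python
def replace_char(s: str, index: int, ch: str) -> str:
    """Return a copy of s with the element at index replaced by ch."""
    builder = list(s)          # mutable "builder" representation
    builder[index] = ch        # in-place edit is allowed here
    return "".join(builder)    # back to the immutable string type


def direct_assignment_fails(s: str) -> bool:
    """Show that the base string type rejects s[n] := 'x' style edits."""
    try:
        s[1] = "x"  # type: ignore[index]
    except TypeError:
        return True
    return False


edited = replace_char("hello", 1, "x")
```

Because only the builder is indexed, the base string type is free to pick
whatever internal encoding suits it.
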


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Michael Schnell
Nonetheless a type to hold a single character needs to exist. And it 
needs to be a 32 bit type if you want to store more than 2^16 different 
values (as is possible with UTF-8 and UTF-16, but not with UCS-2).
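
A small sketch of why 16 bits are not enough (Python, with a made-up helper 
name): a code point above U+FFFF is split by UTF-16 into two 16 bit units, 
following the surrogate arithmetic from the Unicode standard.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point fits in one 16 bit unit or is invalid")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the 16 bit range.
pair = to_surrogates(0x1D11E)
encoded = chr(0x1D11E).encode("utf-16-be")
```

The pair matches what Python's own UTF-16 encoder produces, so one 
"character" here occupies two 16 bit cells.
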


-Michael


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Sergei Gorelkin

Graeme Geldenhuys wrote:


Has anybody else got sample test code that clearly shows the claimed
significant speed gain in using UTF-16 for Windows API's?  If so,
could you please post the code and your comparative results (timing
values).  I think most people's perception was that the ANSI API's would be
slower, but nobody really bothered to actually prove that they were.


Such testing is pretty much useless, because the speed of any real 
program depends on what this program is doing.
However, since I've done intensive benchmarking while developing the XML 
parser, here are some results:


Parsing an XML file with 1 million chars, 25000 elements and 18000 text 
nodes is about 10% faster for UTF-16 than for UTF-8 (even though the byte 
count is twice as large).
At the same time, the parsing itself takes only about 30% of total time. 
The rest is spent in the memory manager. The memory manager usage 
pattern also matters: it works faster when you only allocate memory than 
when you allocate, free, then allocate again.


The speed of string conversion itself might be unnoticeable on modern 
CPUs, but remember that each conversion is at least two memory manager 
calls, plus a guarding exception frame.
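
For anyone who wants to measure this themselves, a minimal timing harness 
could look like the sketch below. It is Python, not FPC, so it says nothing 
about the FPC memory manager or exception frames; the function names and the 
sample data are invented for this example, and no results are claimed.

```python
import timeit


def utf8_to_utf16(data: bytes) -> bytes:
    """One conversion as needed before calling a UTF-16 API."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data: bytes) -> bytes:
    """The conversion back after the API call returns."""
    return data.decode("utf-16-le").encode("utf-8")


def time_conversions(sample: bytes, calls: int = 1000) -> float:
    """Total seconds spent on `calls` round trips UTF-8 -> UTF-16 -> UTF-8."""
    return timeit.timeit(
        lambda: utf16_to_utf8(utf8_to_utf16(sample)), number=calls
    )


# ASCII-heavy sample, similar in spirit to XML markup.
sample = ("<item id='1'>text</item>" * 100).encode("utf-8")

if __name__ == "__main__":
    print("seconds for 1000 round trips:", time_conversions(sample))
```

As with the parser numbers above, what such a loop reports depends heavily 
on the allocation pattern of the surrounding program.
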



Regards,
Sergei


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Marco van de Voort
In our previous episode, Michael Schnell said:
  Is UTF-16 Widestring in FPC (and Delphi 200x ? ) not done just ignoring the
  surrogates ?
 
  Lets hope not, 
 I don't think full UTF-16 really would be desirable over UCS-2.
 
 Imagine you have a string of some million characters (e.g. a book). All 
 functions that need to find the n-th character (like x[n], copy, ...) 
 would take forever, as they need to scan the complete string (unless 
 widestring is a rather complex tree-like format).

The solution to that is to isolate such code and treat it differently from
the rest, not to mutilate the Unicode standard.
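
To make the cost being discussed concrete, here is a sketch (Python, with
made-up helper names) of finding the n-th code point in an array of UTF-16
code units. With surrogates allowed, the lookup has to scan from the start;
plain array indexing returns a code unit instead.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of 16 bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def nth_code_point(units: list, n: int) -> int:
    """Return code point number n (0-based) by scanning from the start."""
    i = 0
    count = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            value = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            step = 2
        else:
            value = unit
            step = 1
        if count == n:
            return value
        count += 1
        i += step
    raise IndexError("code point index out of range")


units = utf16_units("a\U0001D11Eb")
```

Here units[2] is the low surrogate, not "b", while nth_code_point(units, 2)
finds "b" only after walking past the pair. Isolating this kind of scan in
the few routines that need it is the alternative to pretending the string
is UCS-2.
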


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-26 Thread Mattias Gaertner
On Fri, 26 Sep 2008 13:20:57 +0200
Michael Schnell [EMAIL PROTECTED] wrote:

 Nonetheless a type to hold a single character needs to exist. And
 same needs to be a 32 bit type if you want to store more than 2^16
 different values (as possible with UTF-8 and UTF-16 but not with
 UCS-2.

Some characters are encoded as several Unicode characters. For example,
a German a-umlaut is encoded under Mac OS X HFS as 2 characters =
1+2 bytes in UTF-8 and 2+2 bytes in UTF-16. This is not some Egyptian or
Klingon, but normal German, Finnish, French, etc. A plain
s[i]:='x' doesn't work in UTF-8, nor in UTF-16, nor in UTF-32.

In short:
A single character for all purposes cannot be defined. Unicode cannot
be handled as an array of characters.
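
A short Python sketch of the a-umlaut case. It uses the standard NFD
normalization as a stand-in for the decomposed form (HFS uses its own
variant of decomposition, so treat this as an approximation):

```python
import unicodedata

composed = "\u00e4"                                   # a-umlaut, one code point
decomposed = unicodedata.normalize("NFD", composed)   # 'a' + combining diaeresis

utf8_len = len(decomposed.encode("utf-8"))       # 1 + 2 bytes
utf16_len = len(decomposed.encode("utf-16-le"))  # 2 + 2 bytes

# Overwriting "the character" at index 0 leaves the combining mark behind,
# so the result is an x with a diaeresis rather than a plain x.
chars = list(decomposed)
chars[0] = "x"
result = "".join(chars)
```

Even though Python indexes strings by code point (comparable to UTF-32),
the element-wise assignment still does not replace the whole visible
character.
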

The choice for UTF-8 or UTF-16 depends mostly on the used libraries
and compatibility. The more Unicode features you want to support, the
less important the encoding becomes.

The encoding can be important for speed:
For example the widestring xml parser is up to 10 times slower than
the ansistring xml parser.

Mattias


Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-25 Thread Florian Klaempfl

Graeme Geldenhuys schrieb:

Hi,

Yes I know we have had lengthy discussions about this before.
Everybody (whoever they might be) keeps saying that UTF-16 was chosen
for Tiburon's UnicodeString because it makes significant speed gains
when calling the Windows API based on UTF-16 - compared to the ANSI
API's. 


Who says that? UTF-16 is simply chosen because it has features 
(supporting all characters basically) ANSI doesn't?



Re: [fpc-devel] Is calling the Windows Unicode APIs really faster than the ANSI API's?

2008-09-25 Thread JoshyFun
Hello Graeme,

Thursday, September 25, 2008, 9:50:04 PM, you wrote:

GG Yes I know we have had lengthy discussions about this before.
GG Everybody (whoever they might be) keeps saying that UTF-16 was chosen
GG for Tiburon's UnicodeString because it makes significant speed gains
GG when calling the Windows API based on UTF-16 - compared to the ANSI
GG API's. The whole debate goes that you wouldn't need constant
GG conversions between ANSI-UTF-16-ANSI.  Now it seems Free Pascal
GG developers want to base their design on those results as well (yes,
GG plus the whole compatibility thing)

They are not talking about ANSI strings (those are not Unicode
compatible) but about UTF-8 strings. So the choice is UTF-8 or UTF-16 (or
UTF-32 of course), and on Windows UTF-16 is faster because no conversion
is needed to call the API: zero time against some amount of time. Of
course, on other OSes things change.

-- 
Best regards,
 JoshyFun
