[Lazarus] cwstring in arm-linux


On 10/21/2011 10:24 PM, Hans-Peter Diettrich wrote:



The bad news about new Delphi strings is the addition (overload) of 
functions with RawByteString arguments, which *take* strings of any 
encoding, but then *ignore* that encoding. These versions certainly 
must fail for all MBCS encodings :-(



Is there an agreement on how this should work?

function parameter type ... Unicode string type coming in ... actual 
encoding ID of string coming in - conversion?


 1) RawByte ... any ... any - supposedly no, supposedly keeping the 
encoding ID

 2) not Raw ... Raw ... $ (=RAW) - ???
 3) not Raw ... same ... matching - obviously no
 4) not Raw ... same ... not matching (maybe $) - this is a hybrid 
string (type and encoding ID disagree); what to do?

 5) not Raw ... Raw ... matching the parameter type - supposedly no
 6) not Raw ... Raw ... not $ but not matching the parameter type 
- supposedly yes
 7) not Raw ... not Raw but different ... not $, matching its own 
type - supposedly yes
 8) not Raw ... not Raw but different ... not $, matching the type 
of the parameter - this is a hybrid string; not converting it 
would cure this.
 9) not Raw ... not Raw but different ... not $, matching neither - 
this is a hybrid string; what to do?
10) not Raw ... not Raw but different ... $ - this is a hybrid 
string; what to do?


Did I forget any cases?
-Michael

--
___
Lazarus mailing list
Lazarus@lists.lazarus.freepascal.org
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus

Re: [Lazarus] cwstring in arm-linux


+1 to all points.

-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 11:48 PM, Hans-Peter Diettrich wrote:




What if a file on the user computer has 4byte [visible] character as 
8th character and you, for example want to get 8 character file name? 
In this case you split that 4 byte character and have garbage.


For me (and for Linux) a file name does not at all consist of visible 
characters but is just a sequence of bytes. (AFAIK, with Ext3 any byte 
is allowed but Zero and /).


How this byte array is presented on a screen, printed out or obtained 
from a keyboard is just up to the program that communicates with the 
user. Thus it _might_ handle the byte string as if it were UTF-8 
(unless it does not match the appropriate rules), as locale-based ANSI 
or whatever.


For the presentation it of course needs to adhere to the API definition 
of the WidgetSet used. But this in fact has no relation to how the file 
system works or to what the file name really means.
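
A minimal sketch of the "file name is just bytes" view, written in Python purely as an illustration (it is not FPC or Lazarus code, and the sample name is made up). It assumes the program chooses to present names as UTF-8 and needs to round-trip names that are not valid UTF-8 without losing bytes:

```python
# Hypothetical file name as stored on disk: Latin-1 bytes, not valid UTF-8.
raw_name = b"caf\xe9.txt"


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the byte sequence happens to follow the UTF-8 rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# For presentation, undecodable bytes are mapped to lone surrogates
# (the "surrogateescape" error handler) so that nothing is lost.
shown = raw_name.decode("utf-8", errors="surrogateescape")

# Converting back yields exactly the original bytes.
restored = shown.encode("utf-8", errors="surrogateescape")

print(is_valid_utf8(raw_name))   # False
print(restored == raw_name)      # True
```

The point of the design is that the interpretation (UTF-8, locale-based ANSI, ...) is a presentation choice, while the bytes handed back to the file system stay untouched.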


-Michael


Re: [Lazarus] cwstring in arm-linux

2011-10-24 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 10/21/2011 10:24 PM, Hans-Peter Diettrich wrote:



The bad news about new Delphi strings is the addition (overload) of 
functions with RawByteString arguments, which *take* strings of any 
encoding, but then *ignore* that encoding. These versions certainly 
must fail for all MBCS encodings :-(



Is there an agreement on how this should work ?

function parameter type    Unicode string type coming in ... actual 
encoding ID of string coming in  - conversion


I'm not sure what you mean. Whenever a UnicodeString is passed as an 
argument, the UnicodeString version is called, not the RawByteString 
version.


When only AnsiString types are passed as parameters, the RawByteString 
version is called, and has to deal with possibly different encodings. 
The Delphi implementations simply ignore any encoding, so that the 
results are almost unusable :-(


In the AnsiStrings and StrUtils units another set of overloaded 
procedures is provided, for native AnsiString(CP_ACP) arguments. These 
versions are called only for all-native AnsiString arguments, so that no 
conversions are required.



 1) RawByte ... any .. any - supposedly no, supposedly keeping the 
encoding ID

 2) not Raw ...  Raw ... $ (=RAW) - ???
 3) not Raw ... same  ... matching  - obviously No
 4) not Raw ... same  ... not matching (maybe $) - this is a hybrid 
string (type and encoding ID disagree); what to do?

[...]

Please note that the RawByteString type is not an intended type for 
variables, only for subroutine arguments. Strings of that type are 
either empty, with no encoding, or they hold the last assigned string, 
including its encoding.


DoDi



Re: [Lazarus] cwstring in arm-linux


On 10/24/2011 01:34 PM, Hans-Peter Diettrich wrote:


I'm not sure what you mean. Whenever a UnicodeString is passed as an 
argument, the UnicodeString version is called, not the RawByteString 
version.


I'm not speaking about any existing procedures, but about those that 
somebody might write and that thus do not have overloaded versions.


-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/24/2011 01:34 PM, Hans-Peter Diettrich wrote:
Please note that the RawByteString type is not an intended type for 
variables, only for subroutine arguments.
I'm not speaking about how anything is intended, but asking for a 
definition of what the compiler is to do when these cases are detected at 
compile time and at run time.


If it is syntactically possible there needs to be a definition on what 
will happen.


-Michael


Re: [Lazarus] cwstring in arm-linux

2011-10-22 Thread Žilvinas Ledas

Hi,

On 2011-10-22 00:48, Hans-Peter Diettrich wrote:

Žilvinas Ledas wrote:

Hello,

On 2011-10-21 10:43, Michael Schnell wrote:
Of course you are right, but Move and friends are hardware-near
programming for those who know what they are doing. But basic
(legacy) string operations like myChar := myString[i] are
office-level programming and thus should work as a dummy expects.

What if a file on the user computer has 4byte [visible] character as
8th character and you, for example want to get 8 character file name?
In this case you split that 4 byte character and have garbage.

Then you (or your boss) didn't understand the meaning of 4
characters. (Logical) characters are different from physical Chars,
in every MBCS codepage.
I know that logical characters are different from physical ones. I was
trying to make the point that even using UTF16 you MUST check any string
coming from the outside world.

What if the user inputs in your text field (or a command line parameter
or anywhere else) a string containing 4 byte character and you split
that string on that character? (For example when showing some kind of
summary of his input.) Don't forget that user can input characters by
copy-pasting them from the web, not only using his keyboard!

See above. With proportional fonts, counting characters is a bad idea,
instead the width of the displayed string (in pixels) should be used.
Then you also can deal with languages and character sets, which use
ligatures and the like. Even with monospaced fonts the characters
(glyphs) can have a different width, in multiples of the basic width,
e.g. for Chinese or other eastern character sets.
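
A rough sketch of the "width, not character count" idea for the monospaced case, in Python for illustration only. It assumes that wide and fullwidth East Asian characters occupy two cells and that combining marks occupy none; `display_cells` is a made-up helper name. With proportional fonts the width would instead have to come from the widget set's text measuring call.

```python
import unicodedata


def display_cells(text: str) -> int:
    """Approximate the number of monospaced cells a string occupies."""
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the preceding character
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        cells += 2 if wide else 1
    return cells


print(display_cells("abc"))        # 3
print(display_cells("\u4e2d\u6587"))  # 4: two wide CJK characters
print(display_cells("e\u0301"))    # 1: 'e' plus a combining acute accent
```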

So, if you want to write PROFESSIONAL software with any user input -
you must handle 4 byte characters at every place you get user input.

Counting characters then is a bad idea, see above.

Otherwise you leave a chance to get and show to the user garbage. Is
this really easier than using UTF8 everywhere?

My personal experience: I am maintaining (as a hobby project) a
multi-language dictionary program (a screenshot:
http://2.bp.blogspot.com/_3-IaodGIbVQ/TMHY-l9M4sI/Aak/AbtShWq0ZUQ/s1600/KZod_screen_win7.png

Great :-)

) and it involves quite a bit of [multilingual] string manipulation,
and when I did the migration from Delphi to Lazarus I didn't know about the
requirement that all (GUI) strings must be UTF8, and I had no problems
migrating! Yes, afterwards I tweaked some calls to RTL (mostly file
handling) functions that expected to get ANSI encoding, but this is
not a problem of UTF8, but of the RTL being (mostly) ANSI.

From which Delphi version did you migrate?
What encoding did you use in Delphi?

From Delphi 5.
Actually, it was quite a while ago and I do not remember now what I was
using :) I think it was a mix of ansi/wide/utf8 strings.

Regards,
Žilvinas Ledas


Re: [Lazarus] cwstring in arm-linux

2011-10-22 Thread Marco van de Voort

On Thu, Oct 20, 2011 at 09:55:25AM +0200, Martin Schreiber wrote:
  
  I suppose that there will be some kind of compiler switch allowing for 
  selecting whether or not to use the new string feature.
  
 That does not change the possibility that FPC compiler, RTL and FCL will be 
 less stable than now because of the greater complexity and because of the 
 possible new bugs.

There is a brand new fixes branch without those changes. Trunk has been
broken heavily before early in a cycle.


Re: [Lazarus] cwstring in arm-linux

On 2011-10-20 17:30, Luca Olivetti wrote:
 
 Additionally, 16 bits is enough to cover the BMP, Basic Multilingual
 Plane, which encompasses the majority of today's most widely used
 languages. Only when you get to more advanced codepoints in some of the
 far-eastern languages, or are needing to encode dead languages such as
 Egyptian hieroglyphics do you need more than 16 bits.


That is such a rubbish statement! More and more information is being
added outside the Unicode's BMP. Emoticons, Science and Maths symbols,
Map Symbols (often seen in GPS applications), Music notes etc etc. So
it's not just far-eastern or dead languages any more... Using the
Supplementary Planes of Unicode will become a lot more common in the near
future. So UTF-16's use of surrogate pairs will become more
commonplace. And this is where UTF-8 will shine once again, because nothing
will need changing in the programmer's code - selecting a BMP or
Supplementary code point is identical. Programmers using UTF-16 often
don't bother checking for surrogate pairs, treating UTF-16 like UCS2 -
BIG MISTAKE!  This is why I think UTF-8 is a much safer choice.



Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux

On 2011-10-21 00:20, Hans-Peter Diettrich wrote:
 your legacy code can assume that every (visible) character is a Char, in 
 an SBCS codepage, this is not different in UTF-16.

Rookie mistake!!! You forgot surrogate pairs in UTF-16. Think outside
the Unicode BMP where a visible character will be 4-bytes, thus two
UTF-16 Char values. As I mentioned earlier, most programmers using
UTF-16 treat it like UCS2, forgetting that they need to check for
surrogate pairs too.

Now in UTF-8, this is not a problem at all. Finding a visible character
in the BMP or Supplementary Plane is an identical process, no special
checking is required. Thus making UTF-8 much easier and safer to use.
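
To make the sizes in this argument concrete, here is a small Python illustration (not FPC code; `units` is a made-up helper). It counts how many storage units one BMP code point and one Supplementary Plane code point need in each encoding:

```python
def units(ch: str):
    """Return (UTF-8 bytes, UTF-16 code units) for a single code point."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


bmp = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(units(bmp))     # (2, 1)
print(units(astral))  # (4, 2): a surrogate pair in UTF-16
```

In both encodings a single code point can occupy more than one unit; the difference is only how often that happens for a given text.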

I've ported enough Delphi code to FPC + fpGUI where UTF-8 is used for
Unicode support. I fully agree with Felipe, using UTF-8 is much easier
with legacy code than UTF-16.

Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 04:34 PM, Felipe Monteiro de Carvalho wrote:


Length does not give the number of chars? No problem:

As said: Of course this is no problem for those who are aware that 
they are dealing with Unicode and not with displayed characters. This of 
course includes myself when doing new code from scratch. But this does 
not include the persons I mentioned before and it does not include me 
when porting legacy code.


-Michael




Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 09:42 PM, Felipe Monteiro de Carvalho wrote:


Changing the size of Char is not just small detail, this breaks *a
lot* of code. Any kind of memory operations such as Move will fail
because the char size changed.
Of course you are right, but Move and friends are hardware-near 
programming for those who know what they are doing. But basic (legacy) 
string operations like myChar := myString[i] are office-level 
programming and thus should work as a dummy expects.


-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 09:03 AM, Graeme Geldenhuys wrote:

You forgot surrogate pairs in UTF-16. Think outside
the Unicode BMP where a visible character will be 4-bytes, thus two
UTF-16 Char values.
Regarding this, there seemingly is no help at all :( (I understand that 
even in full 32-bit Unicode there are such pairs, forming characters that 
may also be defined as a single, completely different 32-bit code point.)


But in fact, up till now I have never come across any situation requiring 
non-BMP encoding.


-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 08:56 AM, Graeme Geldenhuys wrote:


That is such a rubbish statement! More and more information is being
added outside the Unicode's BMP. Emoticons, Science and Maths symbols,
Map Symbols (often seen in GPS applications), Music notes etc etc.

Those who deal with this of course need to know what they are doing.

But those who use normal non-English (non-ASCII) languages (I understand 
that this is what BMP means) and don't want to deal with such things 
are (maybe unnecessarily) forced into hell by the UTF-8-in-AnsiString 
paradigm.


-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 05:49 PM, Mattias Gaertner wrote:


Often they say: Linux has problems with unicode. Reason: teachers 
think that unicode is so simple under java, so they don't explain it.
I see. Obviously a similar problem as with Delphi. (If E. in fact (like 
promised some time ago) creates a Delphi that compiles for Linux, I am 
curious how this is handled.)


If you have students that stupid, then don't tell them about the [] 
operator.



:) :) :)

-Michael

Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 10:26 PM, Felipe Monteiro de Carvalho wrote:

Mac OS X uses the decomposed form in UTF-8 to store filenames, which
is rather unpleasant.


Why are they so silly ?

-Michael


Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-21 00:20, Hans-Peter Diettrich wrote:
your legacy code can assume that every (visible) character is a Char, in 
an SBCS codepage, this is not different in UTF-16.


Rookie mistake!!! You forgot surrogate pairs in UTF-16.


Which Ansi characters translate into surrogate pairs?



Now in UTF-8, this is not a problem at all. Finding a visible character
in the BMP or Supplementary Plane is an identical process, no special
checking is required. Thus making UTF-8 much easier and safer to use.


Please specify "Finding"; a code snippet would be nice.



I've ported enough Delphi code to FPC + fpGUI where UTF-8 is used for
Unicode support. I fully agree with Felipe, using UTF-8 is much easier
with legacy code than UTF-16.


This only demonstrates that UTF-16 has not been supported sufficiently 
in FPC, until now. Give an example of UTF-8 code, which would become 
*more* complicated with UTF-16.


DoDi



Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-20 17:30, Luca Olivetti wrote:

Additionally, 16 bits is enough to cover the BMP, Basic Multilingual
Plane, which encompasses the majority of today's most widely used
languages. Only when you get to more advanced codepoints in some of the
far-eastern languages, or are needing to encode dead languages such as
Egyptian hieroglyphics do you need more than 16 bits.



That is such a rubbish statement! More and more information is being
added outside the Unicode's BMP. Emoticons, Science and Maths symbols,
Map Symbols (often seen in GPS applications), Music notes etc etc.


Now also tell us how application code is affected by such astral 
codepoints, and how these are handled easier in UTF-8 than in UTF-16.


DoDi



Re: [Lazarus] cwstring in arm-linux

On 2011-10-21 09:50, Michael Schnell wrote:
 
 But in fact I up til now never came across any situation requiring 
 non-BMP encoding.

We use the Science and Maths symbols defined outside the BMP all the time
in our products. Why use images (old school style) when font symbols
(today's style) can do the exact same thing, but easier? Plus you get
the benefit that you can copy and paste text with math or science symbols
without problems.


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux

On 2011-10-21 10:19, Hans-Peter Diettrich wrote:
 
 Please specify "Finding"; a code snippet would be nice.

Knock yourself out...


https://github.com/graemeg/fpGUI/blob/master/src/corelib/fpg_stringutils.pas


Take a look at UTF8Copy() or UTF8Insert() etc.


 in FPC, until now. Give an example of UTF-8 code, which would become 
 *more* complicated with UTF-16.

Consider a Copy() type function where you want to copy a Unicode
codepoint (think single character as you see on the screen - ignoring
combining diacritics for now) out from a string. UTF8Copy() as defined
above will do that correctly, irrespective if the codepoint is in the
BMP or Supplementary Plane or if the character is represented by 1,2,3
or 4 bytes in length.

With UTF-16 you need to check if the UTF-16 string is Little Indian or
Big Indian (UTF-16BE or UTF-16LE), whether the codepoint has a surrogate
pair or not. All in all, a lot more complex than UTF-8.
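
A minimal sketch of what a code-point-aware copy over UTF-8 bytes can look like, in Python for illustration. This is not fpGUI's UTF8Copy(); the name `utf8_copy`, the 1-based index and the clamping behaviour are assumptions made for this example, and the input is assumed to be well-formed UTF-8. Combining diacritics are ignored, as in the paragraph above.

```python
def utf8_copy(data: bytes, start: int, count: int) -> bytes:
    """Copy `count` code points beginning at 1-based code point index `start`."""
    # A code point starts at every byte that is not a continuation byte
    # (continuation bytes have the bit pattern 10xxxxxx).
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))      # sentinel: end of the last code point
    total = len(starts) - 1       # number of code points in `data`

    first = max(start, 1) - 1     # convert to a 0-based code point index
    if count <= 0 or first >= total:
        return b""
    last = min(first + count, total)
    return data[starts[first]:starts[last]]


# 'a', U+00E9 (2 bytes), U+4E2D (3 bytes), U+1D11E (4 bytes), 'z'
sample = "a\u00e9\u4e2d\U0001D11Ez".encode("utf-8")

print(utf8_copy(sample, 2, 3).decode("utf-8") == "\u00e9\u4e2d\U0001D11E")  # True
print(len(utf8_copy(sample, 4, 1)))                                        # 4
```

The same logic works for 1-, 2-, 3- and 4-byte sequences because only the lead-byte test is used, never the length of an individual sequence.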


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 10:09 AM, Graeme Geldenhuys wrote:


We use the Science and Maths symbols defined outside the BMP all the time
in our products.
So with these projects you obviously are a Unicode-aware programmer 
and don't qualify for the group of office programmers that (IMHO) 
should be enabled to do their stuff without being forced to think about 
displayable character encoding stuff.


-Michael



Re: [Lazarus] cwstring in arm-linux

On 2011-10-21 10:31, Michael Schnell wrote:
 So with these projects you obviously are a Unicode aware programmer 
 and don't qualify for the group of office programmers that (IMHO) 


I don't have to think about anything special when working with Unicode
text. I simply use the string manipulation function as defined in the
fpGUI framework, and not the ones defined in FPC's RTL. The fpGUI
framework handles the rest for me.

I'm even considering renaming all the UTF8xxx string manipulation
functions in fpGUI to something like fpg  (eg: fpgCopy() or
fpgInsert()) because I really don't think the programmer needs to know
that fpGUI uses UTF-8 internally. If you use the string functions as
defined in fpGUI, your code will work.


Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux

On 2011-10-21 10:22, Hans-Peter Diettrich wrote:
 
 Now also tell us how application code is affected by such astral 
 codepoints, and how these are handled easier in UTF-8 than in UTF-16.


So as not to repeat myself, see one of my other replies.



Regards,
  - Graeme -

-- 
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/



Re: [Lazarus] cwstring in arm-linux

2011-10-21 Thread Žilvinas Ledas


Hello,

On 2011-10-21 10:43, Michael Schnell wrote:
Of course you are right, but Move and friends are hardware-near 
programming for those who know what they are doing. But basic (legacy) 
string operations like myChar := myString[i] are office-level 
programming and thus should work as a dummy expects.




What if a file on the user computer has 4byte [visible] character as 8th 
character and you, for example want to get 8 character file name? In 
this case you split that 4 byte character and have garbage.
What if the user inputs in your text field (or a command line parameter or 
anywhere else) a string containing 4 byte character and you split that 
string on that character? (For example when showing some kind of summary 
of his input.) Don't forget that user can input characters by 
copy-pasting them from the web, not only using his keyboard!


So, if you want to write PROFESSIONAL software with any user input - you 
must handle 4 byte characters at every place you get user input. 
Otherwise you leave a chance to get and show to the user garbage. Is 
this really easier than using UTF8 everywhere?
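
The truncation problem described above can be reproduced in a few lines. This is a Python illustration with a made-up file name whose 8th character is a 4-byte code point; the helper names are hypothetical:

```python
def truncate_bytes(name: str, limit: int) -> bytes:
    """Naive truncation: cut the UTF-8 byte string after `limit` bytes."""
    return name.encode("utf-8")[:limit]


def truncate_code_points(name: str, limit: int) -> str:
    """Truncate after `limit` code points instead of bytes."""
    return name[:limit]


name = "report_" + "\U0001F600" + ".txt"

raw = truncate_bytes(name, 8)
try:
    raw.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False   # the cut landed inside the 4-byte sequence

print(valid)                                   # False
print(len(truncate_code_points(name, 8)))      # 8
```

Truncating by code points still ignores combining sequences, so it is a sketch of the minimum check, not a complete solution.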


My personal experience: I am maintaining (as a hobby project) a 
multi-language dictionary program (a screenshot: 
http://2.bp.blogspot.com/_3-IaodGIbVQ/TMHY-l9M4sI/Aak/AbtShWq0ZUQ/s1600/KZod_screen_win7.png 
) and it involves quite a bit of [multilingual] string manipulation, and 
when I did the migration from Delphi to Lazarus I didn't know about the 
requirement that all (GUI) strings must be UTF8, and I had no problems 
migrating! Yes, afterwards I tweaked some calls to RTL (mostly file 
handling) functions that expected to get ANSI encoding, but this is not 
a problem of UTF8, but of the RTL being (mostly) ANSI.



Regards,
Žilvinas Ledas


Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 02:18 PM, Žilvinas Ledas wrote:


What if a file on the user computer has...
If you deal with the content of files you did not write yourself, you of 
course need to deal with whatever encoding they were written in (maybe 
it's EBCDIC :) ). This is unavoidable, and if you are unlucky enough that 
you need to consider that the file is in Unicode, you of course need to 
upgrade to being a Unicode expert.


But if you just deal with the user's GUI input and output and with files 
that you wrote yourself in some default encoding that the language tools 
define, IMHO a decent language should do whatever is possible to hide the 
complexity.


As said: I'm not sure to what extent this is possible and whether Delphi 
does a good job here, so I don't intend to question any decision made by 
the Lazarus team, now (for FPC without new strings) or in future (with 
FPC with whatever implementation of a new string feature).


-Michael


Re: [Lazarus] cwstring in arm-linux

2011-10-21 Thread Michael Lutz

On 21.10.2011 15:02, Michael Schnell wrote:
 But if you just deal with the user's GUI input and output and with files 
 that you wrote yourself in some default encoding that the language tools 
 define, IMHO a decent language should do whatever is possible to hide the 
 complexity.

You'd advocate for fpc/Lazarus to normalize all incoming and outgoing file
names then? If you write a file with a file name in unicode NFC in OS X
and read the file name back from the OS, you'll get a NFD string returned,
which means a normalization-unaware compare function will not do what
you'd expect.
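
The NFC/NFD mismatch is easy to demonstrate. The following Python sketch is an illustration only: the file name is made up, `same_name` is a hypothetical helper, and the NFD string is produced with the standard library rather than read back from an actual OS X file system.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


written = "caf\u00e9.txt"                           # NFC: one code point for 'é'
read_back = unicodedata.normalize("NFD", written)   # 'e' + combining acute accent

print(written == read_back)            # False: plain comparison fails
print(same_name(written, read_back))   # True
print(len(written), len(read_back))    # 8 9
```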


Michael Lutz



Re: [Lazarus] cwstring in arm-linux

2011-10-21 Thread Michael Lutz

On 21.10.2011 10:00, Michael Schnell wrote:
 On 10/20/2011 10:26 PM, Felipe Monteiro de Carvalho wrote:
 Mac OS X uses the decomposed form in UTF-8 to store filenames, which
 is rather unpleasant.
 
 Why are they so silly ?

What's silly about that? If they'd store it in precomposed form (NFC)
instead, you still can't use a simple string compare unless you normalize
all strings. And even better, some characters used in some languages
simply have no precomposed form at all, which means you'll always have to
be prepared to handle characters composed of several Unicode code points.
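
A short Python illustration of the last point. It assumes the example pair 'q' plus COMBINING TILDE, for which Unicode defines no precomposed character, so NFC cannot shrink it to a single code point:

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT has a precomposed form (U+00E9).
e_acute = unicodedata.normalize("NFC", "e\u0301")

# 'q' + COMBINING TILDE has no precomposed form, so NFC leaves two code points.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(e_acute))  # 1
print(len(q_tilde))  # 2
```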

Michael Lutz



Re: [Lazarus] cwstring in arm-linux


On 10/21/2011 04:13 PM, Michael Lutz wrote:

If you write a file with a file name in unicode NFC in OS X
and read the file name back from the OS, you'll get a NFD string returned,
which means a normalization-unaware compare function will not do what
you'd expect.

...as already mentioned in another message in this thread OSX does 
really silly stuff regarding file names.


How does their object pascal handle this ?

-Michael


Re: [Lazarus] cwstring in arm-linux

2011-10-21 Thread Mattias Gaertner





Michael Schnell mschn...@lumino.de wrote on 21 October 2011 at 17:23:

 On 10/21/2011 04:13 PM, Michael Lutz wrote:
  If you write a file with a file name in unicode NFC in OS X
  and read the file name back from the OS, you'll get a NFD string returned,
  which means a normalization-unaware compare function will not do what
  you'd expect.
 
 ...as already mentioned in another message in this thread OSX does
 really silly stuff regarding file names. 
The normalization is not silly.  
The really silly thing is that by default their file system is case insensitive,
while many command line tools are not. 
 
 

 How does their object pascal handle this ?

I don't know what they did, but I know from Lazarus: it's not a big deal. Lazarus
has worked on OS X for years and only needed a function to compare file names,
which calls the OS X function. Of course Lazarus already supported Linux. It can
be hard to port a Windows application to OS X. 



Mattias

Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-21 10:31, Michael Schnell wrote:
So with these projects you obviously are a Unicode aware programmer 
and don't qualify for the group of office programmers that (IMHO) 



I don't have to think about anything special when working with Unicode
text. I simply use the string manipulation function as defined in the
fpGUI framework, and not the ones defined in FPC's RTL. The fpGUI
framework handles the rest for me.


Did you ever have a look at Delphi StrUtils.pas?
Why reinvent the wheel, when functions already exist for some purpose?

The Ansi prefix can be removed in an environment where the encoding of 
the string arguments is known (e.g. fixed to UTF-8).


The bad news about new Delphi strings is the addition (overload) of 
functions with RawByteString arguments, which *take* strings of any 
encoding, but then *ignore* that encoding. These versions certainly must 
fail for all MBCS encodings :-(


DoDi



Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-21 10:19, Hans-Peter Diettrich wrote:

Please specify "Finding"; a code snippet would be nice.


Knock yourself out...


https://github.com/graemeg/fpGUI/blob/master/src/corelib/fpg_stringutils.pas


Take a look at UTF8Copy() or UTF8Insert() etc.


I didn't mean the implementation, but the *task* to perform in 
application code.



in FPC, until now. Give an example of UTF-8 code, which would become 
*more* complicated with UTF-16.


Consider a Copy() type function where you want to copy a Unicode
codepoint (think single character as you see on the screen - ignoring
combining diacritics for now) out from a string.


Again, *why* would you ever want to do that? It sounds to me like 
extracting bits from floating point values :-(



UTF8Copy() as defined
above will do that correctly, irrespective if the codepoint is in the
BMP or Supplementary Plane or if the character is represented by 1,2,3
or 4 bytes in length.


Why restrict such a function to UTF-8? For working with *logical* 
characters a set of functions is needed, that do not rely on character 
indices. A StartIndex parameter IMO indicates bad design :-(
The functions can be easily overloaded to work with AnsiChar and 
WideChar string arguments, or even UCS4Char, if you like.



With UTF-16 you need to check if the UTF-16 string is Little Indian or
Big Indian (UTF-16BE or UTF-16LE),


This has to be done only on input from a file, where the encoding 
should be converted into the internal representation for every external 
encoding.
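
A sketch of doing the byte-order check once, at the input boundary, in Python for illustration. `decode_utf16_input` is a made-up helper, and treating data without a byte order mark as little-endian is an assumption of this example, not a rule from the thread:

```python
import codecs


def decode_utf16_input(data: bytes) -> str:
    """Convert external UTF-16 data to the internal string representation."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption for this sketch: no BOM means little-endian.
    return data.decode("utf-16-le")


big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")

print(decode_utf16_input(big) == decode_utf16_input(little) == "Hi")  # True
```

After this step the rest of the program never sees the byte order again.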


BTW, it's Endian, not Indian nor Chinese ;-)



whether the codepoint has a surrogate
pair or not. All in all, a lot more complex than UTF-8.


Sorry, UTF-8 and UTF-16 only provide different encodings for the same 
Unicode codepoints. Mixing Char and Codepoint indices and counts never 
is a good idea. With that in mind it's no problem to perform the same 
task on any encoding.
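
To illustrate that both encodings carry the same code points, here is a hedged Python sketch that recovers code points from UTF-16 code units by hand and compares the result with the UTF-8 route. The function names are hypothetical, and unpaired surrogates are simply passed through rather than reported:

```python
import struct


def code_points_from_utf16(units):
    """Combine surrogate pairs in a list of 16-bit code units into code points."""
    result = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            result.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            result.append(u)
            i += 1
    return result


def code_points_from_utf8(data: bytes):
    """Let the standard decoder do the work for UTF-8."""
    return [ord(ch) for ch in data.decode("utf-8")]


text = "a\U0001D11E"
raw16 = text.encode("utf-16-le")
units16 = list(struct.unpack("<%dH" % (len(raw16) // 2), raw16))

print(code_points_from_utf16(units16) == code_points_from_utf8(text.encode("utf-8")))  # True
```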


DoDi



Re: [Lazarus] cwstring in arm-linux


Michael Schnell wrote:

On 10/20/2011 09:42 PM, Felipe Monteiro de Carvalho wrote:


Changing the size of Char is not just small detail, this breaks *a
lot* of code. Any kind of memory operations such as Move will fail
because the char size changed.
Of course you are right, but Move and friends are hardware-near 
programming for those who know what they are doing. But basic (legacy) 
string operations like myChar := myString[i] are office-level 
programming and thus should work as a dummy expects.


Simple solution: use UTF-32 encoding :-)

It's only a matter of optimization: save memory with more compressed 
encodings, or save coding time?
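
A rough illustration of that trade-off, as a Python sketch (the sample text is an arbitrary choice for this example): the same seven codepoints are measured in three encodings, and the UTF-32 form is then indexed by plain arithmetic.

```python
# Sample text: 6 BMP characters plus one astral codepoint (U+1F600).
text = "Gr\u00fc\u00dfe \U0001F600"

# Bytes needed by each encoding (no BOM).
sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)

# In UTF-32 every codepoint is exactly one 4-byte unit, so the n-th
# codepoint sits at a fixed offset and needs no scanning.
utf32 = text.encode("utf-32-le")
n = 6  # zero-based index of the last codepoint
unit = int.from_bytes(utf32[4 * n:4 * n + 4], "little")
print(hex(unit))
```

For this text UTF-32 needs 28 bytes against 12 for UTF-8 and 16 for UTF-16; what it buys is the fixed-offset indexing in the last lines.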


DoDi



Re: [Lazarus] cwstring in arm-linux


Michael Lutz wrote:

On 21.10.2011 15:02, Michael Schnell wrote:
But if you just deal with the user's GUI input and output and with files 
that you wrote yourself in some default encoding that the language tools 
define, IMHO a decent language should do whatever possible to hide the 
complexity.


You'd advocate for fpc/Lazarus to normalize all incoming and outgoing file
names then?


Please distinguish between file names and content. Filenames are subject 
to platform conventions, with e.g. case sensitivity and directory 
separators. File content and encoding instead is fully up to the creator.


DoDi



Re: [Lazarus] cwstring in arm-linux

Žilvinas Ledas wrote:

Hello,

On 2011-10-21 10:43, Michael Schnell wrote:
Of course you are right, but Move and friends are hardware-near
programming for those who know what they are doing. But basic (legacy)
string operations like myChar := myString[i] are office-level
programming and thus should work as a dummy expects.

What if a file on the user computer has 4byte [visible] character as 8th
character and you, for example want to get 8 character file name? In
this case you split that 4 byte character and have garbage.

Then you (or your boss) didn't understand the meaning of 4 characters.
(Logical) characters are different from physical Chars, in every MBCS
codepage.
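
To make the file name case concrete, here is a Python sketch (the file name is invented for the example): cutting after 8 physical bytes splits the 4-byte sequence, while cutting after 8 logical characters does not. Python's str indexes codepoints, which stands in here for a character-aware copy routine.

```python
# Invented file name: the 8th character is U+1F4C4, 4 bytes in UTF-8.
name = "report_\U0001F4C4.txt"
raw = name.encode("utf-8")

# Physical cut: 8 bytes = 7 ASCII bytes + the first byte of the 4-byte sequence.
cut_bytes = raw[:8]
try:
    cut_bytes.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False  # the truncated sequence is not valid UTF-8

# Logical cut: 8 codepoints, the last one kept whole.
cut_chars = name[:8]

print(split_ok, len(cut_chars), len(cut_chars.encode("utf-8")))
```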

What if the user inputs in your text field (or a command line parameter or
anywhere else) a string containing a 4 byte character and you split that
string on that character? (For example when showing some kind of summary
of his input.) Don't forget that the user can input characters by
copy-pasting them from the web, not only using his keyboard!

So, if you want to write PROFESSIONAL software with any user input - you
must handle 4 byte characters at every place you get user input.

Counting characters then is a bad idea, see above.

Otherwise you leave a chance to get and show to the user garbage. Is
this really easier than using UTF8 everywhere?

Great :-)

) and it involves quite a bit of [multilingual] string manipulation and
when I did the migration from Delphi to Lazarus I didn't know about the
requirement that all (GUI) strings must be UTF8 and I had no problems
migrating! Yes, afterwards I tweaked some calls to RTL (mostly file
handling) functions that expected to get ANSI encoding, but this is not
a problem of UTF8, but of the RTL being (mostly) ANSI.

From which Delphi version did you migrate?
What encoding did you use in Delphi?

DoDi


Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-20 17:30, Luca Olivetti wrote:

Additionally, 16 bits is enough to cover the BMP, Basic Multilingual
Plane, which encompasses the majority of today's most widely used
languages. Only when you get to more advanced codepoints in some of the
far-eastern languages, or need to encode dead languages such as
Egyptian hieroglyphics do you need more than 16 bits.



That is such a rubbish statement! More and more information is being
added outside the Unicode's BMP. Emoticons, Science and Maths symbols,
Map Symbols (often seen in GPS applications), Music notes etc etc.
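
For reference, the arithmetic behind "more than 16 bits" can be sketched as follows (Python; the function name to_utf16_units is made up for this example). A codepoint above U+FFFF, such as the musical G clef U+1D11E, is stored in UTF-16 as a surrogate pair.

```python
def to_utf16_units(cp: int) -> tuple:
    """Return the UTF-16 code unit(s) for one Unicode codepoint."""
    if cp < 0x10000:
        return (cp,)                      # BMP: a single 16-bit unit
    v = cp - 0x10000                      # 20 bits remain
    high = 0xD800 + (v >> 10)             # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits -> low surrogate
    return (high, low)

g_clef = to_utf16_units(0x1D11E)          # MUSICAL SYMBOL G CLEF
euro = to_utf16_units(0x20AC)             # EURO SIGN, inside the BMP
print([hex(u) for u in g_clef], [hex(u) for u in euro])
```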


What do you have in mind that your code would do with, e.g., music notes? 
Would it ever try to convert these into upper case, or to substitute 
parts of such strings by text???


You can get rubbish more easily, by random substitution of English words 
by German or Chinese ones...


Or by parsing C code with a Pascal parser...

DoDi



Re: [Lazarus] cwstring in arm-linux


Graeme Geldenhuys wrote:

On 2011-10-21 10:22, Hans-Peter Diettrich wrote:
Now also tell us how application code is affected by such astral 
codepoints, and how these are handled easier in UTF-8 than in UTF-16.



As to not repeat myself, see one of my other replies.


I didn't ask for a repetition of *what* you did, but for a serious 
reason *why* you want to do what.


You don't need Unicode for writing bogus applications, it only helps to 
demonstrate how stupid some ideas are ;-)


DoDi



Re: [Lazarus] cwstring in arm-linux

On Wednesday 19 October 2011 22.05:09 Vincent Snijders wrote:
 2011/10/19 Michael Van Canneyt mich...@freepascal.org:
  On Wed, 19 Oct 2011, Felipe Monteiro de Carvalho wrote:
  On 10/19/11, Vincent Snijders vincent.snijd...@gmail.com wrote:
  I guess Felipe gave up waiting on a Unicode RTL for the time being and
  goes for a full UTF8 pseudo RTL in LazUtils.
  
  But this does not mean that LazUtils would not be useful then. My
  proposals to add UTF-8 routines to the RTL and even FCL were rejected,
  
  Correction: Your proposals were not rejected.
 
 Thanks for the clarification.
 
  No decision as to which character sets will be used in the basic RTL has
  been taken. Any action you take now is therefore premature.
  
  So it was suggested you would wait till things settle down and the
  final shape of things is more clear.
 
 That is why I said: gave up waiting
 
I think it would have been better if Lazarus had made an RTL optimized for 
Lazarus a long time ago. Now the FPC team destroyed a stable product in favor of 
Delphi string compatibility. The introduction of multi-encoding strings is not 
a really good idea IMHO but more a marketing gag.
It seems that the Delphi architects are not absolutely happy with it either. 
Allen Bauer in:

https://forums.codegear.com/message.jspa?messageID=400258#400258


We have way, way too many different string types. It's confusing.

There are more interesting statements in that thread, e.g.:

https://forums.codegear.com/message.jspa?messageID=399964#399964

It is even possible that Delphi strings change again...

Martin


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 09:18 AM, Martin Schreiber wrote:

Now the FPC team destroyed a stable product in favor of
Delphi string compatibility.
I don't consider the current state of Lazarus stable, as IMHO the 
UTF8 in type ANSIString paradigm (seemingly forced by the underlying 
FPC version) is too special to stay.


As I don't have Delphi 2009 or later, I have no idea whether the new Delphi way 
to handle Unicode is desirable or even better at all. So this is not 
meant as a criticism whatsoever.


-Michael


Re: [Lazarus] cwstring in arm-linux

On Thursday 20 October 2011 09.32:05 you wrote:
 On 10/20/2011 09:18 AM, Martin Schreiber wrote:
  Now the FPC team destroyed a stable product in favor of
  Delphi string compatibility.
 
 I don't consider the current state of Lazarus stable,

I don't refer to Lazarus but to Free Pascal compiler, RTL and FCL.

Martin



Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 09:43 AM, Martin Schreiber wrote:

I don't refer to Lazarus but to Free Pascal compiler, RTL and FCL.

I suppose that there will be some kind of compiler switch allowing for 
selecting whether or not to use the new string feature.


-Michael



Re: [Lazarus] cwstring in arm-linux

 On 10/20/2011 09:43 AM, Martin Schreiber wrote:
  I don't refer to Lazarus but to Free Pascal compiler, RTL and FCL.
 
 I suppose that there will be some kind of compiler switch allowing for 
 selecting whether or not to use the new string feature.
 
That does not change the possibility that FPC compiler, RTL and FCL will be 
less stable than now because of the greater complexity and because of the 
possible new bugs.

Martin



Re: [Lazarus] cwstring in arm-linux

Martin Schreiber wrote:

 On 10/20/2011 09:43 AM, Martin Schreiber wrote:
  I don't refer to Lazarus but to Free Pascal compiler, RTL and FCL.
 
 I suppose that there will be some kind of compiler switch allowing for
 selecting whether or not to use the new string feature.
 
 That does not change the possibility that FPC compiler, RTL and FCL will
 be less stable than now because of the greater complexity and because of
 the possible new bugs.
 
Hmm, "now" is wrong, it should be "before the cpstrnew merge"...

Martin



Re: [Lazarus] cwstring in arm-linux

2011-10-20 Thread Žilvinas Ledas


Hi,

On 2011-10-20 00:03, Felipe Monteiro de Carvalho wrote:

Hello,

2011/10/19 Žilvinas Ledas zilvinas.le...@dict.lt:

I am a native Lithuanian, so I think I can help at least by providing info, but I
must understand what the problem is first.

I am mostly interested in LowerCase / UpperCase. Could you explain how
it works in Lithuanian and provide test cases for it?

Test cases should be in this format:

   AssertStringOperationUTF8LowerCase('Unicode 0460 UTF8LowerCase', '',
'ѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯ', 'ѡѡѣѣѥѥѧѧѩѩѫѫѭѭѯѯ');

Even better if they are in patches to the file
lazarus/tests/lazutils/testunicode.pas

First param is the label, the second the locale (in this case maybe
something like 'lt'; what is the ISO identifier for Lithuanian?), then
UpperCase and then LowerCase.

And try to make some tricky tests, to defeat partial implementations.
3 tests for lowercase and 3 for uppercase should be enough.

This should be easy. This week I have a lot of things to do but I'll try 
to look into this next week!
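
The shape of such a test case can be sketched outside Pascal as well. The following Python lines rebuild the U+0460 example quoted above; check_lowercase is a hypothetical stand-in for the assert helper, and Python's locale-independent str.lower() stands in for UTF8LowerCase, so locale-specific rules (such as Lithuanian ones) are not exercised here.

```python
def check_lowercase(label: str, locale: str, mixed: str, expected: str) -> bool:
    """Hypothetical stand-in for the assert helper; locale is ignored here."""
    actual = mixed.lower()
    if actual != expected:
        print(f"FAIL {label}: got {actual!r}, expected {expected!r}")
    return actual == expected

# U+0460..U+046F: capital/small pairs, capitals on the even codepoints.
mixed = "".join(chr(cp) for cp in range(0x0460, 0x0470))
# Expected result: every capital replaced by the small letter that follows it.
expected = "".join(chr(cp | 1) for cp in range(0x0460, 0x0470))

ok = check_lowercase("Unicode 0460 lowercase", "", mixed, expected)
print(ok)
```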



Regards,
Žilvinas Ledas


Re: [Lazarus] cwstring in arm-linux

2011-10-20 Thread Hans-Peter Diettrich


Michael Schnell wrote:

On 10/20/2011 09:18 AM, Martin Schreiber wrote:

Now the FPC team destroyed a stable product in favor of
Delphi string compatibility.
I don't consider the current state of Lazarus stable, as IMHO the 
UTF8 in type ANSIString paradigm (seemingly forced by the underlying 
FPC version) is too special to stay.


It's sufficient to agree that all (displayed) strings in the LCL contain 
UTF-8 text, regardless of their type name (string types currently are 
alias).


As I don't have Delphi 2009 or later, I have no idea whether the new Delphi way 
to handle Unicode is desirable or even better at all. So this is not 
meant as a criticism whatsoever.


Delphi allows for a UTF8String type, but this one is (or has to be) 
converted into UnicodeString all the time. The Delphi RTL only supports 
UnicodeString (UTF-16) and native AnsiStrings (of CP_ACP), all other 
encodings are not really supported, except for UTF-16 conversion. MBCS 
are supported only as far eastern DBCS, not for UTF-8 (I wonder what a 
Linux version will bring).


Functions with more than one (Ansi)String argument deserve special care, 
the *user* is responsible for supplying only strings of the same encoding, 
or has to force the use of the UnicodeString versions by e.g. typecasts.
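
What goes wrong when two byte strings of different encodings are combined without conversion can be shown with a few lines of Python (a sketch; Windows-1252 is picked here as an example of an Ansi codepage, and the variable names are invented):

```python
word = "Gr\u00fc\u00dfe"
ansi = word.encode("cp1252")   # 5 bytes, one per character
utf8 = word.encode("utf-8")    # 7 bytes, two each for the two non-ASCII letters

# Raw concatenation that ignores the encodings of its arguments.
mixed = ansi + utf8

# Read back as cp1252, the UTF-8 half turns into mojibake ...
as_ansi = mixed.decode("cp1252")

# ... and read back as UTF-8, the cp1252 half is not even valid.
try:
    mixed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Converting both arguments to one common encoding first avoids the problem.
fixed = (ansi.decode("cp1252") + utf8.decode("utf-8")).encode("utf-8")

print(as_ansi, valid_utf8, fixed.decode("utf-8"))
```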


I.e. it's highly discouraged to use any but CP_ACP or UTF-16 strings, 
except for corner cases (file I/O...).



When FPC follows the new Delphi model, the LCL has to be ported to all 
strings containing UTF-16 - everything else will not work properly or 
will cause many implicit conversions. This may require some work, and 
results in two incompatible versions (legacy Ansi/UTF-8 and new 
Unicode/UTF-16), and will not please all Linux (POSIX) users. IMO Linux 
is not a problem with the LCL, since the currently required UTF-8/16 
conversions with external function calls are negligible (on my Windows 
system).


DoDi



Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 01:55 PM, Hans-Peter Diettrich wrote:


It's sufficient to agree that all (displayed) strings in the LCL 
contain UTF-8 text, regardless of their type name (string types 
currently are alias).
And thus functions like pos(), length() and myString[i] work on UTF-8 
code bytes rather than on (displayed) characters.


Agreeable for those who just use ASCII (no Germans, ...) and 
those who have a decent knowledge of Unicode.


All others are fooled.

-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 01:55 PM, Hans-Peter Diettrich wrote:



Functions with more than one (Ansi)String argument deserve special 
care, the *user* is responsible to only supply strings of the same 
encoding,
Very funny! They invent a dynamically encoded string type that has the 
power to trigger conversions when necessary and abuse it.


-Michael


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 01:55 PM, Hans-Peter Diettrich wrote:


It's sufficient to agree that all (displayed) strings in the LCL
contain UTF-8 text, regardless of their type name (string types
currently are alias).


Plus: a char is not a displayable character but only can hold an UTF-8 
code byte.


-Michael


Re: [Lazarus] cwstring in arm-linux

On Thu, Oct 20, 2011 at 2:54 PM, Michael Schnell mschn...@lumino.de wrote:
 All others are fooled.

No, they aren't. Please stop repeating this. I think you have already
sent 10 messages in various threads saying that people are naturally
unable to use UTF-8.

This is not true. My students from the 2nd year of engineering learned
alone how to use UTF-8 properly. People asking questions on the forum
learned how to use it. So far I have seen no single person who has
such a problem with UTF-8 that they are mentally blocked from learning
it, even while this same person can use UCS-2 fluently.

This is all just fiction. Real experience shows that people can easily
learn to use UTF-8.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 02:55 PM, Felipe Monteiro de Carvalho wrote:

On Thu, Oct 20, 2011 at 2:54 PM, Michael Schnell mschn...@lumino.de wrote:

All others are fooled.

This is not true. My students from the 2nd year of engineering learned
alone how to use UTF-8 properly.


That is exactly what I meant to say. Those who do learn how to deal with 
Unicode might be very happy to keep in mind the Unicode encoding with 
all string operations.


And if your opinion is that everybody, who wants to program with 
Lazarus, is happy when he also learns the ways of Unicode, I will not 
contradict.


But IMHO Lazarus should be (at least) as easy to use as Java and friends 
and not provide additional traps for the Unicode-illiterates.


(I once proposed to drop the support for myString[i] or for the char 
type altogether to prevent some of these traps, but supposedly this is a 
silly idea.)


-Michael


Re: [Lazarus] cwstring in arm-linux

On Thu, Oct 20, 2011 at 3:10 PM, Michael Schnell mschn...@lumino.de wrote:
 But IMHO Lazarus should be (at least) as easy to use as Java and friends and
 not provide additional traps for the Unicode-illiterates.

This argument is ridiculous; string usage in Java has tons of
traps. To start with, it is a special class type with special handling,
and it is immutable o.O Then you have no var parameters to pass them around
(and even if you had, they are immutable)...etc

But let's go back to real arguments: Do you have any big applications
written in Lazarus? If you had, how happy would you be having to
convert them from utf-8 to utf-16?

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux


On 10/20/2011 03:45 PM, Felipe Monteiro de Carvalho wrote:

Do you have any big applications
written in Lazarus? If you had, how happy would you be having to
convert them from utf-8 to utf-16?

I never voted for a UTF-16 - Lazarus.

In fact I have lots of applications done in Delphi versions before 2009 and 
thus not Unicode aware at all but using just 8 bit fixed locale ANSI encoded 
strings.


I was happily converting some of those to the pre-Unicode version of Lazarus 
(for having them run on Linux).


I was not at all happy trying to convert them to the always UTF-8 
version of Lazarus.


I am sure that I will not be any happier trying to convert them to a 
hypothetical always UTF-16 version of Lazarus.


I have no idea how happy I would be trying to convert them to a Unicode 
aware Delphi version (2009 or later).


And of course I have no idea at all how happy I would be trying to 
convert them to an upcoming new Delphi String version of Lazarus.


-Michael


Re: [Lazarus] cwstring in arm-linux

On Thu, Oct 20, 2011 at 4:19 PM, Michael Schnell mschn...@lumino.de wrote:
 I was not at all happy trying to convert them to the always UTF-8 version
 of Lazarus.

Well, here is the surprise now: utf-8 was chosen exactly to facilitate
porting old applications while still supporting all of the Unicode
standard.

Length does not give the number of chars? No problem:

uses lazutf8;
function UTF8Length(const s: string): PtrInt;

How to iterate through chars? Using this:

function UTF8CharacterLength(p: PChar): integer;

How to find the n-th char?

// find the n-th UTF8 character, ignoring BIDI
function UTF8CharStart(UTF8Str: PChar; Len, CharIndex: PtrInt): PChar;
// find the byte index of the n-th UTF8 character, ignoring BIDI
// (byte len of substr)
function UTF8CharToByteIndex(UTF8Str: PChar; Len, CharIndex: PtrInt): PtrInt;

Having problems in Pos, Copy, Delete or Insert? Just replace with these:

function UTF8Pos(const SearchForText, SearchInText: string): PtrInt;
function UTF8Copy(const s: string; StartCharIndex, CharCount: PtrInt): string;
procedure UTF8Delete(var s: String; StartCharIndex, CharCount: PtrInt);
procedure UTF8Insert(const source: String; var s: string;
  StartCharIndex: PtrInt);

The switch is really easy. There are routines which are equivalent to
all operations done previously.
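
How routines of this kind can work is easy to sketch. The Python functions below are illustrative analogues written for this example (utf8_char_len, utf8_length and utf8_copy are invented names, not the LazUtils implementation): the lead byte of each sequence tells how many bytes the character occupies, and counting or copying walks from lead byte to lead byte. The sketch assumes valid UTF-8 input.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1                       # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2                       # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                       # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                       # 11110xxx
    raise ValueError("not a UTF-8 lead byte")

def utf8_length(data: bytes) -> int:
    """Number of characters (codepoints) in valid UTF-8 data."""
    count, i = 0, 0
    while i < len(data):
        i += utf8_char_len(data[i])
        count += 1
    return count

def utf8_copy(data: bytes, start_char: int, char_count: int) -> bytes:
    """Copy char_count characters starting at the 1-based character index."""
    begin = 0
    for _ in range(start_char - 1):    # skip the characters before the start
        if begin >= len(data):
            break
        begin += utf8_char_len(data[begin])
    end = begin
    for _ in range(char_count):        # then step over the requested characters
        if end >= len(data):
            break
        end += utf8_char_len(data[end])
    return data[begin:end]

data = "Gr\u00fc\u00dfe \U0001F600".encode("utf-8")
print(len(data), utf8_length(data), utf8_copy(data, 3, 2).decode("utf-8"))
```

The byte length (12) and the character count (7) differ for this string, which is exactly the difference between Length and a character-aware length routine.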

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-20 Thread Luca Olivetti


On 20/10/2011 16:34, Felipe Monteiro de Carvalho wrote:


The switch is really easy. There are routines which are equivalent to
all operations done previously.


But you have to manually (or semi-automatically) do it, which is a lot 
of work and possibly error prone.
While, with utf-16, you shouldn't change any routine name at all, unless 
you have to deal with characters outside the BMP.

According to this message
https://forums.codegear.com/message.jspa?messageID=399964#399964

Additionally, 16 bits is enough to cover the BMP, Basic Multilingual
Plane, which encompasses the majority of today's most widely used
languages. Only when you get to more advanced codepoints in some of the
far-eastern languages, or need to encode dead languages such as
Egyptian hieroglyphics do you need more than 16 bits.

Bye
--
Luca Olivetti
Wetron Automation Technology http://www.wetron.es
Tel. +34 935883004 (Ext.133)  Fax +34 935883007


Re: [Lazarus] cwstring in arm-linux

2011-10-20 Thread Mattias Gaertner





Michael Schnell mschn...@lumino.de wrote on 20 October 2011 at 15:10:

 On 10/20/2011 02:55 PM, Felipe Monteiro de Carvalho wrote:
  On Thu, Oct 20, 2011 at 2:54 PM, Michael Schnell mschn...@lumino.de wrote:
  All others are fooled.
  This is not true. My students from the 2nd year of engineering learned
  alone how to use UTF-8 properly.

 That is exactly what I meant to say. Those who do learn how to deal with
 Unicode might be very happy to keep in mind the Unicode encoding with
 all string operations. 
 
 


 And if your opinion is that everybody, who wants to program with
 Lazarus, is happy when he also learns the ways of Unicode, I will not
 contradict.

 But IMHO Lazarus should be (at least) as easy to use as Java and friends
 and not provide additional traps for the Unicode-illiterates. 
What Java do you have in mind?
Last time I used Oracle/Sun Java it still used a 2-byte char, and you need to set
the compiler/IDE to UTF-8, otherwise your source code is not portable. We have a
lot of students writing Java programs under Windows, then wondering why their
programs create garbage under Linux. Often they say: Linux has problems with
Unicode. Reason: teachers think that Unicode is so simple under Java, so they
don't explain it.
 
 


 (I once proposed to drop the support for myString[i] or for the char
 type altogether to prevent some of these traps, but supposedly this is a
 silly idea.)

If you have students that stupid, then don't tell them about the [] operator.

Mattias

Re: [Lazarus] cwstring in arm-linux

On Thu, Oct 20, 2011 at 5:30 PM, Luca Olivetti l...@wetron.es wrote:
 But you have to manually (or semi-automatically) do it, which is a lot of
 work and possibly error prone.
 While, with utf-16, you shouldn't change any routine name at all, unless you
 have to deal with characters outside the BMP.

You say so, but while I cannot comment with certainty since I have
never used the Unicode Delphi, from what I read people had major
difficulties doing the migration. It was not at all easy. While for me
the migration from ansi to utf-8 was trivially easy.

Changing the size of Char is not just a small detail, this breaks *a
lot* of code. Any kind of memory operation such as Move will fail
because the char size changed.

Not to mention people that were using PChar to address memory which is
not really a string =D suddenly the steps double in size and your
whole memory layout changes...
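
The failure mode can be mimicked in a few lines of Python (a sketch with invented variable names; byte slicing stands in for a Move call whose count argument is given in bytes): code that passes the character count as the byte count copies only half of a 2-bytes-per-Char buffer.

```python
text = "Lazarus"
n_chars = len(text)

# After the switch each Char occupies 2 bytes (UTF-16, little-endian here).
src = text.encode("utf-16-le")

# Legacy assumption: 1 Char = 1 byte, so the char count is used as byte count.
wrong = src[:n_chars]

# Corrected count: number of chars times the size of one Char.
char_size = 2
right = src[:n_chars * char_size]

print(len(wrong), len(right), right.decode("utf-16-le"))
```

Only the first three characters survive in the wrong copy, followed by half of a fourth.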

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

On Thu, Oct 20, 2011 at 2:54 PM, Michael Schnell mschn...@lumino.de wrote:
 And thus functions like pos(), length() and myString[i] work on UTF-8 code
 bytes rather than on (displayed) characters.

Characters can be composed of separate codepoints for accent +
character (so at least 4 bytes in UTF-16). So if you write code which
depends on [] indexing characters your code will fail miserably in
this case.

Mac OS X uses the decomposed form in UTF-8 to store filenames, which
is rather unpleasant. If you convert this to UTF-16 for further work
the text will not magically get composed, although one could pass it
through a composing pre-processor.
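
The composed and decomposed forms, and the kind of composing step mentioned above, can be sketched with Python's standard unicodedata module (the example letter is an arbitrary choice):

```python
import unicodedata

composed = "\u00e9"        # one codepoint: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # two codepoints: 'e' + COMBINING ACUTE ACCENT

# Same visible character, different codepoint sequences.
same_text = composed == decomposed

# The decomposed form needs two UTF-16 code units, i.e. 4 bytes.
utf16_bytes = len(decomposed.encode("utf-16-le"))

# A normalization pass to NFC plays the role of the composing pre-processor.
recomposed = unicodedata.normalize("NFC", decomposed)

print(same_text, utf16_bytes, recomposed == composed)
```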

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-20 Thread Hans-Peter Diettrich


Felipe Monteiro de Carvalho wrote:

On Thu, Oct 20, 2011 at 5:30 PM, Luca Olivetti l...@wetron.es wrote:

But you have to manually (or semi-automatically) do it, which is a lot of
work and possibly error prone.
While, with utf-16, you shouldn't change any routine name at all, unless you
have to deal with characters outside the BMP.


You say so, but while I cannot comment with certainty since I have
never used the Unicode Delphi, from what I read people had major
difficulties doing the migration. It was not at all easy. While for me
the migration from ansi to utf-8 was trivially easy.


The Ansi/UTF-16 migration is much easier than a migration to UTF-8. When 
your legacy code can assume that every (visible) character is a Char, in 
an SBCS codepage, this is not different in UTF-16. But the same is not 
true for Ansi SBCS codepages whose characters can translate into 
multi-byte sequences in UTF-8.
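
A quick check of that claim, as a Python sketch using Windows-1252 as the example SBCS codepage (the three sample bytes are arbitrary): each legacy character still maps to exactly one UTF-16 code unit, but to between one and three bytes in UTF-8.

```python
# Three characters in Windows-1252: euro sign, a-umlaut, capital A.
legacy = bytes([0x80, 0xE4, 0x41])
text = legacy.decode("cp1252")

# One 16-bit unit per character: "one character = one Char" still holds.
utf16_units = len(text.encode("utf-16-le")) // 2

# In UTF-8 the same characters take a varying number of bytes.
per_char_utf8 = [len(ch.encode("utf-8")) for ch in text]
utf8_bytes = sum(per_char_utf8)

print(len(legacy), utf16_units, per_char_utf8, utf8_bytes)
```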




Changing the size of Char is not just a small detail, this breaks *a
lot* of code. Any kind of memory operation such as Move will fail
because the char size changed.


Why would *application* code ever do low-level fiddling with *managed* 
strings???



Not to mention people that were using PChar to address memory which is
not really a string =D suddenly the steps double in size and your
whole memory layout changes...


Then replace all occurrences of String by AnsiString, and Char by 
AnsiChar (global find/replace). And replace all usages (if any) of 
UTF8String by AnsiString, to prevent possible encoding conversions. Then 
all your code should work as before. Problems may arise from standard 
text components (TStrings...), when these are not also available in Ansi 
versions - but this only affects the runtime, due to implicit 
conversions. This is where the RTL and FCL deserve some more 
consideration, and the future will tell...


DoDi



Re: [Lazarus] cwstring in arm-linux

On Mon, Oct 3, 2011 at 4:35 PM, Henry Vermaak henry.verm...@gmail.com wrote:
 That's good news, thanks!

Hello, could you test the very latest Pascal Widestring Manager? Just
disable cwstring and then add paswstring as the first unit in your
project's uses clause.

The Pascal Widestring Manager is completed, but it needs more testing =)

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Marco van de Voort

On Mon, Oct 03, 2011 at 04:31:20PM +0200, Felipe Monteiro de Carvalho wrote:
 Ok, I changed the define in rev 32655.
 
 But you should note that when paswstring gets finished it will phase
 out cwstrings.

Not that I know. And btw, I also use arm-linux without android, so please
keep that target intact and aligned with normal linux ports.


Re: [Lazarus] cwstring in arm-linux

On Wed, Oct 19, 2011 at 12:06 PM, Marco van de Voort mar...@stack.nl wrote:
 Not that I know. And btw, I also use arm-linux without android, so please
 keep that target intact and aligned with normal linux ports.

What is the difference between using cwstring and paswstring? Any
reason for not wanting to use paswstring?

They should be 100% equal, except that one does not require any
external libraries. If you can test and check whether there are any
differences, that would of course be excellent =)

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Martin Schreiber

On Wednesday 19 October 2011 13.14:50 Felipe Monteiro de Carvalho wrote:
 On Wed, Oct 19, 2011 at 12:06 PM, Marco van de Voort mar...@stack.nl wrote:
  Not that I know. And btw, I also use arm-linux without android, so please
  keep that target intact and aligned with normal linux ports.
 
 What is the difference between using cwstring and paswstring? Any
 reason for not wanting to use paswstring?
 
Where is paswstring?

Martin


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Hans-Peter Diettrich


Marco van de Voort wrote:

On Mon, Oct 03, 2011 at 04:31:20PM +0200, Felipe Monteiro de Carvalho wrote:

Ok, I changed the define in rev 32655.

But you should note that when paswstring gets finished it will phase
out cwstrings.


Not that I know. And btw, I also use arm-linux without android, so please
keep that target intact and aligned with normal linux ports.


After some discussions in Embarcadero groups I would like to learn more 
about the FPC implementation and goals of the new (Unicode...) strings. 
Where should I have a look?


In detail it turned out that Delphi only supports CP_ACP strings for 
Ansi codepages, not including UTF-8. Strings with other encodings may be 
converted properly (not yet), but otherwise should not be used with 
standard string handling procedures. Will this be changed in the FPC RTL, 
so that at least UTF8Strings are also supported properly?


DoDi



Re: [Lazarus] cwstring in arm-linux

On Wed, Oct 19, 2011 at 1:24 PM, Martin Schreiber mse00...@gmail.com wrote:
 Where is paswstring?

http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/components/lazutils/paswstring.pas?view=markup&root=lazarus

It uses lazutf8 (which includes most importantly UTF16ToUTF8 and
vice versa and utf8LowerCase and utf8UpperCase) and lconvencoding
(which includes encoding tables) which are in the same folder.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Sven Barth


On 19.10.2011 14:08, Hans-Peter Diettrich wrote:

Marco van de Voort wrote:

On Mon, Oct 03, 2011 at 04:31:20PM +0200, Felipe Monteiro de Carvalho
wrote:

Ok, I changed the define in rev 32655.

But you should note that when paswstring gets finished it will phase
out cwstrings.


Not that I know. And btw, I also use arm-linux without android, so please
keep that target intact and aligned with normal linux ports.


After some discussions in Embarcadero groups I would like to learn more
about the FPC implementation and goals of the new (Unicode...) strings.
Where should I have a look?

In detail it turned out that Delphi only supports CP_ACP strings for
Ansi codepages, not including UTF-8. Strings with other encodings may be
converted properly (not yet), but otherwise should not be used with
standard string handling procedures. Will this be changed in the FPC RTL,
so that at least UTF8Strings are also supported properly?


Uhm... isn't this better suited for fpc-devel?

Regards,
Sven



Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Marco van de Voort

On Wed, Oct 19, 2011 at 01:14:50PM +0200, Felipe Monteiro de Carvalho wrote:
> On Wed, Oct 19, 2011 at 12:06 PM, Marco van de Voort mar...@stack.nl wrote:
> > Not that I know. And btw, I also use arm-linux without android, so please
> > keep that target intact and aligned with normal linux ports.
>
> What is the difference between using cwstring and paswstring? Any
> reason for not wanting to use paswstring?

Simply to integrate with the OS, and to avoid including tables when not
necessary.

Moreover you are stating something as a fact here that was not discussed at
all.
 
> They should be 100% equal, except that one does not require any
> external libraries. If you can test and check if there are any
> differences of course would be excellent =)

I haven't been testing it, and don't plan to. I'm not interested in it, and
am not interested in growing the binaries unnecessarily.

I have no problem with having a second option for the people that do want
it, but that is something entirely different from what you were saying. 

Cwstring is staying on all normal targets as far as I know.


Re: [Lazarus] cwstring in arm-linux

On Wed, Oct 19, 2011 at 6:47 PM, Marco van de Voort mar...@stack.nl wrote:
> Moreover you are stating something as a fact here that was not discussed at
> all.

I am confused by your statements, the discussion here is about the
usage of cwstring in the LCL, then I said that I want to replace
cwstring with paswstring in the LCL (after making sure it is
completely equivalent).

Are you also discussing about the usage of cwstring in the LCL? Your
comments make me think that you are assuming I am talking about the
RTL or something like that.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Martin Schreiber

On Wednesday 19 October 2011 18.59:06 Felipe Monteiro de Carvalho wrote:
> On Wed, Oct 19, 2011 at 6:47 PM, Marco van de Voort mar...@stack.nl wrote:
> > Moreover you are stating something as a fact here that was not discussed
> > at all.
>
> I am confused by your statements, the discussion here is about the
> usage of cwstring in the LCL, then I said that I want to replace
> cwstring with paswstring in the LCL (after making sure it is
> completely equivalent).
>
> Are you also discussing about the usage of cwstring in the LCL? Your
> comments make me think that you are assuming I am talking about the
> RTL or something like that.

Ah, sorry, I read it wrong too...

Martin


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Marco van de Voort

On Wed, Oct 19, 2011 at 06:59:06PM +0200, Felipe Monteiro de Carvalho wrote:
> I am confused by your statements, the discussion here is about the
> usage of cwstring in the LCL, then I said that I want to replace
> cwstring with paswstring in the LCL (after making sure it is
> completely equivalent).
>
> Are you also discussing about the usage of cwstring in the LCL? Your
> comments make me think that you are assuming I am talking about the
> RTL or something like that.

No, sorry. Though I still think that is not a good thing either.


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Vincent Snijders

2011/10/19 Marco van de Voort mar...@stack.nl:
> On Wed, Oct 19, 2011 at 06:59:06PM +0200, Felipe Monteiro de Carvalho wrote:
> > I am confused by your statements, the discussion here is about the
> > usage of cwstring in the LCL, then I said that I want to replace
> > cwstring with paswstring in the LCL (after making sure it is
> > completely equivalent).
> >
> > Are you also discussing about the usage of cwstring in the LCL? Your
> > comments make me think that you are assuming I am talking about the
> > RTL or something like that.
>
> No, sorry. Though I still think that is not a good thing either.

I guess Felipe gave up waiting on a Unicode RTL for the time being and
goes for a full UTF8 pseudo RTL in LazUtils.

Vincent


Re: [Lazarus] cwstring in arm-linux

On Wed, Oct 19, 2011 at 6:33 PM, Martin Schreiber mse00...@gmail.com wrote:
> Does it use locale specific collation in PasUnicodeCompareStr and
> PasUnicodeCompareText?

Good point, no, not yet. But this affects only Turkish, Azeri and
Lithuanian AFAIK.

Adding Turkish and Azeri is trivial, because UTF8LowerCase supports
them, but I have not yet understood the rules for Lithuanian; they are
quite convoluted and depend on nearby chars and stuff like that.

> Is the performance of UTF8LowerCase and UTF8UpperCase OK?

UTF8LowerCase was heavily optimized. UTF8UpperCase still needs to be
more optimized.

6 million UTF8LowerCase operations on the string АБВЕЁЖЗКЛМНОПРДЙГ
take 2.6 seconds on my computer. It outperforms iconv by a factor of
approx. 2.5x:

UTF8LowerCase -- Performance test took:
804 ms, 1896 ms, 2318 ms, 3460 ms, 2647 ms,
1847 ms, 2526 ms, 2496 ms, 1830 ms, 1975 ms

CWString SysUtils.UnicodeLowerCase -- Performance test took:
2456 ms, 2461 ms, 6594 ms, 6170 ms, 5347 ms,
6939 ms, 4398 ms, 4429 ms, 2285 ms, 2411 ms

For these strings:

  if j = 0 then Str := UTF8LowerCase('abcdefghijklmnopqrstuwvxyz');
  if j = 1 then Str := UTF8LowerCase('ABCDEFGHIJKLMNOPQRSTUWVXYZ');
  if j = 2 then Str := UTF8LowerCase('aąbcćdeęfghijklłmnńoóprsśtuwyzźż');
  if j = 3 then Str := UTF8LowerCase('AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ');
  if j = 4 then Str := UTF8LowerCase('АБВЕЁЖЗКЛМНОПРДЙГ');
  if j = 5 then Str := UTF8LowerCase('名字叫嘉英，嘉陵江的嘉，英國的英');
  if j = 6 then Str :=
    UTF8LowerCase('AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuWvVwXxYyZz');
  if j = 7 then Str :=
    UTF8LowerCase('AAaaBBbbCCccDDddEEeeFFffGGggHHhhIIiiJJjjKKkkLLllMMmm');
  if j = 8 then Str := UTF8LowerCase('abcDefgHijkLmnoPqrsTuwvXyz');
  if j = 9 then Str := UTF8LowerCase('ABCdEFGhIJKlMNOpQRStUWVxYZ');
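For anyone who wants to reproduce timings of this kind, a minimal harness could look like the sketch below. It assumes that the LazUTF8 unit from LazUtils is on the unit search path and that the source file is saved as UTF-8. The iteration count and the use of GetTickCount64 are assumptions made for illustration, not necessarily how the figures above were measured.

```pascal
program LowerCaseBench;
{$mode objfpc}{$H+}
uses
  SysUtils,
  LazUTF8; // LazUtils unit; assumed to be on the unit search path
const
  Iterations = 1000000; // hypothetical count, adjust as needed
var
  i: Integer;
  Str: string;
  StartTick: QWord;
begin
  StartTick := GetTickCount64;
  // Lowercase the same Cyrillic test string over and over
  for i := 1 to Iterations do
    Str := UTF8LowerCase('АБВЕЁЖЗКЛМНОПРДЙГ');
  WriteLn('UTF8LowerCase took: ', GetTickCount64 - StartTick, ' ms');
  // Print the last result so the loop clearly has an observable effect
  WriteLn('Last result: ', Str);
end.
```

Swapping the call inside the loop for another lowercase routine gives a like-for-like comparison on the same machine.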

> Do UTF8LowerCase and UTF8UpperCase cover all upper/lowercase Unicode
> (possibly accented) characters?

UTF8LowerCase currently covers all characters in the latest Unicode
spec AFAIK. Of course I might have forgotten something, but I have
tests for chars from  to 0580 and more tests for other clusters.

UTF8UpperCase is currently implemented from  to 0450, but I will
add the rest.

> Does it handle decomposed characters (cwstring doesn't)?

I think that decomposed characters should work naturally. See, for
example, if we have: [0]=~ (tilde accent, but the special version for
composition) [1]=A which forms Ã and then we pass lowercase into it,
we would get [0] without change and [1]=a which forms ã. Or am I
wrong?
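The idea can be checked with a small sketch like the one below. Note that in Unicode the combining mark follows its base letter, so the decomposed form of Ã is 'A' followed by U+0303 (bytes CC 83 in UTF-8). The sketch assumes LazUTF8 is on the unit search path; whether the comparison holds depends on UTF8LowerCase passing the combining mark through untouched, which is the expectation described above rather than something verified here.

```pascal
program DecomposedLower;
{$mode objfpc}{$H+}
uses
  LazUTF8; // LazUtils unit; assumed to be on the unit search path
const
  // 'A' followed by U+0303 COMBINING TILDE (UTF-8 bytes CC 83)
  DecomposedUpper = 'A'#$CC#$83;
  // 'a' followed by the same combining mark
  DecomposedLower = 'a'#$CC#$83;
var
  Lowered: string;
begin
  Lowered := UTF8LowerCase(DecomposedUpper);
  // Only the base letter should change; the mark bytes should stay as-is
  WriteLn('Matches decomposed lowercase form: ', Lowered = DecomposedLower);
end.
```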

If you are talking about handling for CompareText, then the answer
would be that AFAIK it would be too inefficient to handle that in
CompareText ... so we would need another routine for that
NormalizedCompareText or something like that, which executes
normalization, then lowercase and finally the comparison.
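A rough sketch of such a routine is shown below: normalize, then lowercase, then compare. The normalization step is passed in as a function because no particular normalizer is named in this thread. ToyDecompose is a hypothetical stand-in that only knows Ã and ã; a real implementation would need full Unicode decomposition data. The name NormalizedCompareText and its signature are likewise only illustrative.

```pascal
program NormalizedCompareDemo;
{$mode objfpc}{$H+}
uses
  SysUtils,
  LazUTF8; // LazUtils unit; assumed to be on the unit search path

type
  TNormalizeFunc = function(const S: string): string;

// Hypothetical stand-in normalizer: decomposes only U+00C3 and U+00E3
// into base letter + U+0303. A real one would cover all of Unicode.
function ToyDecompose(const S: string): string;
begin
  Result := StringReplace(S, #$C3#$83, 'A'#$CC#$83, [rfReplaceAll]);
  Result := StringReplace(Result, #$C3#$A3, 'a'#$CC#$83, [rfReplaceAll]);
end;

// Normalize both strings, lowercase them, then compare byte-wise.
function NormalizedCompareText(const A, B: string;
  Normalize: TNormalizeFunc): Integer;
begin
  Result := CompareStr(UTF8LowerCase(Normalize(A)),
                       UTF8LowerCase(Normalize(B)));
end;

begin
  // Precomposed U+00C3 versus decomposed 'a' + U+0303
  WriteLn(NormalizedCompareText(#$C3#$83, 'a'#$CC#$83, @ToyDecompose));
end.
```

Keeping this separate from CompareText means callers that know their input is already normalized do not pay for the extra pass.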

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

On 10/19/11, Vincent Snijders vincent.snijd...@gmail.com wrote:
> I guess Felipe gave up waiting on a Unicode RTL for the time being and
> goes for a full UTF8 pseudo RTL in LazUtils.

Well, after a lot of discussion I got convinced that Lazarus should
give a try at the UTF-8 mode of the RTL when this appears, and this
might be very useful for our usage of TStringList, TComponent,
TStream, etc. I think this solution has major problems, but it was
claimed that my proposed solutions have much worse problems, so in the
end I concluded that we should try the UTF-8 mode of the RTL when it
appears.

But this does not mean that LazUtils would not be useful then. My
proposals to add UTF-8 routines to the RTL and even FCL were rejected,
so we UTF-8 users would need to be stuck with only whatever routines
Embarcadero invents. That's not nearly good enough and not nearly fast
enough. UTF8LowerCase is very superior to the existing RTL LowerCase.
To start with, the RTL in existing release doesn't even have a
UTF8String LowerCase (no idea about 2.7). Also, UTF8LowerCase has a
second parameter to specify the language, so we can test and support
Turkish without having to change our locale to Turkish, and it
outperforms SysUtils.UnicodeLowerCase by approx. 250% on my Mac, and it
has zero external dependencies while depending on zero initialization
code, zero global variables and having 1k lines of code (half of them
comments), which is not that much. As you can see it vastly
outperforms even what the UTF-8 mode of the RTL would offer for this.
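The language parameter can be illustrated with the Turkish dotted/dotless i. The sketch below assumes LazUTF8 is on the unit search path and that 'tr' is the language code UTF8LowerCase expects for Turkish; the output is not asserted here, it simply prints whether each result matches the expected letter.

```pascal
program TurkishLower;
{$mode objfpc}{$H+}
uses
  LazUTF8; // LazUtils unit; assumed to be on the unit search path
var
  Plain, Turkish: string;
begin
  // Default rules: 'I' is expected to become 'i'
  Plain := UTF8LowerCase('I');
  // Turkish rules ('tr' is an assumed language code): 'I' is expected
  // to become dotless i, U+0131, UTF-8 bytes C4 B1
  Turkish := UTF8LowerCase('I', 'tr');
  WriteLn('default gives "i": ', Plain = 'i');
  WriteLn('"tr" gives dotless i: ', Turkish = #$C4#$B1);
end.
```

No locale change is needed for the second call, which is the point being made above.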

Just like UTF8LowerCase, other things provided by LazUtils will also
be useful options for Lazarus and other libraries/applications,
regardless of FPC offering something similar. And then I think that
everyone will be happy. People that want Delphi compatibility
(excluding string and PChar, since they will not match in the RTL mode
used by Lazarus) will be happy, they can use RTL routines and get
compatibility. Lazarus will still be using string and TStringList,
TComponent, etc.

-- 
Felipe Monteiro de Carvalho


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Michael Van Canneyt




On Wed, 19 Oct 2011, Felipe Monteiro de Carvalho wrote:


> On 10/19/11, Vincent Snijders vincent.snijd...@gmail.com wrote:
>
> > I guess Felipe gave up waiting on a Unicode RTL for the time being and
> > goes for a full UTF8 pseudo RTL in LazUtils.
>
> Well, after a lot of discussion I got convinced that Lazarus should
> give a try at the UTF-8 mode of the RTL when this appears, and this
> might be very useful for our usage of TStringList, TComponent,
> TStream, etc. I think this solution has major problems, but it was
> claimed that my proposed solutions have much worse problems, so in the
> end I concluded that we should try the UTF-8 mode of the RTL when it
> appears.
>
> But this does not mean that LazUtils would not be useful then. My
> proposals to add UTF-8 routines to the RTL and even FCL were rejected,


Correction: Your proposals were not rejected.

No decision as to which character sets will be used in the basic RTL
has been taken. Any action you take now is therefore premature.

So it was suggested that you wait until things settle down and
the final shape of things is clearer.


This really is not the same as 'rejected'.

Michael.


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Vincent Snijders

2011/10/19 Michael Van Canneyt mich...@freepascal.org:


> On Wed, 19 Oct 2011, Felipe Monteiro de Carvalho wrote:
>
> > On 10/19/11, Vincent Snijders vincent.snijd...@gmail.com wrote:
> >
> > > I guess Felipe gave up waiting on a Unicode RTL for the time being and
> > > goes for a full UTF8 pseudo RTL in LazUtils.
> >
> > But this does not mean that LazUtils would not be useful then. My
> > proposals to add UTF-8 routines to the RTL and even FCL were rejected,
>
> Correction: Your proposals were not rejected.

Thanks for the clarification.


> No decision as to which character sets will be used in the basic RTL has
> been taken. Any action you take now is therefore premature.
>
> So it was suggested that you wait until things settle down and the
> final shape of things is clearer.

That is why I said: gave up waiting


> This really is not the same as 'rejected'.

Vincent


Re: [Lazarus] cwstring in arm-linux

2011-10-19 Thread Žilvinas Ledas


Hello,

On 2011-10-19 21:03, Felipe Monteiro de Carvalho wrote:

> On Wed, Oct 19, 2011 at 6:33 PM, Martin Schreiber mse00...@gmail.com wrote:
>
> > Does it use locale specific collation in PasUnicodeCompareStr and
> > PasUnicodeCompareText?
>
> Good point, no, not yet. But this affects only turkish, azeri and
> lithuanian AFAIK
>
> Adding turkish and azeri is trivial, because UTF8LowerCase supports
> them, but I did not understand yet the rules for Lithuanian, they are
> quite convoluted, depend on nearby chars and stuff like that.

I am a native Lithuanian, so I think I can help, at least by providing
information, but I must first understand what the problem is.
Do I understand correctly that collation means sorting order? In
that case Lithuanian does not depend on nearby characters.

There are 32 letters and they follow this order:
Aa Ąą Bb Cc Čč Dd Ee Ęę Ėė Ff Gg Hh Ii Įį Yy Jj
Kk Ll Mm Nn Oo Pp Rr Ss Šš Tt Uu Ųų Ūū Vv Zz Žž
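As a rough illustration of this letter order only, the sketch below ranks each character by its position in the 32-letter alphabet and compares two words letter by letter. It is a simplified assumption-laden example: it expects lowercase input, treats every letter as distinct, gives characters outside the alphabet rank 0, and ignores accented dictionary characters. The name LtCompare is hypothetical, and the source file is assumed to be saved as UTF-8.

```pascal
program LtOrder;
{$mode objfpc}{$H+}
{$codepage utf8}
const
  // The 32 lowercase letters in Lithuanian alphabetical order
  LtAlphabet: UnicodeString = 'aąbcčdeęėfghiįyjklmnoprsštuųūvzž';

// Hypothetical helper: negative if A sorts before B, positive if after,
// zero if equal. Characters not in LtAlphabet get rank 0 and sort first.
function LtCompare(const A, B: UnicodeString): Integer;
var
  i, RankA, RankB: Integer;
begin
  i := 1;
  while (i <= Length(A)) and (i <= Length(B)) do
  begin
    RankA := Pos(A[i], LtAlphabet);
    RankB := Pos(B[i], LtAlphabet);
    if RankA <> RankB then
      Exit(RankA - RankB);
    Inc(i);
  end;
  // All shared positions are equal: the shorter word sorts first
  Result := Length(A) - Length(B);
end;

begin
  // 'y' comes right after 'į' and before 'j' in this alphabet
  WriteLn('yla before jis: ', LtCompare('yla', 'jis') < 0);
  // 'š' follows 's' and precedes 't'
  WriteLn('šuo before tu: ', LtCompare('šuo', 'tu') < 0);
end.
```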


And there are some accented characters which are used only in linguistic 
texts (for example, dictionaries). (All list is here: 
http://developer.mimer.com/charts/lithuanian.htm)


The funny thing is that in dictionaries when sorting words, Aa and 
Ąą (also: Ee and Ęę and Ėė; Ii and Įį and Yy; Uu and 
Ųų and Ūū) are treated as the same letter.
BUT, for example words šieną  sieną  sieną - all three are 
different words (no accents in these characters).
BUT I believe that accented characters should be treated as the same 
letter: šiẽną = šieną; siena = síena, because it is the same 
word (accents do not change word meaning and are totally not required to 
be provided by the text writer).


I don't know if I managed to explain anything, but if you need some
help with the Lithuanian language, feel free to contact me.



Regards,
Žilvinas Ledas


Re: [Lazarus] cwstring in arm-linux