Re: [fpc-devel] Unicode in the RTL (my ideas)

Hans-Peter Diettrich Wed, 22 Aug 2012 18:47:23 -0700

Daniël Mantione wrote:

On Wed, 22 Aug 2012, Felipe Monteiro de Carvalho wrote:

On Wed, Aug 22, 2012 at 9:36 PM, Martin Schreiber <mse00...@gmail.com> wrote:
I am not talking about Unicode. I am talking about the day-to-day programming of an average programmer, where life is easier with utf-16 than with utf-8.
Unicode is not done by using pos() instead of character indexes.
I think everybody knows my opinion, I stop now.

Please be clear in the terminology. Don't say "life is easier with
utf-16 than with utf-8" if you don't mean utf-16 as it is. Just say
"life is easier with ucs-2 than with utf-8", then everything is clear
that you are talking about ucs2 and not true utf-16.


That is nonsense.

* There are no whitespace characters beyond widechar range. This means you
  can write a routine to split a string into words without bothering about
  surrogate pairs and remain fully UTF-16 compliant.


How is this different for UTF-8?
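
A minimal sketch in Python (not code from this thread) of why the question is fair. Both functions split on the code unit for U+0020 only, which is a simplifying assumption; the function names are made up for illustration. Neither one inspects surrogates or multi-byte sequences, because a surrogate code unit (0xD800..0xDFFF) can never equal 0x0020, and every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never equal 0x20 either.

```python
def split_utf16_units(text):
    """Split on U+0020 by scanning 16-bit code units; surrogates are never examined."""
    data = text.encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]
    words, current = [], []
    for unit in units:
        if unit == 0x0020:
            if current:
                words.append(current)
            current = []
        else:
            current.append(unit)
    if current:
        words.append(current)
    # Re-assemble each run of code units and decode it; surrogate pairs stay intact.
    return [
        b"".join(u.to_bytes(2, "little") for u in word).decode("utf-16-le")
        for word in words
    ]


def split_utf8_bytes(text):
    """Split on byte 0x20; lead and continuation bytes are never examined."""
    data = text.encode("utf-8")
    return [part.decode("utf-8") for part in data.split(b" ") if part]


if __name__ == "__main__":
    # U+13000 is an Egyptian hieroglyph: a surrogate pair in UTF-16, 4 bytes in UTF-8.
    sample = "a \U00013000 b"
    print(split_utf16_units(sample) == split_utf8_bytes(sample))
```

Extending either version to the other Unicode whitespace characters only changes the comparison; the scan still works on whole code units or whole bytes.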

* There are no characters with upper/lowercase beyond widechar range.
  That means if you write code that deals with character case you don't
  need to bother with surrogate pairs and still remain fully UTF-16
  compliant.


How expensive is a Unicode Upper/LowerCase conversion per se?

* You can group Korean letters into Korean syllables, again without
  bothering about surrogate pairs, as Korean is one of the many languages
  that is entirely in widechar range.


The same applies to English and UTF-8 ;-)
Selected languages can be handled in special ways, but not all.
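
For reference, the grouping of Korean letters into syllables mentioned in the quoted bullet is plain arithmetic on code points, as defined by the Unicode standard for conjoining Hangul jamo. The sketch below is in Python and only illustrates that arithmetic; the function name compose_hangul is made up for this example and is not an FPC or RTL routine. All inputs and results are below U+10000, so the same arithmetic works on UTF-16 code units, or on decoded code points whatever the storage encoding.

```python
import unicodedata

S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one before the first trailing consonant
V_COUNT = 21
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"


def compose_hangul(lead, vowel, trail=None):
    """Combine conjoining jamo into one precomposed Hangul syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail is not None else 0
    if not (0 <= l_index < 19 and 0 <= v_index < V_COUNT and 0 <= t_index < T_COUNT):
        raise ValueError("not a composable jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    jamo = "\u1112\u1161\u11ab"
    # Cross-check against the standard library's NFC normalization.
    print(compose_hangul(*jamo) == unicodedata.normalize("NFC", jamo))
```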

Many more examples exist. It's true that there are also many examples where surrogates do need to be handled.
But... even if a certain piece of code doesn't handle e.g. Egyptian hieroglyphs correctly, there is no guarantee that UTF-8 code would do so, since these scripts have many properties that are not compatible with text processing code designed for western languages; they need a lot of custom code.


That's it!

In everyday coding I'm happy with AnsiStrings, covering English and German. But when I want to deal with Unicode, except for display-only purposes, I want to do it right and in the simplest way. This means that I'd call the functions existing (in FPC?) for detecting non-breakable character ranges, upper/lower case conversion etc., and use (sub)strings throughout to get rid of any byte/word count issues.

You mentioned Korean syllable splitting - is this a task that occurs often in Korean programs? I don't remember when I *ever* wanted to break German or English words into syllables. At the beginning of computer-based publishing most German texts were hard to read, due to many word-break errors. Finding syllables (as possible breakpoints), particularly in foreign languages, still requires the use of appropriate library functions, which (hopefully) do proper disambiguation. In my code I'd call the GetSyllable function, and then split the string at the given points - regardless of any encoding. Or, as I really did, break strings only at word boundaries, again insensitive to any encoding.

Also, breaking strings for display purposes, at a given pixel count, is expensive. It's not sufficient to find possible breakpoints; it's also required to narrow down the right breakpoint by repeated tries. It's not a good idea to simply add the widths of individual characters; instead the pixel width of every candidate substring must be determined individually. This means that the efficiency does not depend much on the string encoding.
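
A rough sketch of the "repeated tries" in Python, under stated assumptions: measure is a hypothetical callback standing in for whatever text-extent call the platform provides, the default candidate breakpoints are simply the positions of spaces, and the width of a prefix is assumed never to shrink as the prefix grows, which is what makes a binary search valid. The dominant cost is the calls to measure on whole substrings; the string encoding plays no part in it.

```python
def break_at_width(text, max_width, measure, breakpoints=None):
    """Return the largest candidate break index whose prefix fits into max_width.

    measure(substring) -> pixel width; hypothetical stand-in for a real
    text-extent function. Returns 0 if not even the first candidate fits.
    """
    if breakpoints is None:
        # Assumption: break only at spaces, plus the end of the text.
        breakpoints = [i for i, ch in enumerate(text) if ch == " "] + [len(text)]
    lo, hi, best = 0, len(breakpoints) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        # Measure the whole prefix, not a sum of per-character widths,
        # so kerning and ligatures are accounted for by the renderer.
        if measure(text[:breakpoints[mid]]) <= max_width:
            best = breakpoints[mid]
            lo = mid + 1
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    fake_measure = lambda s: 7 * len(s)   # toy metric: 7 pixels per character
    line = "the quick brown fox"
    cut = break_at_width(line, 70, fake_measure)
    print(repr(line[:cut]))
```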

But another point becomes *really* important when libraries with the aforementioned Unicode functions are used: the application and libraries should use the *same* string encoding, to prevent frequent conversions with every function call. This suggests using the library (= platform) specific string encoding, which can be different on e.g. Windows and Linux.

Consequently a cross-platform program should be as insensitive as possible to encodings, and the whole UTF-8/16 discussion turns out to be purely academic. This leads again to a different issue: should we declare a string type dedicated to Unicode text processing, which can vary depending on the platform/library encoding? Then everybody can decide whether to use one string type (RTL/FCL/LCL compatible) for general tasks, and the library-compatible type for text processing.

Or should we bite the bullet and support different flavors of the FPC libraries, for best performance on any platform? This would also leave it to the user to select his preferred encoding, stopping any UTF discussion immediately :-]


DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
