Re: [fpc-devel] Unicodestring branch, please test and help fixing

listmember Fri, 12 Sep 2008 08:07:53 -0700

[Note that, here 'TCharacter' isn't necessarily an object; it might as
well be a simple record structure.]


AFAIK for most programmers this is not a common task. Most programs need less
(one language or codepage)


But, when you're talking unicode, codepage is rather meaningless --isn't it?

or more (phonetic, semantic, statistical search).
Can you explain, why you think that this particular problem requires compiler
magic?


See my other reply to Martin Friebe, in another sub thread.

Is there, in Unicode, start-stop markes that denote 'language'?


Is it needed?
Are the any unicode characters, that upper/lower depend on language?


Yes. See my other reply to Martin Friebe, in another sub thread.

Take this, for example:

"if SameText(SomeString, SomeOtherString) then do ..."

For this to work properly, in both 'SomeString' and 'SomeOtherString',
you need to know which language *each* character belongs to.


Comparing texts can be done with various meanings. For example: byte comparison,
simple case insensitive comparison, not literal comparison, compare like this
library, ....
Which one do you mean?


Byte comparison isn't what I am worried about.

In every language, there a pretty known and fixed (by now) rules thatapply to string comparison. I am referring to those rules.

[...]
Here is a simple example for you:

"if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ..."

Now.. how are you going to decide that SameText() function here returns
true unless you have information that the substring 'FoolStraße' is in
German?


The two strings have the same language, but are written with different
Rechtschreibung. You need dictionaries and spelling systems to implement such
comparisons. This is beyond a compiler or a RTL.

Are you sure. I was under the impression that Unicode covers these--without needing further data.

What about loan words?

For all practical purposes, 'loan words' belong to the language they areused in.


Except the case where we'd be discussing etymology.

SameText('Istanbul', 'istanbul') can only return true when both
'Istanbul' and 'istanbul' are *not* in Turkish/Azerbeijani.

Otherwise, the same SameText() has to return false.


I doubt that it is that easy.


Well.. I never said that it would be that easy.

But, if strip off the language attribute from the caharcater, it will beimpossible --or several orders of magnitude harder for those people whoneed it.


You can, of course, ignore all that.

But, then, what is the point of going unicode?

We were just fine doing things ANSI-centric..

Weren't we?
_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel

Re: [fpc-devel] Unicodestring branch, please test and help fixing

Reply via email to