Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Martin Friebe
Just to make sure: is all of this discussion based on the various 
collations for European languages? Or shall we include Arabic, Chinese and 
other languages? They have their own characters, so they can be identified 
without collation; they do not need language info to be distinguished 
from European text. (They may have collations of their own, just as a 
German text could be handled under different collations.)


listmember wrote:

So maybe the design is quite well thought out?


Adding a flag field is easy enough --if all you're doing is some 
sort of collation. In that sense, everything is well thought out.


But..

Life becomes very complicated when you begin to do things like FTS 
(full text search) on a multilanguage text in a DB engine.


Your options, in this case, are very limited:
  -- Ignore the language issue.
or
  -- Store each language in a different field (that is, if you know how 
many there will be).


Do you think this is a good solution --or a hack?

True, that would be hard to do (in a DB, in Pascal, or most other places). 
But again, this is a very special case, and that is why none of the 
frameworks (DB, Pascal, ...) include it. You have to roll your own solution.


At no time did I say (nor, afaik, did anyone else) that you cannot build 
your own object-based text-holding classes.

The questions were:
1) should FPC replace the string with an object (like Java)
2) which additional attributes should be stored by a string (per string 
/ per char)


And actually both of those questions can be moved out of the context of 
the Unicode implementation, because both of them could also be applied to 
the current (char = byte) strings.


I am going to leave out the object question for now; I said all I can 
say in earlier mails. And from your comments it appears to be more a 
question of collation being stored with the string, a substring, or even 
each char.


As found in the last mail, there is currently no standard for handling 
cross-collation in any string function (that is, any string function which 
could be collation based).
1) IMHO only a few people would need this. For the majority it would be 
unwanted overhead.
2) Within those few, there would be too many different expectations as to 
what the standard should be. If FPC chose one such standard at will, 
it would benefit almost no one.


The best FPC could do is provide storage for something that is not 
handled or obeyed by any function handling the data. This doesn't sound 
desirable to me. If anyone who needs it has to implement the 
functions anyway, then they may add their own storage for it too.


Besides, instead of storing it per char, you can use unused Unicode code 
points as start/stop markers. So it can be implemented on top of a string 
that stores Unicode chars (and chars only, no attributes).
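Martin's "unused Unicode as start/stop markers" idea can be sketched like this (Python for brevity; the Private Use Area marker assignments are purely hypothetical, nothing in Unicode gives them this meaning):

```python
# Sketch (hypothetical markers): use Private Use Area code points as
# start/stop markers, so language info can ride inside a plain
# char-only Unicode string and be stripped before display or compare.
LANG_MARK = {"de": "\uE000", "tr": "\uE001"}  # hypothetical assignments
END_MARK = "\uE0FF"

def tag(text, lang):
    """Wrap a run of text with its language marker."""
    return LANG_MARK[lang] + text + END_MARK

def strip_tags(s):
    """Recover the plain text, ignoring all markers."""
    marks = set(LANG_MARK.values()) | {END_MARK}
    return "".join(ch for ch in s if ch not in marks)

s = "I am on " + tag("FoolStraße", "de")
```

Any app-level convention like this stays invisible to string functions that ignore the markers, which is exactly the "implemented on top" property Martin describes.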



As for storing info per string or per char (info could be anything: 
collation, color, style, font, source-of-quote, author, creation-date, 
file, ...): everyone would like their own, so again FPC shouldn't do it. 
Or everyone gets all the overhead of what all the others wanted.

Collation is a function of language.
Right, but language is something you *can* apply to strings; you are not 
forced to do so. Strings work very well without language too.
Same as your no-GUI point: strings work without display. Font/style is 
a function of rendering, yet I may want to search a string and only 
look at chars marked as bold.


Language is an extension to a string, in the same way that rendering 
info or source info is. To you, language may matter a great deal; to 
others, other attributes will matter.
All the others are not an intrinsic part of a char at all --they 
vary by context.
Why is language intrinsic to the text? An A is an A in any language. 
At best, language is intrinsic to sorting/comparing (case for 
case-insensitive comparison) text.

If pascal doesn't suit the need of a specific task, choose a different
tool. Instead of inventing a new pascal.


Thank you for the advice.
But, instead of jailing this discussion to the at best laterally 
relevant issue of collation, can I ask you to think for a moment:
How on earth can you do a case-INsensitive search in *any* given 
string that contains multiple-language substrings?


Please note the 'case-INsensitive' keyword there.
Well, I needed an actual example where case mapping differs by language 
(assuming we talk about languages using the same charset, not comparing 
Chinese with English).


In any case, I can write up several different algorithms for how to do 
that. What I cannot do (or what I do not want to do) is decide which 
of them other people will want to use.


Search case-insensitively for 'UP LOW' in ' ups upper lows lower',

with the following attributes:
'UP LOW' is a string of 2 languages.
The word UP is in a language that defines U and u as different 
letters (not only differing by case, but differing the same way a and b 
differ).
The word LOW is in a language where all letters have lower-case 
equivalents (as in English).


'ups' 

Re: [fpc-devel] Overloaded Pos bug

2008-09-12 Thread Vincent Snijders

[EMAIL PROTECTED] wrote:

In fpc revision 11746 i cannot compile this construction:

program posx;
var
  s, s1: WideString;
begin
  Pos(s[1], s1);
end.

fpc -vh posx.pas
Hint: Start of reading config file /etc/fpc.cfg
Hint: End of reading config file /etc/fpc.cfg
Free Pascal Compiler version 2.3.1 [2008/09/11] for i386
Copyright (c) 1993-2008 by Florian Klaempfl
posx.pas(5,3) Error: Can't determine which overloaded function to call
ustrings.inc(1524,10) Hint: Found declaration: Pos(UnicodeChar,const 
UnicodeString):LongInt
ustrings.inc(1497,10) Hint: Found declaration: Pos(const UnicodeString,const 
UnicodeString):LongInt
posx.pas(7) Fatal: There were 1 errors compiling module, stopping
Fatal: Compilation aborted

Kylix, and FPC from about a month earlier, compiled this fine.



Please, create a bug report.

Vincent
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re[2]: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread JoshyFun
Hello ABorka,

Friday, September 12, 2008, 2:30:35 AM, you wrote:

A Thanks for pointing me to the Lazarus thread about this and the bug
A report. Checked them.
A But as I understand there is no solution available at the moment for this.

I had partially solved the problem using the OnGetText handler 
(I'm not sure about the name) for each field, which somewhat dirtily 
forces a codepage-to-UTF8 conversion (in Lazarus you will find some 
codepage-to-UTF8 conversions available).

A I have a database that is not encoded utf8 (and it will never be because
A other client programs are accessing it and their users do not want/need
A to be converted to unicode). How do I get the field values into 
A FPC/Lazarus into a string variable? Right now the non-unicode strings
A are returned as empty from a database field due to FCL conversion functions.

If you need a fixed solution for this project, maybe you can think 
about creating a new database unit file based on the current one 
(changing the name, of course) with a hardcoded codepage-to-UTF8 
conversion for each string retrieved from the database. Take care about 
string length, as the UTF-8 strings will be equal in length or longer 
than the originals.
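The per-field conversion JoshyFun describes can be sketched like this (Python for brevity; CP1252 is an assumed source codepage, pick whatever the database actually uses):

```python
# Sketch: decode the legacy codepage bytes and re-encode as UTF-8,
# as the database layer would do per string. Note the UTF-8 result
# is equal in length or longer, never shorter.
raw = "My Perfect\u2122 World\u00ae".encode("cp1252")  # bytes as stored in the DB
utf8 = raw.decode("cp1252").encode("utf-8")            # what the app should see
```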

A Not to mention writing something to the database back.
A Is there a function to convert 'My Perfect™ World®' to whatever format
A the components require and vice versa? Something for the ASCII table up
A till #255 (English letters with some special characters like the above
A example).

Check lconvencoding.pas in the LCL folder of Lazarus.

-- 
Best regards,
 JoshyFun



Re: Re[2]: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Joost van der Sluis
On Friday 12-09-2008 at 13:22 [timezone +0200], JoshyFun wrote:

 A Thanks for pointing me to the Lazarus thread about this and the bug
 A report. Checked them.
 A But as I understand there is no solution available at the moment for this.
 
 I had partially solved the problem using the handler OnGetText ?
 (I'm not sure about the name) for each field which is somehow dirty
 forcing a codepage to UTF8 conversion (in Lazarus you will find some
 codepage-UTF conversions available).

I think the original poster didn't look very carefully in the 
archives; this solution has been given here quite often.

 A I have a database that is not encoded utf8 (and it will never be because
 A other client programs are accessing it and their users do not want/need
 A to be converted to unicode). How do I get the field values into 
 A FPC/Lazarus into a string variable? Right now the non-unicode strings
 A are returned as empty from a database field due to FCL conversion 
 functions.
 
 If you will need this as a fixed solution for this project maybe you
 can think in create a new database unit file based in the current one
 (change the name of course) with hardcoded UTF8 encoding from codepage
 for each string once retrieved from the database. Take care about
 string length as UTF8 ones will be equal or longer than the original
 ones.

You can just override one single method to do this. This has also been 
mentioned a few times on this list.

Joost.



Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread listmember

Martin Friebe wrote:

Just to make sure, all of this discussion is based on various collation


No part of this discussion is based on collation.


I am going to leave out the object question for now. I said all I can
say in earlier mails.


That's good. Thank you.


And also from your comments it appears more a question of collation

 being stored with the string, substring, or even each char.

Martin, are you doing this on purpose? I mean, are you intentionally 
driving me up the wall?


Seriously. Can't you forget/drop this 'collation' word?!

And, then, think a little deeper.

Here is a scenario for you:

You have multilanguage text as data. Someone has asked you to search it 
and see if a certain piece of string (in a given language) exists in it.


This search needs to be NOT case-sensitive.

How can you do this?

Is it doable if TCharacter (or whatever you call it) has no 'language' 
attribute?


[Note that, here 'TCharacter' isn't necessarily an object; it might as 
well be a simple record structure.]



As found in the last mail, there is currently no standard for handling
cross-collation in any string function (that is string function, which
could be collation based).
1) IMHO only few people would need this. For the majority it would be
unwanted overhead.
2) Within those few, there would be too many different Expectation as to
what the standard should be. If FPC choose one such standard at will,
it would benefit almost no one.


You're still stuck with that wretched word 'collation'.


The best FPC could to is provide storage, for something that is not
handled or obeyed in any function handling the data. This doesn't sound
desirable to me. If anyone who needs it will have to implement the
functions, then those may add there own storage for it too.

Besides instead of storing it per char, you can use unused unicode as
start/stop markers. So it can be implemented on top of a string that
stores unicode-chars (and chars only, no attributes)


Are there, in Unicode, start/stop markers that denote 'language'?


All the others are not an intrinsic part of a char at all --they 
vary by context.



Why is language intrinsic to the text? An A is an A in any language.
At best language is intrinsic to sorting/comparing(case on non
case-sense) text


Comparing is a far more important operation than collating --or, 
rather, collation is achievable only if you can do proper comparisons.


Take this, for example:

if SameText(SomeString, SomeOtherString) then do ...

For this to work properly, in both 'SomeString' and 'SomeOtherString', 
you need to know which language *each* character belongs to.


If you don't have that information, you might as well not have a 
SameText() function in FPC.



Please note the 'case-INsensitive' keyword there.

Well I needed an actual example where case sense differs by language
(assuming we talk about language using the same charset (not comparing
Chinese whit English).


Here is a simple example for you:

if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ...

Now.. how are you going to decide that the SameText() call here returns 
true unless you have the information that the substring 'FoolStraße' is in 
German?


I know that this is a very simple example --that 'ß' exists only in 
German, and that you could infer that when you met that char.

But this highlights the problem --and there are times when you cannot 
infer.
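For what it's worth, Unicode full case folding already equates 'ß' with 'ss' with no language information at all; a quick check in Python, whose casefold() implements Unicode full case folding:

```python
# Unicode full case folding (language-blind) maps 'ß' to 'ss', while
# simple lowercasing leaves 'ß' alone -- the two operations disagree
# on exactly this example.
a = "I am on FoolStrasse"
b = "I am on FoolStraße"
folded_equal = a.casefold() == b.casefold()
lowered_equal = a.lower() == b.lower()
```

So a fold-based SameText would equate these two strings whether or not it knew the substring was German.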



In any case, I can write up several different algorithms how to do that.


Please do. SameText(), for one, will need all the help it can get.


What I can not do (or what I do not want to do) is to decide which of
them other people do want to use.


But isn't this exactly that? IOW, you're deciding what other people will 
NOT be able to use if you throw the 'language' attribute (for each char) 
out of the window..



Or, if this is not what you think of, please clarify by example..


Here is another typical example:

SameText('Istanbul', 'istanbul') can only return true when 'Istanbul' 
and 'istanbul' are both *not* in Turkish/Azerbaijani.


Otherwise, the same SameText() has to return false.
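The Turkish rule behind this can be illustrated in Python (a sketch: only the i/I pair is handled, and lower_tr is a hypothetical helper, not a library function):

```python
# Why Turkish needs language info: the default (language-neutral) case
# mapping pairs 'I' with 'i', which is right for German but wrong for
# Turkish, where lower('I') is dotless 'ı' and upper('i') is dotted 'İ'.
def lower_tr(s):
    """Minimal Turkish-aware lowercasing (assumption: only i/I handled)."""
    return s.replace("I", "ı").replace("İ", "i").lower()
```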


Re: Re[2]: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Mattias Gärtner
Quoting Joost van der Sluis [EMAIL PROTECTED]:

 On Friday 12-09-2008 at 13:22 [timezone +0200], JoshyFun wrote:

  A Thanks for pointing me to the Lazarus thread about this and the bug
  A report. Checked them.
  A But as I understand there is no solution available at the moment for
 this.
 
  I had partially solved the problem using the handler OnGetText ?
  (I'm not sure about the name) for each field which is somehow dirty
  forcing a codepage to UTF8 conversion (in Lazarus you will find some
  codepage-UTF conversions available).

 I think that the original poster didn't looked very well in the
 archives, this solution is told here quite often.

  A I have a database that is not encoded utf8 (and it will never be because
  A other client programs are accessing it and their users do not want/need
  A to be converted to unicode). How do I get the field values into
  A FPC/Lazarus into a string variable? Right now the non-unicode strings
  A are returned as empty from a database field due to FCL conversion
 functions.
 
  If you will need this as a fixed solution for this project maybe you
  can think in create a new database unit file based in the current one
  (change the name of course) with hardcoded UTF8 encoding from codepage
  for each string once retrieved from the database. Take care about
  string length as UTF8 ones will be equal or longer than the original
  ones.

 You can just override one single method to do this. This is also told a
 few times on this list.

Maybe it is not documented at the right place?

Mattias



Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Daniël Mantione



On Fri, 12 Sep 2008, listmember wrote:


This search needs to be NOT case-sensitive.

How can you do this?

Is it doable if TCharacter (or whatever you call it) has no 'language' 
attribute?


'I am on FoolStrasse' versus 'I am on FoolStraße' is not an upper/lower 
case issue: Strasse and Straße have the same casing. So yes, you can do a 
case-insensitive search.


The problem you describe does exist: ü and ue are equivalent in German, 
but not in Dutch. So someone searching for ü will also want to receive 
results for ue, while a Dutch-speaking person would not.
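Daniël's ü/ue point can be sketched as a search-time, per-language expansion (Python; the helper names are illustrative, not any real API):

```python
# Sketch: a search that expands German umlauts to their two-letter
# equivalents, but only when the text is declared German -- a Dutch
# search keeps 'ü' and 'ue' distinct.
def expand_de(s):
    for uml, exp in (("ü", "ue"), ("ö", "oe"), ("ä", "ae")):
        s = s.replace(uml, exp)
    return s

def found(needle, text, lang):
    if lang == "de":
        needle, text = expand_de(needle), expand_de(text)
    return needle in text
```

Note the language is a parameter of the search, not of the string, which matches Daniël's "fix it at the file format level" position.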


This, however, should not be fixed at the string level, but at the file 
format level. E.g. in HTML you can write div lang="nl". You could design a 
#27 escape code for text files if you'd like.


Daniël


Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Florian Klaempfl
Daniël Mantione wrote:
 
 
 On Fri, 12 Sep 2008, listmember wrote:
 
 This search needs to be NOT case-sensitive.

 How can you do this?

 Is it doable if TCharacter (or whatever you call it) has no 'language'
 attribute?
 
 'I am on FoolStrasse' versus 'I am on FoolStraße' is not a upper/lower
 case issue. Strasse and Straße have the same casing. So yes, you can do
 case-insensitive search.
 
 The problem you describe does exists. ü and ue are equivalent in German,

Not in both directions.

 but not in Dutch. So someone searching for ü will also want to receive
 results for ue,


Re: Re[2]: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Joost van der Sluis
On Friday 12-09-2008 at 15:56 [timezone +0200], Mattias Gärtner wrote:
 Quoting Joost van der Sluis [EMAIL PROTECTED]:
 
  On Friday 12-09-2008 at 13:22 [timezone +0200], JoshyFun wrote:
 
   A Thanks for pointing me to the Lazarus thread about this and the bug
   A report. Checked them.
   A But as I understand there is no solution available at the moment for
  this.
  
   I had partially solved the problem using the handler OnGetText ?
   (I'm not sure about the name) for each field which is somehow dirty
   forcing a codepage to UTF8 conversion (in Lazarus you will find some
   codepage-UTF conversions available).
 
  I think that the original poster didn't looked very well in the
  archives, this solution is told here quite often.
 
   A I have a database that is not encoded utf8 (and it will never be 
   because
   A other client programs are accessing it and their users do not want/need
   A to be converted to unicode). How do I get the field values into
   A FPC/Lazarus into a string variable? Right now the non-unicode strings
   A are returned as empty from a database field due to FCL conversion
  functions.
  
   If you will need this as a fixed solution for this project maybe you
   can think in create a new database unit file based in the current one
   (change the name of course) with hardcoded UTF8 encoding from codepage
   for each string once retrieved from the database. Take care about
   string length as UTF8 ones will be equal or longer than the original
   ones.
 
  You can just override one single method to do this. This is also told a
  few times on this list.
 
 Maybe it is not documented at the right place?

It is not documented at all, just like the rest of the database stuff. 
But maybe I should write a FAQ for FPC. With the new Lazarus versions 
using UTF-8 by default, this is asked quite often.

Joost



Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Martin Friebe

listmember wrote:

Martin Friebe wrote:

Just to make sure, all of this discussion is based on various collation

No part of this discussion is based on collation.

Ok, so we were talking about different things


Here is a scenario for you:

You have multilanguage text as data. Someone has asked you to search 
it and see if a certain peice of string (in a given language) exists 
in it.

This search needs to be NOT case-sensitive.
Actually, for your example, case doesn't matter, as you need to decide 
whether ss = ß.

How can you do this?
Is it doable if TCharacter (or whatever you call it) has no 'language' 
attribute?


For the purpose of case-sensitivity: I still do not know of a character 
(or rather a pair of upper- and lower-case chars) that maps differently in 
some languages.
Is there a pair of characters x and X which in some languages should 
match as upper/lower, but in other languages should not?
^^ ignore, found your example at the end of mail

Otherwise, how do I understand the case-insensitive part of your 
question? Because if x is the lowercase of X in *all* languages, 
then I do not need language-specific info to do the 
case-insensitive compare.


Sorry if I am still missing some point...

[Note that, here 'TCharacter' isn't necessarily an object; it might as 
well be a simple record structure.]

Yes we agreed on this part


Besides instead of storing it per char, you can use unused unicode as
start/stop markers. So it can be implemented on top of a string that
stores unicode-chars (and chars only, no attributes)

Is there, in Unicode, start-stop markes that denote 'language'?
I do not know; that was why I said unused Unicode, implemented on 
top (as part of the specific app).


IMHO The discussion splits here between:
1) How can this be done in a specific app
2) what should fpc provide

As for 2: This would sit on top of (afaik) still-missing basic functions 
such as:
Compare using collation x (where the collation is given as an argument to 
compare, not as part of any string)

Why is language intrinsic to the text? An A is an A in any language.
At best language is intrinsic to sorting/comparing(case on non
case-sense) text


Comparing is a lot more important an operation than collating --or, 
rather, collation is achieveable only if you can do proper comparisons.


Take this, for example:

if SameText(SomeString, SomeOtherString) then do ...
For this to work properly, in both 'SomeString' and 'SomeOtherString', 
you need to know which language *each* character belongs to.

I would rather say:
There are special cases where you need/want to know which language.

So I do not imply how special or non-special those cases are = you do 
not always need to know. (continued below on your example)




If you dont have that informtaion, you might as well not have a 
SameText() function in FPC.



Please note the 'case-INsensitive' keyword there.

Well I needed an actual example where case sense differs by language
(assuming we talk about language using the same charset (not comparing
Chinese whit English).


Here is a simple example for you:

if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ...

Well, that is a good question: do you always want that to return the same?
Busstop and Bußtop (yeah, the second is not a word, but it could occur 
in a text).


Also with names this comparison does not always apply:
the name Heiße (originally with ß) can be spelled as Heisse,
but the name Heisse (originally with ss) is never the same as Heiße.


But as for asking me: this is a specialized comparison, similar to Soundex 
(comparing the sound of 2 words, usually based on English).
Something like this is usually found in extension libraries, but not in 
the standard functionality of (many/most) languages.


In any case, I think this also has the minority problem: most people do 
not want to compare Pascal strings this way (if only because 
of the false positives).
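The false positives are easy to reproduce with language-blind Unicode case folding (Python check):

```python
# Language-blind ß→ss folding equates words a German reader would keep
# apart -- exactly the Busstop/Bußtop and Heisse/Heiße cases above.
bus_collide = "Busstop".casefold() == "Bußtop".casefold()    # both fold to 'busstop'
name_collide = "Heisse".casefold() == "Heiße".casefold()     # names collide too
```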




That does not mean that I say such functionality is not desirable. It 
would be great to have a unit that can be used if needed.


Based on the idea that these are optional (or 3rd-party) functions, the 
normal string would not provide for this. (Besides, attaching info to 
each char would probably be too costly, even if implemented in the FPC 
core string.)
Functions like this could take an additional structure declaring the 
start/stop/change point of every language.




In any case, I can write up several different algorithms how to do that.

Please do. SameText(), for one, will need all the help it can get.
The initial comment was based on collation, and would basically have 
been about prioritizing in conflicts.


There are 2 parts:
1) identifying the language.

I would recommend a separate structure with all language start points. 
It takes some work to maintain, but it should work.
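Martin's "separate structure with language start points" might look like this (Python sketch; the names and the 'xx' language tag are illustrative only):

```python
# Sketch: the text stays a plain string; a side list of (start, lang)
# runs records which language each span belongs to.
text = "UP LOW"
runs = [(0, "xx"), (3, "en")]  # chars 0-2 in language 'xx', 3+ in English

def lang_at(runs, i):
    """Return the language governing character index i."""
    lang = runs[0][1]
    for start, l in runs:
        if start <= i:
            lang = l
    return lang
```

A language-aware compare would consult this structure instead of per-char attributes, keeping the string itself cheap.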


Alternatively, use a dynarray instead of a string. Define a record holding 
all the per-char info that you need, and overload all operators for your 
dynarray type, to behave as 

Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread listmember

Actually for you example case doesn't matter. as you need to decide if
ss = ß


And this is only valid in German. For all others, the result must either 
be false or undefined.



Is there, in Unicode, start-stop markes that denote 'language'?



I do not know, that was why I said unused unicode and implemented on
top (as part of the specific app)


As far as I know, there isn't a language delimiter in Unicode.


IMHO The discussion splits here between:
1) How can this be done in a specific app
2) what should fpc provide

as for 2: This would be on top of yet (afaik) missing basic functions
such as
Compare using collation x (where collation is given as argument to
compare, not as part of any string)


I think we're beginning to be on the same page --but, please, can you 
refrain from using the word 'collation'? Every time I see it in this 
context, I feel a strong need to open the window and shout collation 
isn't the most important/used part of a language wrt programming :)



Take this, for example:

if SameText(SomeString, SomeOtherString) then do ...
For this to work properly, in both 'SomeString' and 'SomeOtherString',
you need to know which language *each* character belongs to.



I would rather say:
There are special cases where you need/want to know which language


Yes. And, if we're on our way to making FPC Unicode-enabled, we need to 
take these special cases into account. Otherwise, we will likely end up 
with a half-baked 'solution'.



So I do not imply how special or none special those cases are = you do
not always need to know. (continued below on your example)


Why would I need to ALWAYS need it? Isn't 'needed when necessary' good 
enough?



2) actual compare, you need to normalize all strings before comparing,
then compare the normalized string as bytes.

normalizing means for each char to decide how to represent it. German
ae could be represented as a umlaut for the compare.
Or (in German text) you expand all umlaute first.


IOW, SameText() and similar stuff must take normalization into account.

But, you do know that 'normalization' is a very rough approximation and 
lands you in some very embarrassing situations.


Here are 2 words from Turkish:

1) 'sıkıcı' which means 'boring' in English (notice the dotless small 'i's)

2) 'sikici' which means 'fucker' in English

Now, when you normalize these you get 'SIKICI' for both --which, then, 
you would assume to be the same.

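The collision is reproducible with any language-neutral case mapping (Python check):

```python
# The collision described above: language-neutral uppercasing maps both
# dotless 'ı' and dotted 'i' to 'I', merging the two distinct words.
boring = "sıkıcı".upper()
rude = "sikici".upper()
```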

Well.. I'd like to see you (or your boss) when you've come up with all 
those 'fucker's instead of all those 'boring' old farts you were looking 
for :P


[You might probably think of a German --or some other language-- example]

IOW, what I am trying to tell you is that normalization isn't really 
useful --it is, IMO, a stopgap solution along the path of Unicode evolution.



BUT of course there is no way to deal with the ambiguous Busstop


Indeed. For this case, you need to know what language Busstop was 
written in.



What I can not do (or what I do not want to do) is to decide which of
them other people do want to use.

But, isn't this just that: IOW, you're deciding what other people will
NOT want to use if you throw the 'language' attribute (for each char)
out of the window..



True, I am happy to do that. NOT


I am glad we have met :)


Why, you can always extend this. Store your string in any of the following
ways:
1) every 2nd char is a language attribute, not a char
2) store the language attributes in a 2nd string, and always pass both
strings around


Of course, these and even more creative hacks could be devised.

The question is: is language an attribute of a Unicode character?


SameText('Istanbul', 'istanbul') can only return true when both
'Istanbul' and 'istanbul' are *not* in Turkish/Azerbeijani.



OK, that's what I did not know. But still, in most cases it will be fine to do
SameText('Istanbul', 'istanbul', lGerman)
SameText('Istanbul', 'istanbul', lTurkish)
i.e. decide at the time of comparing.


Well, the prototype I had in mind was:

SameText('Istanbul', 'istanbul', lGerman, lTurkish)

where the defaults for the latter 2 parameters would be lUnknown --this 
way, people who needn't be bothered about these would not even notice.



If, however, the info was stored on the string (or char), what if one was
Turkish and the other German?


SameText('Istanbul', 'istanbul', lTurkish, lGerman)

This one must return FALSE since, in Turkish, the uppercase of the dotted 
small 'i' is the DOTTED capital 'İ'.


and,

SameText('Istanbul', 'istanbul', lGerman, lGerman)

will return TRUE, since uppercasing both sides results in the same string.
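The proposed four-argument SameText could be sketched like this (Python; an assumption-laden sketch in which only the Turkish i/I rule is modeled and the language tags are illustrative):

```python
# Sketch of SameText(A, B, LangA, LangB): lowercase each side under its
# own language's rules, then compare. 'unknown' falls back to the
# language-neutral mapping, matching the proposed lUnknown default.
def lower(s, lang):
    if lang == "tr":  # Turkish: I -> ı, İ -> i
        s = s.replace("I", "ı").replace("İ", "i")
    return s.lower()

def same_text(a, b, lang_a="unknown", lang_b="unknown"):
    return lower(a, lang_a) == lower(b, lang_b)
```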


Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread listmember

[Note that, here 'TCharacter' isn't necessarily an object; it might as
well be a simple record structure.]


AFAIK for most programmers this is not a common task. Most programs need less
(one language or codepage)


But when you're talking Unicode, a codepage is rather meaningless --isn't it?


or more (phonetic, semantic, statistical search).
Can you explain, why you think that this particular problem requires compiler
magic?


See my other reply to Martin Friebe, in another sub thread.


Is there, in Unicode, start-stop markes that denote 'language'?


Is it needed?
Are there any Unicode characters whose upper/lower case depends on the language?


Yes. See my other reply to Martin Friebe, in another sub thread.


Take this, for example:

if SameText(SomeString, SomeOtherString) then do ...

For this to work properly, in both 'SomeString' and 'SomeOtherString',
you need to know which language *each* character belongs to.


Comparing texts can be done with various meanings, for example: byte comparison,
simple case-insensitive comparison, non-literal comparison, compare like this
library, 
Which one do you mean?


Byte comparison isn't what I am worried about.

In every language, there are pretty well-known and (by now) fixed rules that 
apply to string comparison. I am referring to those rules.



[...]
Here is a simple example for you:

if SameText('I am on FoolStrasse', 'I am on FoolStraße') then do ...

Now.. how are you going to decide that SameText() function here returns
true unless you have information that the substring 'FoolStraße' is in
German?


The two strings are in the same language, but written with different 
orthography (Rechtschreibung). You need dictionaries and spelling systems 
to implement such comparisons. This is beyond a compiler or an RTL.


Are you sure? I was under the impression that Unicode covers these 
--without needing further data.



What about loan words?


For all practical purposes, 'loan words' belong to the language they are 
used in.


Except the case where we'd be discussing etymology.


SameText('Istanbul', 'istanbul') can only return true when both
'Istanbul' and 'istanbul' are *not* in Turkish/Azerbeijani.

Otherwise, the same SameText() has to return false.


I doubt that it is that easy.


Well.. I never said that it would be that easy.

But, if you strip the language attribute off the character, it will be 
impossible --or several orders of magnitude harder-- for those people who 
need it.


You can, of course, ignore all that.

But, then, what is the point of going unicode?

We were just fine doing things ANSI-centric..

Weren't we?
___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread Martin Friebe

listmember wrote:

IMHO The discussion splits here between:
1) How can this be done in a specific app
2) what should fpc provide

as for 2: This would be on top of (afaik) still-missing basic functions,
such as:
compare using collation x (where the collation is given as an argument to
compare, not as part of any string)
I think we're beginning to be on the same page --but, please, can you 
refrain from using the word 'collation'; every time I see that in this 
context, I feel a strong need to open the window and shout collation 
isn't the most important/used part of a language wrt programming :)
Sorry, but I meant comparing with collation. I did not mean comparing 
within a language context.


language context is too complex to be basic (see busstop below)

2) actual compare, you need to normalize all strings before comparing,
then compare the normalized string as bytes.

normalizing means deciding, for each char, how to represent it. German
ae could be represented as an umlaut for the compare.
Or (in German text) you expand all Umlaute first.


IOW, SameText() and similar stuff must take normalization into account.

But, you do know that 'normalization' is a very rough assumption and can 
land you in some very embarrassing situations.


Here is 2 words from Turkish.

1) 'sıkıcı' which means 'boring' in English (notice the dotless small 
'i's)


2) 'sikici' which means 'fucker' in English
Depends how you normalize. Normalization should substitute all *equal* 
letters (or combinations thereof) into one single form. That allows 
comparing and matching them.
But yes, even this is very limited (busstop), because even if you know 
the language of the word (German in my example) you do not know its meaning.


Without a full dictionary, you do not know if ss and the German sharp-s are 
the same or not.
So basically, what you want to do can only be done with a full 
dictionary. Or you have to accept false positives.


I also fail to see why a utf8 string is a half-baked solution. It will 
serve most people fine. It can be extended for those who want more.


IMHO this is a case for an add-on library.
And apparently no one has yet volunteered to write it



Now, when you normalize these you get 'SIKICI' for both which --then-- 
you would assume to be the same.
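That conflation is easy to demonstrate. With a Turkish-aware uppercase (sketched in Python purely for illustration; `tr_upper` is a hypothetical helper, not a real FPC or Python API) the two words stay distinct:

```python
def tr_upper(s):
    # Turkish-aware uppercase: 'i' -> 'İ' (dotted), 'ı' -> 'I' (dotless).
    # Hypothetical helper; real code would use a locale-aware library.
    return s.replace('i', 'İ').replace('ı', 'I').upper()

# Language-blind uppercasing maps both words to the same string:
assert 'sıkıcı'.upper() == 'SIKICI' == 'sikici'.upper()
# Turkish-aware uppercasing keeps them apart:
assert tr_upper('sıkıcı') == 'SIKICI'
assert tr_upper('sikici') == 'SİKİCİ'
```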



BUT of course there is no way to deal with the ambiguous Busstop


Indeed. For this case, you need to know what language Busstop was 
written in.
You need a dictionary. Knowing it is German is not enough, because all 
that its being German tells you is that ss may be a sharp-s, but doesn't 
have to be.

What I can not do (or what I do not want to do) is to decide which of
them other people do want to use.

But, isn't this just that: IOW, you're deciding what other people will
NOT want to use if you throw the 'language' attribute (for each char)
out of the window..

True, I am happy to do that. NOT

I am glad we have met :)

have we? I remember a mail conversation, but not an actual meeting :) SCNR

Well, you can always extend this. Store your string in any of the following
ways:
1) every 2nd char is a language attribute, not a char
2) store the language attributes in a 2nd string, always pass both
strings around


Of course, these and even more creative hacks could be devised.
But the question is: is the language an attribute of a Unicode character?

(I assume mandatory attribute)

Well, as much as it is or is not an attribute of a latin1 or iso-whatever 
char.


I do not think it is. I have no proof. But a lot of people seem to think 
so: if I google Unicode (or any other char/latin/iso...) I get nice 
character tables, and no language info.




Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread listmember

Sorry, but I meant comparing with collation. I did not mean comparing
within a language context.


How can you do /proper/ collation while ignoring the language context?


1) 'sıkıcı' which means 'boring' in English (notice the dotless small
'i's)

2) 'sikici' which means 'fucker' in English



Depends how you normalize. Normalization should substitute all *equal*
letters (or combinations thereof) into one single form. That allows
comparing and matching them.


Again, we're not quite on the same page here...

What you're referring to is more like 'Text Normalization' [ 
http://en.wikipedia.org/wiki/Text_normalization ], where you do 
definitely need a very comprehensive dictionary so that '1' is equal to 
'one' and '1st' is 'first', etc. (if your language is English).


Whereas, what I am referring to is 'Unicode Normalization' [ 
http://en.wikipedia.org/wiki/Unicode_normalization ].


This one is much narrower in scope. It deals basically with what I can 
refer to as 'character glyphs'.


Now, from what I understand from the definitions of 'Unicode 
Normalization' there are 2 ways of doing it:


1) You decompose both texts (so that you have all 'weird' characters 
expanded into their combining characters)


2) You compose both texts (so that you have as few combining 
characters as possible, or none at all)


This is done, obviously, to get them both in the same format --to make 
life easier to compare.


If you do no other operation on these two texts before you compare them, 
this is called a Canonical Equivalence Test --each 'character glyph' in 
each text must be the same.


For a Canonical Equivalence Test, you do not need to have any 'language' 
attribute --after all, you're doing a simple byte-wise test.
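Such a canonical-equivalence check can be illustrated with Python's `unicodedata` module (again only to show the Unicode mechanism, not FPC code): NFC composes, NFD decomposes, and no language attribute is involved.

```python
import unicodedata

decomposed = 'n\u0303'   # 'n' followed by U+0303 COMBINING TILDE
composed = '\u00f1'      # precomposed 'ñ' (U+00F1)

# Code-point-wise the strings differ...
assert decomposed != composed
# ...but after normalization to a common form they compare equal:
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```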


On the other hand, if you wish to do a broader comparison, a 
Compatibility Equivalence Test or something similar, you will need to do a 
little more work on those texts:


Normalization is one of them. I suggest you take a look at the 
'Normalization' heading under 
http://en.wikipedia.org/wiki/Unicode_normalization


Trouble with the 'Normalization' described there is, it is far too crude 
for quite a lot of purposes.


A better form of comparison is converting both texts to either uppercase 
or to lowercase.


And, once we do this, we hit two walls (or obstacles) to overcome. The 
steps I can think of are:


1) Equivalent code points. We need first to 'compose' the text and then 
substitute the relevant (and preferred) equivalent code points for any 
'character glyph's in the texts.


2) We also need to take care of stuff like language dependent case 
transforms. See http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I
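Steps 1 and 2 combined might look like the sketch below (Python for illustration; the `lang` parameter is a stand-in for proper locale tailoring, and only the Turkish 'I' rule is shown):

```python
import unicodedata

def same_text(a, b, lang=None):
    def fold(s):
        s = unicodedata.normalize('NFC', s)      # step 1: compose equivalents
        if lang == 'tr':                         # step 2: language-dependent case
            s = s.replace('İ', 'i').replace('I', 'ı')
        # normalize again: case folding can produce decomposed sequences
        return unicodedata.normalize('NFC', s.casefold())
    return fold(a) == fold(b)

assert same_text('n\u0303', '\u00f1')                    # canonical equivalence
assert same_text('Istanbul', 'istanbul')                 # default rules
assert not same_text('Istanbul', 'istanbul', lang='tr')  # Turkish rules
```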


As far as I know, this is the only 'proper' thing to do for search and 
comparison operations under unicode.


I know it will be slower, but, that is the price to pay.

Note: The reason I used the term 'character glyphs' is because several 
codepoints can be combined to make a 'character glyph'.


See the definition of Code Point [ http://unicode.org/glossary/ ] which 
says:


Code Point: Any value in the Unicode codespace; that is, the range of 
integers from 0 to 10FFFF (hexadecimal).


As an example, from the above Wiki article, we can use 2 code points to 
produce a 'character glyph', such as


'n' + '~' (combining tilde) --> 'ñ'


But yes, even this is very limited (busstop), because even if you know
the language of the word (German in my example) you do not know its
meaning.


You do not worry about the meaning at all. In all languages (I guess) 
there are several words that may be written the same but mean different 
things.



Without a full dictionary, you do not know if ss and german-sharp-s are
the same or not.


True. But, if you do know it is in German, then you definitely know they 
are. And, this makes a lot of difference.



So basically what you want to do, can only be done with a full
dictionary. Or you have to accept false positives.


Nope. No false positives at the text level.

You can always, of course, get false positives at the semantic level --such 
as when you're looking for 'apple' (the fruit) and 'Apple' (the brand 
name), but that's a completely different problem.



I also fail to see why a utf8 string is a half baked solution. It will
serve most people fine. It can be extended for those who want more.


I have nothing against UTF-8 or any other encoding scheme. It is just 
that --an encoding scheme. Most handy as a means of transporting data from 
one medium/app to another.


But, UTF-8 does in no way cover the whole of Unicode, nor is it a complete 
solution for dealing with Unicode. It is, after all, an encoding scheme.



BUT of course there is no way to deal with the ambiguous Busstop


Not even if you knew that Busstop was a German string?


Indeed. For this case, you need to know what language Busstop was
written in.

You need a dictionary. Knowing it is German is not enough, because all
that its being German tells you is that ss may be a sharp-s, but doesn't
have to be.


A dictionary, then, wouldn't help you either because 

Re: [fpc-devel] Unicodestring branch, please test and help fixing

2008-09-12 Thread ABorka

 It is not documented at all. Just like the rest of the database-stuff.
 But maybe I should write a FAQ for fpc. With the new lazarus-versions
 using UTF-8 by default, this is asked quite often.

This would be really nice.

I know I'm not the only one who doesn't want to spend days on hacking 
and debugging the components and FCL code to find out why the database 
field values disappear/morph before reaching my program code when they 
didn't before. People will start using these new unicode-based 
development tools and this problem will be there for all of them (and 
the problem is not only with the DBAware components, but also when using a 
simple FieldByNameAsString and putting the value into a normal control).


A transparent solution would be best --like the FCL doing conversions 
back and forth automatically from the database codepage when asked to-- 
but I guess that is too much to ask for. :) Maybe not even possible.


Thank you for the help, guys. I'll try to dig up more info from the 
mailing list archives when I have time.



Joost van der Sluis wrote:

Op vrijdag 12-09-2008 om 15:56 uur [tijdzone +0200], schreef Mattias
Gärtner:

Zitat von Joost van der Sluis [EMAIL PROTECTED]:


Op vrijdag 12-09-2008 om 13:22 uur [tijdzone +0200], schreef JoshyFun:


Thanks for pointing me to the Lazarus thread about this and the bug
report. Checked them.
But as I understand it, there is no solution available at the moment for
this.

I had partially solved the problem using the handler OnGetText
(I'm not sure about the name) for each field, which somewhat dirtily
forces a codepage-to-UTF8 conversion (in Lazarus you will find some
codepage-UTF conversions available).

I think the original poster didn't look very well in the
archives; this solution is mentioned here quite often.


I have a database that is not encoded utf8 (and it will never be, because
other client programs are accessing it and their users do not want/need
to be converted to unicode). How do I get the field values into
FPC/Lazarus into a string variable? Right now the non-unicode strings
are returned as empty from a database field due to FCL conversion
functions.

If you need a fixed solution for this project, maybe you
can think of creating a new database unit file based on the current one
(change the name, of course) with hardcoded UTF8 encoding from the codepage
for each string once retrieved from the database. Take care about
string length, as the UTF8 strings will be equal in length or longer than
the original ones.
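Stripped of the FCL specifics, the suggested per-string conversion is just decode-from-codepage, re-encode-as-UTF-8. A Python illustration (cp1252 is only an example codepage, not necessarily what the poster's database uses):

```python
# A value as a single-byte-codepage database would store it:
raw = 'Gärtner'.encode('cp1252')             # one byte per character
# The conversion the overridden method would perform:
utf8 = raw.decode('cp1252').encode('utf-8')  # 'ä' becomes two bytes

# The UTF-8 form is equal in length or longer, as noted above:
assert len(utf8) >= len(raw)
assert utf8.decode('utf-8') == 'Gärtner'
```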

You can just override one single method to do this. This is also told a
few times on this list.

Maybe it is not documented at the right place?


It is not documented at all. Just like the rest of the database-stuff.
But maybe I should write a FAQ for fpc. With the new lazarus-versions
using UTF-8 by default, this is asked quite often.

Joost






Re: [fpc-devel] FPC 2.2.2 on Linux/SPARC

2008-09-12 Thread Mark Morgan Lloyd

Jonas Maebe wrote:

On 11 Sep 2008, at 15:02, Mark Morgan Lloyd wrote:



I've had no success trying to drive fpc interactively across ttys to get
to the point of failure.


Indeed, the tty redirection doesn't work very well in combination with 
raw terminal modes.


After experimentation I was able to do better without redirection, but 
logging to a file.


If I'm interpreting that correctly then either the command blocks in 
#14 and #13 are wrong when the Pascal code is optimised (I have not 
recompiled the GDB interface library), or find_pc_sect_line is finding 
something it doesn't like. In case it's the former here are the full 
lines:


I don't see what's wrong in the code at first sight. The command in 
frame 14 already seems wrong though, since frame 15 is gdbcon.pp:251, 
which should pass 'run' rather than 'cd ' to gdb (the 'cd ' stuff comes 
from gdbcon.pp:195).


But I'm not an IDE developer in any way, I'm purely a compiler/rtl 
person. There are few IDE developers left though.


OK, but your recognition of something wrong in the gdb command blocks 
seems to suggest that fpc is mis-compiling the fp IDE when optimisation 
is enabled. I find that if I copy a non-optimised fp over an optimised 
one (i.e. with fpc and all ppus being optimised) then fp drives the 
debugger properly --at least as far as I've been able to test it.


I've run the test suite for the compiler with and without optimisation, 
so far I don't see any revealing differences in the output but I might 
be overlooking something significant due to inexperience.


--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]


[fpc-devel] Can't read file written as result of TFileStream

2008-09-12 Thread Alex Simachov
Hello All,


I'm writing file using TFileStream (0.9.25 lazarus/fpc 2.2.2/mac os x 10.5.4).

Code:
TFileStream.Create(Filename,fmCreate,WRITE_FILEMODE);

But then I can't read it because I don't have any
permissions (even to read) for this file.

(File is created under /Users/MyName/ directory. )

Any suggestions?

Thanks in advance for answer!

P.S. I'm new to this list and I haven't found any FAQ/rules --only a note
that this list is intended for technical questions/problems.
So if I should ask on another list, please let me know.

Sincerely yours,
Alex Simachov
