Mattias Gaertner wrote:
On Fri, 25 Jan 2008 17:25:07 +0100
Ales Katona <[EMAIL PROTECTED]> wrote:
Mattias Gärtner wrote:
The character sets in synedit are 'set of char', which means 8-bit
only. So I guess the patch tries to fix an ANSI-codepage
accented-chars problem, right?
The fix is probably useless on other codepages including UTF-8,
right?
Not as such. The problem is twofold.
1. If we ignore encoding (eg: just work in ansi space), then the old
style was simply plain wrong. It only allowed alpha (not numeric) chars
and worked on the principle of "what's not alpha, isn't a word" (see
the sketch below).
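Roughly, the old test boiled down to something like this (the name is
made up, not the actual synedit identifier):

function IsWordChar(c: char): boolean;
begin
  // letters only; digits, '_' and accented bytes all fall through
  // to "not a word"
  Result := c in ['a'..'z', 'A'..'Z'];
end;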
True. But at least it is reliable.
I disagree here; the functions were a bit inconsistent before (eg: in
their usage of info from the highlighter).
For what codepages do the patch work and for what codepages does it
not work?
Right now it depends on whether you have a highlighter assigned. If
yes, then the highlighter's WordBlockChars set is used; otherwise the
TSynWordBlockChars constant is used (which currently is also assigned
to all highlighters anyway, afaik).
The constant currently contains only utf-8-safe "block" chars (all
below #128), so it's more or less future-proof on this front.
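The lookup is roughly this (a sketch only; TSynIdentChars is synedit's
'set of char' alias, and the helper name here is made up):

function EffectiveBlockChars(HL: TSynCustomHighlighter): TSynIdentChars;
begin
  if Assigned(HL) then
    Result := HL.WordBlockChars    // per-highlighter set
  else
    Result := TSynWordBlockChars;  // the default constant
end;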
Maybe the set/check should be configurable. The IDE will eventually
only pass UTF-8 to synedit. Then we need a UTF-8 word boundary test.
Yes, and I'm not 100% sure what all that would constitute (eg: I
don't think there's a valid block char in the multibyte range), but
for 99% of usages the current blockchars (+ whitechars), which are all
below #128, seem to be working fine.
2. If we also consider UTF-8 encoded content, then getting words by
boundaries (eg: not-allowed chars) rather than by allowed chars means
that as long as the given boundaries and whitespace chars are below
#128 (which the default ones are), UTF-8 words will be parsed
correctly, even if they contain multibyte chars.
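To illustrate, a self-contained sketch of boundary-based scanning (the
sets here are examples, not the exact values from the patch):

const
  WhiteChars = [' ', #9];
  BlockChars = ['.', ',', ';', ':', '(', ')', '[', ']', '{', '}'];

function NextWordStart(const Line: string; FromPos: integer): integer;
var
  i: integer;
begin
  i := FromPos;
  // skip the rest of the current word; UTF-8 bytes #128..#255 are
  // never in the sets, so multibyte chars pass through untouched
  while (i <= Length(Line)) and
        not (Line[i] in BlockChars + WhiteChars) do
    Inc(i);
  // skip the separators to reach the start of the next word
  while (i <= Length(Line)) and
        (Line[i] in BlockChars + WhiteChars) do
    Inc(i);
  Result := i;
end;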
I'm not sure if #2 also applies to some other encodings.
UTF-8 uses #128..#255 only for multibyte sequences; #0..#127 is plain
ASCII, like in most other 8-bit codepages.
Yes, that's exactly why using blocks instead of allowed chars is
better IMHO: since the block chars are all below #128, parsing by them
lets utf-8 words work in synedit now, whereas before the patch any
accented char would end up being treated as a block.
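To make that concrete: slovak 'č' is the two bytes #196#141 in utf-8.
Neither byte is alpha, so the old test cut the word at that point;
neither byte is a block char, so the new test leaves the word whole.
A tiny demo:

program Utf8ByteDemo;
const
  Cup = #196#141'aj'; // the utf-8 bytes of 'čaj' (slovak for tea)
var
  i: integer;
begin
  for i := 1 to Length(Cup) do
    WriteLn(Ord(Cup[i]), ' alpha=', Cup[i] in ['a'..'z', 'A'..'Z']);
  // prints 196 alpha=FALSE, 141 alpha=FALSE, 97 alpha=TRUE,
  // 106 alpha=TRUE
end.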
The patch isn't trying to make synedit fully utf-8 valid, but at least
it enables word parsing (boundaries, nextwordpos etc.) on utf-8 text
now, with the charset set to UNICODE (I tested with slovak only,
though, which contains just 2-byte sequences). And lazarus didn't stop
working either :D
Mattias
Ales
P.S.: I think synedit will need a lot more work to be 100% "utf-8
ready" on all fronts. All the "set of char" things will have to go,
and we'd have to implement utf-8 "utf8string[x]" operations/functions
(afaik fpc doesn't have them yet?).
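For instance, the byte length of the codepoint at a given position
would be the basic building block for those operations; a sketch (not
an existing fpc/LCL routine, afaik):

function Utf8CharLen(const s: string; BytePos: integer): integer;
begin
  case s[BytePos] of
    #0..#127:   Result := 1;  // plain ASCII
    #192..#223: Result := 2;  // lead byte of a 2-byte sequence
    #224..#239: Result := 3;  // 3-byte sequence
    #240..#247: Result := 4;  // 4-byte sequence
  else
    Result := 1;  // trail or invalid byte; count it as one
  end;
end;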
_________________________________________________________________
To unsubscribe: mail [EMAIL PROTECTED] with
"unsubscribe" as the Subject
archives at http://www.lazarus.freepascal.org/mailarchives