Mattias Gaertner wrote:
On Fri, 25 Jan 2008 17:25:07 +0100
Ales Katona <[EMAIL PROTECTED]> wrote:

Mattias Gärtner wrote:
The character sets in SynEdit are 'set of char', which means only
8-bit. So I guess the patch tries to fix an ANSI-codepage accented
chars problem, right?
The fix is probably useless on other codepages, including UTF-8,
right?
Not as such. The problem is twofold.

1. If we ignore encoding (e.g. just work in ANSI space), then the old style was simply wrong. It only allowed alphabetic (not numeric) chars,
and worked on the principle of "what's not alpha isn't a word".

True. But at least it is reliable.
I disagree here; the functions were a bit inconsistent before (e.g. in their usage of info from the highlighter).
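To illustrate point 1 above: here is a minimal Python sketch of the old alpha-only rule (the real code is Pascal; the function name and the input are purely illustrative):

```python
def words_alpha_only(text: str) -> list:
    """Sketch of the old SynEdit-style rule: a word is a maximal run of
    alphabetic characters; digits, underscores and everything else act
    as boundaries ("what's not alpha isn't a word").
    Note: Python's isalpha() is Unicode-aware, while the Pascal
    'set of char' version was limited to 8-bit chars."""
    words, cur = [], ""
    for ch in text:
        if ch.isalpha():
            cur += ch
        else:
            if cur:
                words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

# The identifier "my_var2" is split apart, which is clearly wrong:
print(words_alpha_only("my_var2 := 10"))  # ['my', 'var']
```

This shows why "only alpha chars are words" breaks even in plain ANSI text: perfectly ordinary identifiers containing digits or underscores get chopped up.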
For which codepages does the patch work, and for which codepages does it
not?
Right now it depends on whether you have a highlighter assigned. If yes, the highlighter's WordBlockChars set is used; otherwise the TSynWordBlockChars constant is used (which, AFAIK, is currently also assigned to all highlighters anyhow).

The constant currently contains only UTF-8 "blocks", so it's more or less future-proof on this front.
Maybe the set/check should be configurable. The IDE will eventually
pass only UTF-8 to SynEdit. Then we need a UTF-8 word boundary test.


Yes, and I'm not 100% sure what all that would constitute (e.g. I don't think there's a valid block char in the multibyte range), but for 99% of usages the current block chars (plus white chars), which are all < #127, seem to be working fine.
2. If we also consider UTF-8 encoded content, then getting words by boundaries (e.g. not-allowed chars) rather than by allowed chars means
that as long as the given boundaries and whitespace chars are < #127 (which the
default ones are), UTF-8 words will be parsed correctly, even if they
contain multibyte chars.

I'm not sure if #2 applies also to some other encoding.

UTF-8 uses #128..#255. #0..#127 is plain ASCII like most other
8-bit codepages.
Yes, that's exactly why using block chars instead of allowed chars is better IMHO: since the block chars are all below #127, parsing by them allows UTF-8 words to work in SynEdit now, whereas before the patch any accented char would end up being treated as a block char.
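A hedged Python sketch of the boundary-based approach described above (BLOCK_CHARS here is an illustrative stand-in for SynEdit's TSynWordBlockChars, not its actual contents): because every byte of a UTF-8 multibyte sequence is >= #128, it can never match an ASCII-range block char, so accented words survive intact.

```python
# Hypothetical block-char set; all members are ASCII, i.e. below 128.
BLOCK_CHARS = set(b' \t.,;:"\'()[]{}=+-*/<>!')

def words_by_boundaries(utf8_bytes: bytes) -> list:
    """Split on block chars only. Bytes >= 128 (parts of UTF-8
    multibyte sequences) are never block chars, so multibyte
    characters always stay inside their word."""
    words, cur = [], bytearray()
    for b in utf8_bytes:
        if b in BLOCK_CHARS:
            if cur:
                words.append(cur.decode('utf-8'))
            cur = bytearray()
        else:
            cur.append(b)
    if cur:
        words.append(cur.decode('utf-8'))
    return words

# Accented (2-byte) Slovak characters are kept inside the words:
print(words_by_boundaries('veľká čiara := 1'.encode('utf-8')))
# ['veľká', 'čiara', '1']
```

Note that digits are not block chars either, so "1" correctly counts as a word, unlike under the old alpha-only rule.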

The patch isn't trying to make SynEdit fully UTF-8 valid, but it at least enables word parsing (boundaries, NextWordPos etc.) on UTF-8 text now, with the charset set to UNICODE (I only tested with Slovak, which contains 2-byte sequences only, though). And Lazarus didn't stop working either :D

Mattias
Ales

P.S.: I think SynEdit will need a lot more work to be 100% "UTF-8 ready" on all fronts. All the "set of char" things will have to go, and we'd have to implement UTF-8 "utf8string[x]" operations/functions (AFAIK FPC doesn't have them yet?).
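A hedged sketch of what such a codepoint-indexing helper could look like (Python rather than Pascal; utf8_char_at is a hypothetical name). It relies only on the standard UTF-8 rule that continuation bytes match the bit pattern 10xxxxxx (0x80..0xBF), so walking lead bytes counts codepoints:

```python
def utf8_char_at(buf: bytes, index: int) -> str:
    """Return the index-th codepoint (1-based, like Pascal string
    indexing) from a UTF-8 byte buffer.
    A byte starts a new codepoint iff it is NOT a continuation
    byte, i.e. (b & 0xC0) != 0x80."""
    count, start = 0, None
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:        # lead byte of a new codepoint
            count += 1
            if count == index:
                start = i             # the requested char begins here
            elif start is not None:
                return buf[start:i].decode('utf-8')
    if start is not None:
        return buf[start:].decode('utf-8')
    raise IndexError(index)

print(utf8_char_at('čau'.encode('utf-8'), 1))  # č
```

This is O(n) per lookup rather than the O(1) of plain `s[x]`, which is exactly why such helpers need to be explicit functions instead of the built-in indexing operator.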
_________________________________________________________________
     To unsubscribe: mail [EMAIL PROTECTED] with
                "unsubscribe" as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives

