Mattias Gaertner wrote:
On Fri, 25 Jan 2008 17:25:07 +0100
Ales Katona <[EMAIL PROTECTED]> wrote:

Mattias Gärtner wrote:
The character sets in SynEdit are 'set of char', which means only
8-bit. So I guess the patch tries to fix an ANSI-codepage accented
chars problem, right?
The fix is probably useless on other codepages, including UTF-8,
right?
Not as such. The problem is twofold.

1. If we ignore encoding (e.g. just work in ANSI space), then the old style was simply wrong. It only allowed alphabetic (not numeric) chars,
and worked on the principle of "what's not alpha isn't a word".

True. But at least it is reliable.
I disagree here; the functions were a bit inconsistent before (e.g. in their usage of info from the highlighter).
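To illustrate point 1 above: here is a minimal Python sketch of the old alpha-only rule (the real code is Pascal; the function name and the input are purely illustrative):

```python
def words_alpha_only(text: str) -> list:
    """Sketch of the old SynEdit-style rule: a word is a maximal run of
    alphabetic characters; digits, underscores and everything else act
    as boundaries ("what's not alpha isn't a word").
    Note: Python's isalpha() is Unicode-aware, while the Pascal
    'set of char' version was limited to 8-bit chars."""
    words, cur = [], ""
    for ch in text:
        if ch.isalpha():
            cur += ch
        else:
            if cur:
                words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

# The identifier "my_var2" is split apart, which is clearly wrong:
print(words_alpha_only("my_var2 := 10"))  # ['my', 'var']
```

This shows why "only alpha chars are words" breaks even in plain ANSI text: perfectly ordinary identifiers containing digits or underscores get chopped up.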
For which codepages does the patch work, and for which codepages does it
not?
Right now it depends on whether you have a highlighter assigned. If yes, the highlighter's WordBlockChars set is used; otherwise the TSynWordBlockChars constant is used (which, AFAIK, is currently also assigned to all highlighters anyhow).

The constant currently contains only UTF-8 "blocks", so it's more or less future-proof on this front.
Maybe the set/check should be configurable. The IDE will eventually
pass only UTF-8 to SynEdit. Then we need a UTF-8 word boundary test.


Yes, and I'm not 100% sure what all that would constitute (e.g. I don't think there's a valid block char in the multibyte range), but for 99% of usages the current block chars (plus white chars), which are all < #127, seem to be working fine.
2. If we also consider UTF-8 encoded content, then getting words by boundaries (e.g. not-allowed chars) rather than by allowed chars means
that as long as the given boundaries and whitespace chars are < #127 (which the
default ones are), UTF-8 words will be parsed correctly, even if they
contain multibyte chars.

I'm not sure if #2 applies also to some other encoding.

UTF-8 uses #128..#255. #0..#127 is plain ASCII like most other
8-bit codepages.
Yes, that's exactly why using block chars instead of allowed chars is better IMHO: since the block chars are all below #127, parsing by them allows UTF-8 words to work in SynEdit now, whereas before the patch any accented char would end up being treated as a block char.
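A hedged Python sketch of the boundary-based approach described above (BLOCK_CHARS here is an illustrative stand-in for SynEdit's TSynWordBlockChars, not its actual contents): because every byte of a UTF-8 multibyte sequence is >= #128, it can never match an ASCII-range block char, so accented words survive intact.

```python
# Hypothetical block-char set; all members are ASCII, i.e. below 128.
BLOCK_CHARS = set(b' \t.,;:"\'()[]{}=+-*/<>!')

def words_by_boundaries(utf8_bytes: bytes) -> list:
    """Split on block chars only. Bytes >= 128 (parts of UTF-8
    multibyte sequences) are never block chars, so multibyte
    characters always stay inside their word."""
    words, cur = [], bytearray()
    for b in utf8_bytes:
        if b in BLOCK_CHARS:
            if cur:
                words.append(cur.decode('utf-8'))
            cur = bytearray()
        else:
            cur.append(b)
    if cur:
        words.append(cur.decode('utf-8'))
    return words

# Accented (2-byte) Slovak characters are kept inside the words:
print(words_by_boundaries('veľká čiara := 1'.encode('utf-8')))
# ['veľká', 'čiara', '1']
```

Note that digits are not block chars either, so "1" correctly counts as a word, unlike under the old alpha-only rule.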

The patch isn't trying to make SynEdit fully UTF-8 valid, but it at least enables word parsing (boundaries, NextWordPos etc.) on UTF-8 text now, with the charset set to UNICODE (I only tested with Slovak, which contains 2-byte sequences only, though). And Lazarus didn't stop working either :D

Mattias
Ales

P.S.: I think SynEdit will need a lot more work to be 100% "UTF-8 ready" on all fronts. All the "set of char" things will have to go, and we'd have to implement UTF-8 "utf8string[x]" operations/functions (AFAIK FPC doesn't have them yet?).
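A hedged sketch of what such a codepoint-indexing helper could look like (Python rather than Pascal; utf8_char_at is a hypothetical name). It relies only on the standard UTF-8 rule that continuation bytes match the bit pattern 10xxxxxx (0x80..0xBF), so walking lead bytes counts codepoints:

```python
def utf8_char_at(buf: bytes, index: int) -> str:
    """Return the index-th codepoint (1-based, like Pascal string
    indexing) from a UTF-8 byte buffer.
    A byte starts a new codepoint iff it is NOT a continuation
    byte, i.e. (b & 0xC0) != 0x80."""
    count, start = 0, None
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:        # lead byte of a new codepoint
            count += 1
            if count == index:
                start = i             # the requested char begins here
            elif start is not None:
                return buf[start:i].decode('utf-8')
    if start is not None:
        return buf[start:].decode('utf-8')
    raise IndexError(index)

print(utf8_char_at('čau'.encode('utf-8'), 1))  # č
```

This is O(n) per lookup rather than the O(1) of plain `s[x]`, which is exactly why such helpers need to be explicit functions instead of the built-in indexing operator.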
_________________________________________________________________
     To unsubscribe: mail [EMAIL PROTECTED] with
                "unsubscribe" as the Subject
   archives at http://www.lazarus.freepascal.org/mailarchives

