I'm more than a little late here (I've been rather busy lately), and the decision has already been made; I don't have any problems with that description, but I'll finally weigh in here and make some comments to share the information (since I'm apparently the EPIC Unicode expert. ;)
On Mon, 2004-11-08 at 15:35 -0600, Jeremy Nelson wrote:
> So one of the things that has come up during the epic4 vote is how
> spaces are handled inconsistently throughout epic. I had originally
> intended to address this in epic5, but it looks like this is a real
> problem for /xdebug extractw users.
>
> In the C locale, characters 9 (^I), 10 (^J), 11 (^K), 12 (^L), 13 (^M),
> and 32 (space) are considered "spaces". That means isspace(x) returns 1
> for any x = one of those characters, and 0 for everything else.
>
> EPIC has three ways to determine "spaces"
>
> 1) Use the system's isspace(), which could be locale dependant.
> 2) Use epic's my_isspace() which behaves exactly the same as isspace()
> does in the C locale.
> 3) Just compare the character against character 32 (space).
>
> It would be best if we just had one way, because that would be less
> confusing. But this is a problem because in some places in epic, a
> tab is a space, and in other places, it isn't. So if I just switch
> everything to use isspace(), then some scripts might break in ways
> that I can't anticipate.
>
> I have two questions:
>
> *) Are any of you using tabs, newlines, carriage returns, etc, in any
> way that depends on them not being a "space" character in some context?
> If you are, you need to be a participant in this discussion!
>
> *) Are any of you using a non-C locale? (If you don't know, then you are
> not) Does your locale have a different set of space characters?

I'm using en_US.UTF-8 (US English, UTF-8 charset, of course); in this
locale, with glibc at least, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, and 0x20 are
all returned as spaces by isspace() (this matches the C locale).
Unicode has several additional spacing characters (especially since it
adds several typographic spaces), but isspace() seems to ignore these
(well, only 0xa0 (non-breaking space) falls within the range isspace()
can handle), presumably so that old applications won't see different
behavior.
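
To make the three methods concrete, here's a rough sketch of each as a small C function. The function names are made up for illustration; in particular, c_locale_isspace() is only my guess at what epic's my_isspace() amounts to, based on the description above, not a copy of it. Note that the standard only promises that isspace() returns nonzero for a space, not necessarily 1, so the sketch normalizes the result.

```c
#include <ctype.h>

/* 1) The system's isspace(). Its result depends on the current locale, and
 *    the argument must be representable as unsigned char (or be EOF), hence
 *    the cast. It returns nonzero rather than exactly 1, so normalize it. */
static int system_isspace(char c)
{
    return isspace((unsigned char)c) != 0;
}

/* 2) Hypothetical stand-in for epic's my_isspace(): always the C-locale
 *    set, i.e. 9 (^I) through 13 (^M) plus 32 (space), whatever the
 *    current locale happens to be. */
static int c_locale_isspace(int c)
{
    return c == 32 || (c >= 9 && c <= 13);
}

/* 3) Only a literal space (character 32) counts. */
static int is_plain_space(int c)
{
    return c == 32;
}
```

A tab shows the disagreement: in the C locale, the first two functions call it a space and the third doesn't, which is exactly the inconsistency a script can trip over.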
> Feedback for these two questions is greatly appreciated and may reduce
> near term pain and suffering.

Part of the underlying problem here is that different people use
different character sets everywhere; for example, the author of a script
pack you want to use may be using Shift-JIS for Japanese, while you're
using the near-ubiquitous-for-English ISO8859-1. Obviously, it's not
desirable for the script to (mal)function differently just because
you're using a different locale, so consistent script behavior is
desirable.

The second part of the underlying problem is that there is no difference
between the character behavior for scripts and the rest of EPIC,
including user input; if you said 'Hello<EN SPACE>world!' to a channel,
and a script tried to parse that, there's no way for EPIC to decide
"this is a user's input, so we should treat it according to the locale's
rules" or "this is an internal part of a script, so we want to be
certain it functions the same way everywhere."

If you wanted to address this problem (which is obviously something
that's not in the scope of EPIC4), there are a couple of approaches that
are immediately obvious:

1) Define a certain character set in which all scripts are parsed. I
imagine that any system on which EPIC still compiles has sufficient
locale support to handle this.

Pro: Existing functions could be made locale-aware without scripts
breaking between different locales.
Con: Moderately inconvenient if your favorite text editor doesn't
support on-the-fly character translation.
Con: May be a bit cumbersome to implement--it's not as simple as using
iconv() to translate the script from some known charset to the current
locale's charset, since the script locale may have characters that
cannot be represented in the current locale.

2) When/if EPIC is ever properly locale aware, add separate locale-aware
versions of several string functions, enabling scripts to decide on
their own where locale-specific spaces should have meaning.

Pro: Fairly simple to implement. Basically approach #1, with the script
charset being US-ASCII and all bytes above 127 being left as-is.
Con: Scripters have to specifically support the locale-specific
functions in their scripts for things to Just Work for an end-user.
Con: The addition of several locale-aware text functions would enlarge
the function table and increase lookup times.

> Jeremy

As I said, this is all pretty well out of the scope of EPIC4; however, I
hope that this information will be helpful during the development of
EPIC5, especially if proper locale support is decided to be a Good Idea.

--
Ben Winslow <[EMAIL PROTECTED]>
_______________________________________________ List mailing list [EMAIL PROTECTED] http://epicsol.org/mailman/listinfo/list
