Re: [EPIC] What is a space? And who is using a non-C locale?

Ben Winslow Mon, 29 Nov 2004 16:08:54 -0800

I'm more than a little late here (I've been rather busy lately), and the
decision has already been made.  I don't have any problems with it, but
I'll finally weigh in and make some comments to share the information
(since I'm apparently the EPIC Unicode expert ;).

On Mon, 2004-11-08 at 15:35 -0600, Jeremy Nelson wrote:
> So one of the things that has come up during the epic4 vote is how
> spaces are handled inconsistently throughout epic.  I had originally
> intended to address this in epic5, but it looks like this is a real
> problem for /xdebug extractw users.
> 
> In the C locale, characters 9 (^I), 10 (^J), 11 (^K), 12 (^L), 13 (^M),
> and 32 (space) are considered "spaces".  That means isspace(x) returns
> nonzero for any x that is one of those characters, and 0 for everything
> else.
> 
> EPIC has three ways to determine "spaces"
> 
> 1) Use the system's isspace(), which could be locale dependent.
> 2) Use epic's my_isspace() which behaves exactly the same as isspace()
>    does in the C locale.
> 3) Just compare the character against character 32 (space).
> 
> It would be best if we just had one way, because that would be less 
> confusing.  But this is a problem because in some places in epic, a
> tab is a space, and in other places, it isn't.  So if I just switch
> everything to use isspace(), then some scripts might break in ways
> that I can't anticipate.
> 
> I have two questions:
> 
> *) Are any of you using tabs, newlines, carriage returns, etc, in any
>    way that depends on them not being a "space" character in some context?
>    If you are, you need to be a participant in this discussion!
> 
> *) Are any of you using a non-C locale?  (If you don't know, then you are
>    not)  Does your locale have a different set of space characters?

I'm using en_US.UTF-8 (US English, UTF-8 charset, of course).  In this
locale, with glibc at least, isspace() reports 0x09, 0x0a, 0x0b, 0x0c,
0x0d, and 0x20 as spaces, which matches the C locale.  Unicode has
several additional spacing characters (especially since it adds several
typographic spaces), but isspace() seems to ignore these, presumably so
that old applications won't see different behavior.  Of the additional
ones, only 0xa0 (non-breaking space) even falls within the range of
values isspace() can handle.
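
As a rough illustration of the comparison above, the following C sketch
checks every byte value against a fixed C-locale whitespace rule and
reports where the active locale's isspace() disagrees.  The names
c_locale_isspace() and count_isspace_mismatches() are made up for this
example; the first is only a stand-in for what my_isspace() is described
as doing, not EPIC's actual code.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical stand-in for a C-locale-only whitespace test:
 * true for 0x09..0x0d (tab, LF, VT, FF, CR) and 0x20 (space). */
static int c_locale_isspace(int ch)
{
    return ch == ' ' || (ch >= 0x09 && ch <= 0x0d);
}

/* Walk every value isspace() accepts as an unsigned char and count
 * the ones where the active locale disagrees with the fixed rule. */
static int count_isspace_mismatches(void)
{
    int mismatches = 0;

    for (int ch = 0; ch <= 0xff; ch++) {
        int sys = isspace(ch) != 0;      /* normalize nonzero to 1 */
        int fixed = c_locale_isspace(ch);

        if (sys != fixed) {
            printf("0x%02x: isspace()=%d, fixed rule=%d\n", ch, sys, fixed);
            mismatches++;
        }
    }
    return mismatches;
}
```

A program that never calls setlocale() runs in the C locale, so the
count should be zero there.  Calling setlocale(LC_CTYPE, "") from
<locale.h> before count_isspace_mismatches() would run the same check
against the user's own locale.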

> Feedback for these two questions is greatly appreciated and may reduce
> near term pain and suffering. 

Part of the underlying problem here is that different people use
different character sets everywhere; for example, the author of a script
pack you want to use may be using Shift-JIS for Japanese, while you're
using the near-ubiquitous-for-English ISO-8859-1.  Obviously, a script
shouldn't (mal)function differently just because you're using a
different locale, so script behavior needs to be consistent across
locales.

The second part of the underlying problem is that there is no difference
between the character behavior for scripts and the rest of EPIC,
including user input; if you said 'Hello<EN SPACE>world!' to a channel,
and a script tried to parse that, there's no way for EPIC to decide
"this is a user's input, so we should treat it according to the locale's
rules" or "this is an internal part of a script, so we want to be
certain it functions the same way everywhere."

If you wanted to address this problem (which is obviously something
that's not in the scope of EPIC4), there are a couple of approaches that
are immediately obvious:

1) Define a certain character set in which all scripts are parsed.  I
   imagine that any system on which EPIC still compiles has sufficient
   locale support to handle this.
   Pro: Existing functions could be made locale-aware without scripts
        breaking between different locales.
   Con: Moderately inconvenient if your favorite text editor doesn't
        support on-the-fly character translation.
   Con: May be a bit cumbersome to implement--it's not as simple as
        using iconv() to translate the script from some known charset to
        the current locale's charset, since the script locale may have
        characters that cannot be represented in the current locale.

2) When/if EPIC is ever properly locale aware, add separate locale-aware
   versions of several string functions, enabling scripts to decide on
   their own where locale-specific spaces should have meaning.
   Pro: Fairly simple to implement.  Basically approach #1, with the
        script charset being US-ASCII and all bytes above 127 being left
        as-is.
   Con: Scripters have to specifically support the locale-specific
        functions in their scripts for things to Just Work for an
        end-user.
   Con: The addition of several locale-aware text functions would
        enlarge the function table and increase lookup times.
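
To make approach #2 a little more concrete, here is a minimal C sketch
of one word-splitting routine offered in two flavors: a default that
only honors US-ASCII whitespace and leaves bytes above 127 alone, and an
opt-in variant that defers to the locale's isspace().  Every name here
(ascii_isspace, locale_isspace, count_words, count_words_locale) is
invented for illustration and is not an existing EPIC function.

```c
#include <ctype.h>
#include <stddef.h>

typedef int (*space_fn)(int);

/* Fixed rule: US-ASCII whitespace only, identical in every locale. */
static int ascii_isspace(int ch)
{
    return ch == ' ' || (ch >= 0x09 && ch <= 0x0d);
}

/* Locale rule: whatever the active locale's isspace() reports. */
static int locale_isspace(int ch)
{
    return isspace(ch) != 0;
}

/* Count runs of non-space bytes, using the supplied space test. */
static size_t count_words_with(const char *s, space_fn is_space)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        /* Cast so bytes above 127 are never passed as negative ints. */
        if (is_space((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}

/* Default flavor: behaves the same way for every user. */
static size_t count_words(const char *s)
{
    return count_words_with(s, ascii_isspace);
}

/* Opt-in flavor: a script calls this where locale spaces should count. */
static size_t count_words_locale(const char *s)
{
    return count_words_with(s, locale_isspace);
}
```

With this split, count_words() treats a 0xa0 byte as part of a word
everywhere, while count_words_locale() may split on it in a locale such
as ISO-8859-1 whose isspace() accepts it.  Existing scripts would keep
their behavior, and only scripts that ask for the locale-aware flavor
would see locale-specific results.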

> Jeremy

As I said, this is all pretty well out of the scope of EPIC4; however, I
hope that this information will be helpful during the development of
EPIC5, especially if proper locale support is judged to be a Good Idea.

-- 
Ben Winslow <[EMAIL PROTECTED]>
