Re: [fricas-devel] Re: TeXmacs format contribution

Waldek Hebisch Tue, 08 Mar 2011 15:00:28 -0800

Martin Baker wrote:

> > 1) It seems that you really ask for "complete" support for Unicode
> > (or at least mathematical subset of Unicode).  Namely, for
> > something better than hack above we really need a worked out
> > design.
> 
> Exactly!
> 
> > For processing Unicode is done and really I see
> > no reason to design something different.
> 
> I don't quite understand this. Just so that I am completely clear
> about this, are you saying that SPAD and Lisp can handle Unicode but
> command line I/O is the only thing preventing this?
>


Not exactly.  First, the problem of processing "rich" character data
appeared long ago and various approaches were tried.  One worth
mentioning was based on codepage switching: there are special
sequences of characters which signal a change of the character
set in use.  That way it is possible to switch between 8-bit and
16-bit encodings (and theoretically 24-bit and wider ones are
possible).  Even with 8-bit encodings it is possible to
use a rich character set.  The Unicode movement began due to
dissatisfaction with codepage switching -- one of the design goals
of Unicode was to avoid escape sequences and/or codepage switching.
In other words, Unicode characters were supposed to have the
same meaning regardless of neighbouring characters -- note
that when escape sequences are in use you need to scan the whole
string to determine if a given character is part of an escape
sequence (or if its meaning is modified by an escape sequence).

So, what is wrong with escape sequences?  Well, due to their
stateful nature they very much complicate processing.  To
get some feel for the difficulties you may look at the recent
changes to the axiom/fricas script.  The task was trivial:
get a character string from one command line and put it
in another command line.  The problem was that on the
way escape sequences were transformed ("expanded") by the
shell and we had to effectively undo this transformation.
The resulting code is not very complicated, but it went
through a few buggy iterations.

Why the rant above?  As I wrote, escape sequences are a
necessary evil when communicating via limited media,
but should be avoided (at least logically) during
processing (note that with escape sequences trivial tasks,
like replacing all occurrences of the substring 'pha' in a given
string by some other substring, become not so trivial).
For processing one should use a format which is more or
less free from escape sequences.  Unicode seems to be
quite good in this respect.  Now, Unicode is not
entirely free from escape sequences: there are so-called
combining characters.  Moreover, UTF-8 uses
multiple octets to encode a single Unicode code point
and consequently the meaning of a given octet in a UTF-8
encoded string may depend on previous octets.
However, Unicode (and UTF-8) were specially designed
to avoid various bad effects.  For example, the
second octet of a multi-octet character can not occur
alone: just seeing this octet you know that there
must be an octet before it belonging to the same character.
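
The last property can be sketched in a few lines of Python.  The
helper names is_continuation and char_start are made up for this
illustration; the only fact relied on is that UTF-8 continuation
octets have the bit pattern 10xxxxxx.

```python
def is_continuation(octet: int) -> bool:
    """True for the second and later octets of a multi-octet character."""
    return octet & 0xC0 == 0x80


def char_start(data: bytes, i: int) -> int:
    """Index of the first octet of the character that contains data[i]."""
    # Only local inspection is needed: step back over continuation
    # octets.  No scan from the beginning of the string is required.
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "αβγ".encode("utf-8")   # each of these letters takes two octets
print([is_continuation(b) for b in data])
print(char_start(data, 3))
```

Contrast this with an escape-sequence encoding, where deciding what
a given octet means may require knowing the state set up arbitrarily
far back in the string.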

Writing "Unicode is done" I meant that it makes no sense to
invent a system of escape sequences to encode special characters
_during processing_.  Note: character entities on Web pages
are logically a pure input/output mechanism.  Logically, when
a browser gets a page, character entities are replaced by characters.
Then the page is parsed to get a tree and all interesting things
happen at the level of the parse tree (or above).
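
A small Python sketch of "decode at input, then process", using the
standard html.unescape function.  The sample string and the
replacement text "X" are arbitrary; the example only relies on the
fact that an entity such as &alpha; happens to contain the letters
'pha'.

```python
import html

raw = "&alpha; phase"          # text as it arrives, entities not yet decoded

# Processing the escaped form directly corrupts the entity.
naive = raw.replace("pha", "X")

# Decoding first turns the entity into one character, after which
# substring replacement is an ordinary string operation.
decoded = html.unescape(raw)
correct = decoded.replace("pha", "X")

print(naive)
print(correct)
```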

Concerning Unicode in Spad: UTF-8 can be used in any 8-bit
clean system.  In particular using UTF-8 we can process
Unicode in non-Unicode aware Lisp.  And Unicode aware Lisp
by definition can handle Unicode.
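
As a rough illustration of the "8-bit clean" point, here is a Python
sketch that works purely on octets, the way a non-Unicode aware
system would.  The sample text is arbitrary.  Because no UTF-8
encoded character can match in the middle of another one, an
octet-level substring replacement gives the same result as a
character-level one.

```python
text = "alpha α-phase"
octets = text.encode("utf-8")        # what an 8-bit system would see

# Replace at the octet level, never decoding to characters.
replaced_octets = octets.replace(b"pha", b"PHA")

# Reference result computed on real Unicode characters.
replaced_text = text.replace("pha", "PHA")

print(replaced_octets.decode("utf-8") == replaced_text)
```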

There are three problems with this.  One is that all
Unicode aware Lisps use UTF-32 encoding, which is different
from UTF-8.  Consequently we need different code when
using a Unicode aware Lisp compared to a non-Unicode aware Lisp.
The second problem is that Lisp typically performs some recoding
on input/output and rejects invalid encodings.  This is a problem
because, at least theoretically, the needed encoding may not be
installed.  Also, at least clisp does not allow changing the
encoding on an already open stream, which means that for example
on standard input we are stuck with the encoding which was
active when clisp started up.  The third problem is that the
current Spad character type is 8-bit.  This means that
using UTF-32 encoding we currently can not extract characters
from strings without risking out-of-range character codes
(but if instead of characters one uses one-character long
strings things should work fine).  For UTF-8, 8-bit
"characters" would work, but actually they would not
be Unicode characters but octets.
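
The difference between code points and octets can be seen in a short
Python sketch (Python strings are sequences of code points, so they
stand in here for a Unicode aware Lisp; the sample string is
arbitrary).

```python
s = "∫x²"

code_points = [ord(c) for c in s]     # one entry per Unicode character
octets = list(s.encode("utf-8"))      # one entry per UTF-8 octet

# Code points that would not fit in an 8-bit character type.
out_of_range = [cp for cp in code_points if cp > 0xFF]

# One-character strings sidestep the character type entirely.
pieces = [s[i:i + 1] for i in range(len(s))]

print(len(code_points), len(octets))
print([hex(cp) for cp in out_of_range])
print(pieces)
```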

Bottom line: to use Unicode we need to:

- decide what to do with the Spad character type (extending it to
  the 21 bits needed for UTF-32 looks trivial, but there may be
  some hidden gotcha).  In particular we need to decide if a
  Spad character corresponds to a Unicode code point or to
  units of encoding (that is, octets in the case of UTF-8).
- normalize ways of creating Unicode strings/characters.  In
  particular how to represent them in source code.
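
For the first point, a quick Python check of the sizes involved.
This only verifies the arithmetic; it says nothing about how the
Spad character type is actually implemented.

```python
MAX_CODE_POINT = 0x10FFFF          # largest Unicode code point

# Number of bits needed to hold any code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Which sample characters would fit in an 8-bit character type?
samples = "aé∫"
fits_in_8_bits = [ord(c) < 256 for c in samples]

print(bits_needed)
print(fits_in_8_bits)
```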

One possible way is to declare that we support Unicode only
in Unicode aware Lisps, extend Spad character type to 21 bits
and invent some notation to put Unicode characters in strings
(and maybe identifiers).

> Although this is non-ideal, I can certainly live with this issue (by
> using the hacks that you mentioned and if required I can hack my own
> local copy of html formatter) however I would be quite surprised if I
> am the only one who has wanted to uses non-ascii characters with
> FriCAS. Is there a list of known issues for new users where all these
> workarounds could be put in one place?
> 
> Come to think of it, is there a wish-list for FriCAS? I could think of
> a long list, but perhaps I should keep quiet, I suspect you won't
> agree with most of them?

I certainly want to know what wishes others have.  As Bill wrote,
a good place to put them is the Axiom wiki.

-- 
                              Waldek Hebisch
