Hi,
MULE_INTERNAL solved a really hard problem years ago and must have
been extremely useful, but I think we might be able to drop it now,
and I have a patch. If I am wrong about that and there are users who
would object, then we should probably improve it instead, and I have
some ideas (part of larger reworkings), but first I'd like to
establish whether it is already completely obsolete.
This history may be very well known to hackers in Japan, but I had to
start from zero with my archeologist hat on, and I suspect this is as
obscure to many others as it was to me, so here's what I have come up
with:
In the early nineties (perhaps beginning in the late 80s?),
researchers at AIST developed the MULE "meta-encoding" for Nemacs
(Nihon Emacs), later merged into XEmacs and GNU Emacs. Unlike the
early fixed-width 16-bit versions of Unicode, Emacs' internal encoding
was multi-byte and backward-compatible with ASCII and with traditional
in-memory and on-disk representations of text. Aside from lacking a
multi-byte encoding, early versions of Unicode apparently also failed
to cover all the CJK characters needed by information systems of the
time.
It's a simple and clever idea, just messy in the details and a little
inefficient: each byte is either ASCII or a lead byte that says which
encoding follows (perhaps with light re-encoding/escaping in some
cases, IDK), so except for ASCII it was always at least one byte less
efficient than whatever it wrapped, but there was nothing it couldn't
handle. It could mix around 41 encodings this way, so for
the first time you could have (say) Chinese and Arabic in one document
in a multi-byte format compatible with traditional conventions.
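To make the framing concrete, here's a toy sketch of the lead-byte
idea in Python. The lead-byte value and helper are invented for
illustration only (the real assignments live in our pg_wchar.h, and
the real scheme has more wrinkles):

```python
# Toy sketch of the MULE_INTERNAL framing idea -- NOT the real
# implementation; the lead-byte value below is invented.  Each
# non-ASCII character gets a lead byte naming its sub-encoding,
# followed by that encoding's own bytes unchanged.

LEAD_JISX0208 = 0x92   # hypothetical lead-byte value

def mule_wrap(items):
    """items: str for ASCII runs, or (lead_byte, raw_bytes) tuples
    for characters in some wrapped legacy encoding."""
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            out += item.encode("ascii")   # ASCII passes through as-is
        else:
            lead, raw = item
            out.append(lead)              # +1 byte of overhead per character
            out += raw                    # the legacy bytes, unchanged
    return bytes(out)

# "A" plus one JIS X 0208 character (2 bytes in the legacy encoding)
# costs 1 + (1 + 2) = 4 bytes here.
wrapped = mule_wrap(["A", (LEAD_JISX0208, b"\x30\x21")])
```

That per-character lead byte is where the "always less efficient by at
least one byte" overhead above comes from.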
The idea doesn't seem to have been adopted by any other software
except PostgreSQL (at least that I could find in quick searches; I'd
be interested to hear of any others). That's probably because Unicode
gained UTF-8 only a bit later in 1993, providing the missing
multi-byte encoding. Instead of referencing 41 other moving
standards, it was one unified standard with full international
industry support, and it fitted neatly into C strings and existing
text file conventions (not to mention other design goals like
self-synchronisation). The rest is history.
Our implementation of MULE_INTERNAL only supports a few sub-encodings,
for Latin, Cyrillic, Chinese, Japanese and Korean, and hasn't been
updated to support modern versions of the CJK ones (ie when we got
EUC_JIS_2004, we didn't handle the corresponding MULE_INTERNAL lead
byte, and I haven't checked the Chinese or Korean situation), which I
suspect might be an actionable clue that it is not in use... but I
lack the context to say, that's a hypothesis. Our code references the
XEmacs project's internals documentation, last published in 1997, with
a note added in 2012 that we'd started following GNU's implementation
instead, which I think means that mule-conf.el[1] is the closest thing
to a standard. We added some more IDs as they were assigned, but they
remain unimplemented. (If we actually do need to keep this, perhaps
our implementation could dispatch to our "direct" encoding routines
instead of open-coding the sub-encodings? That might be hopelessly
naive, and I can see the combination problem we have that they haven't
had since Emacs 23 (they only convert to/from Unicode now). It's just
a thought, but I think something like that would be closer to what
Emacs is doing, IIUC.)
Modern GNU Emacs switched to using UTF-8 internally[2] as of Emacs 23
(2009). It can still convert what it calls "Emacs 21 internal format"
when loading a file, but I suspect we might be the last ones to
support the idea directly as an internal representation.
Emacs' internal representation (both old and new) is technically a
superset of Unicode, as they are proud to say, but AFAICT that just
means you're free to map your made-up script's made-up encoding into
the 5-byte UTF-8 sequence space not used by Unicode (or, in the old
system, into private lead bytes), not anything actually useful for
our purposes. And if you just want to put your Klingon or Tolkien
our purposes. And if you just want to put your Klingon or Tolkien
elvish homework into PostgreSQL, see the ConScript Unicode Registry,
it'd use less disk space! More seriously, I think there have been
periods when eg JIS rolled out a new standard with characters that
Unicode didn't have yet. Unicode simply added them to a minor release
(eg 3.2), but for a short time you could have said that Unicode was
not a superset or theoretically sufficient. On the other hand,
PostgreSQL wouldn't stop you using such hypothetical characters
anyway: our UTF-8 validation is for well-formedness, not definedness.
There may of course be implications for sorting and classifying, but
all of that seems a bit bogus: we stopped updating MULE_INTERNAL even
for Japanese, we routinely upgrade Unicode, and locales never worked
for MULE_INTERNAL anyway. I also doubt very much that Unicode would
be out of the loop on new character assignments in modern times.
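A quick way to see the well-formedness-not-definedness point: a
codepoint that is unassigned in current Unicode versions (U+3FFFD,
in the plane reserved for future CJK extensions, is just an example
I picked) still round-trips as perfectly valid UTF-8, because
validation doesn't consult the character database:

```python
# Sketch: UTF-8 validation checks structure, not assignment.  A
# well-formed sequence for an unassigned codepoint encodes and
# decodes fine; only the character database knows it has no name.
import unicodedata

cp = "\U0003FFFD"                       # unassigned as of current Unicode
encoded = cp.encode("utf-8")            # a well-formed 4-byte sequence
assert encoded.decode("utf-8") == cp    # ...and it decodes back unchanged

try:
    unicodedata.name(cp)                # the database has no name for it
    assigned = True
except ValueError:
    assigned = False
```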
As for interchange and system boundaries, (1) standard locales on real
systems don't come in MULE_INTERNAL encodings so none of that stuff
works, (2) the JDBC driver and presumably any driver/language that has
its own firm ideas about strings can't support it either, (3) even
applications using libpq would be hard-pressed to know what text
actually means outside ASCII, if they choose it as a client encoding,
except perhaps Emacs if you're lucky.
The motivation for removing it would be the unnecessary security
risks and the maintenance burden for future development of our encoding
and locale support. The motivation for keeping it would be that there
are users with important data trapped in it. In the absence of hard
data, I tried to imagine why you'd want to use it, other than perhaps
just "we needed it in 199x and haven't migrated yet". I don't know
too much about CJK computing but I am aware of the space issue:
commonly used CJK characters take 3 UTF-8 bytes to represent, one more
than the national EUC_* encodings. That's a motivation for preferring
EUC_*, but let's see how MULE_INTERNAL compares:
                                        kanji  kana
  MULE_INTERNAL-wrapped-JISX0208/0212:      3     3
  MULE_INTERNAL-wrapped-JISX0201K:        N/A     2
  UTF8:                                     3     3
  EUC_JP:                                   2     2
  EUC_JIS_2004:                             2     2
Since there are two encodings for kana characters and MULE's
superpower is switching, I guess it depends on how you choose to
encode it and what your ratio of kana to kanji is. Google gives me a
first guess of 50/50. I see that the sjis2mic() conversion is clever
enough to use JISX0201K for kana, so if your client is speaking SJIS
then I suppose you might actually finish up with around 2.5 bytes per
character. That's smaller than UTF-8, and larger than EUC_*. On the
other hand, EUC_JIS_2004 handles more Japanese characters, and UTF-8
handles all of the world's scripts. So *maybe* there is a small
motivation there, depending on what you think about JIS 2004. I
somehow doubt the trade-off makes sense in practice, though: you'd be
forever dealing with weird problems whenever some guy called "凜" (to
pick an example character I googled that is common but missing in the
older standard) needs to appear in your data, if I understood all of
that correctly.
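For what it's worth, the arithmetic behind that ~2.5 figure, under the
50/50 guess above and with kana wrapped as JISX0201K (per-character
costs taken from the table; the row labels are just my shorthand):

```python
# Back-of-envelope check of the ~2.5 bytes/character estimate,
# assuming a 50/50 kana/kanji mix.
costs = {
    # (kanji bytes, kana bytes) per the table above
    "MULE_INTERNAL (0208 kanji + 0201K kana)": (3, 2),
    "UTF8":   (3, 3),
    "EUC_JP": (2, 2),
}

kana_ratio = 0.5  # assumed mix
averages = {name: (1 - kana_ratio) * kanji + kana_ratio * kana
            for name, (kanji, kana) in costs.items()}
# MULE_INTERNAL lands at 2.5, between EUC_JP's 2.0 and UTF-8's 3.0
```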
For Chinese, the calculus is simpler, as they only use hànzì (~=
kanji), with nothing potentially smaller, like kana, to affect the
average.
For Korean, I have no clue.
Can any Japanese (or other) experts offer any clues? Concrete questions:
* Is anyone actually using MULE_INTERNAL today?
* If so, what prevented migration?
* Was it ever actually used outside Japan?
* Is the lack of interest in the new (22-year-old) JIS standard in
MULE_INTERNAL meaningful?
[1]
https://github.com/emacs-mirror/emacs/blob/master/lisp/international/mule-conf.el
[2]
https://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html