Hmm, you could theoretically assign the chars you need to code points in the Unicode Private Use Area -- and then have your application replace those chars with small images on rendering/display.
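
To sketch what I mean in Python -- the PUA code point (U+E000) and the image filename here are made up for illustration:

    # Made-up mapping: U+E000 in the Private Use Area stands in for the
    # Maori 'wh' ligature, and gets swapped for an image at display time.
    PUA_IMAGES = {
        u"\ue000": u'<img src="wh-ligature.png" alt="wh" />',
    }

    def render_html(text):
        """Replace private-use chars with <img> tags for browser display."""
        for char, img in PUA_IMAGES.items():
            text = text.replace(char, img)
        return text

    print(render_html(u"\ue000akapapa"))  # 'whakapapa' with the ligature image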

This seems as clean a solution as you are likely to find. Your TEI solution still requires chars-as-images for these unusual chars, right? So with regard to copying-and-pasting, browser display, and general interoperability, this is no better than your TEI solution, but no worse either -- it's pretty much the same thing. It may, however, be better on those counts for chars that actually ARE current Unicode codepoints.

If any of your "private" chars later become non-private Unicode codepoints, you could always globally replace your private codepoints with the new standard ones.
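
That fixup is basically a one-liner. E.g. in Python, with U+E000 as your made-up private codepoint and U+ABCD as a stand-in for whatever codepoint eventually gets assigned:

    # 0xE000 -> 0xABCD is purely illustrative; substitute the real
    # assignment if/when the character makes it into Unicode proper.
    PRIVATE_TO_STANDARD = {0xE000: 0xABCD}

    def upgrade(text):
        return text.translate(PRIVATE_TO_STANDARD)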

With 137K "private codepoints" available, you _probably_ wouldn't run out. I think. You could try standardizing these "private" codepoints among people in contexts/communities similar to yours -- it looks like there are several existing efforts to document shared uses of "private codepoints" for chars that do not have official Unicode codepoints. They are mentioned in the Wikipedia article. [Reading that Wikipedia article also taught me something I didn't know about MARC21 and Unicode -- a topic generally on top of my pile these days -- "The MARC 21 standard uses the [Private Use Area] to encode East Asian characters present in MARC-8 that have no Unicode encoding." Who knew?]

Jonathan

Jakob Voss wrote:
Hi Stuart,

> These have been included because they are in widespread use in a
> current written culture. The problems I personally have are down to
> characters used by a single publisher in a handful of books more than
> a hundred years ago. Such characters are explicitly excluded from
> Unicode.

> In the early period of the standardisation of the Māori language there
> were several competing ideas of what to use as a character set. One of
> those included a 'wh' ligature as a character. Several works were
> printed using this ligature. This ligature does not qualify for
> inclusion in Unicode.

That is a matter of discussion. If you do not call it a 'ligature', your chances of getting it included are higher.

> To see how we handle the text, see:
>
> http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html
>
> The underlying representation is TEI/XML, which has a mechanism to
> handle such glyphs. The things I'm still unhappy with are:
>
> * getting reasonable results when users cut-n-paste the text/image HTML
>   combination to some other application
> * some browsers still like line-breaking on images in the middle of words

That's interesting, and reminds me of the treatment of mathematical formulae in journal titles, which mostly end up as ugly images.

In Unicode you are allowed to assign private use characters:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters

The U+200D ZERO WIDTH JOINER could also be used, but most browsers will not support it -- and you need a font that supports your character anyway.

http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx
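
To make that concrete, a little Python sketch -- whether anything actually renders as a ligature depends entirely on the font:

    # 'w' + ZERO WIDTH JOINER + 'h': a hint to render a joined form.
    # Without a font that has the ligature, it falls back to plain 'wh'.
    wh = u"w\u200dh"
    print(wh)       # looks like plain 'wh' in most fonts
    print(len(wh))  # 3 -- the joiner is a real character in the string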

In summary: Unicode covers just a subset of all characters that have ever been used for written communication, and whether a character gets included depends not only on objective properties but also on lobbying and other circumstances. The deeper you dig, the nastier Unicode gets -- as with all complex formats and standards.

Cheers
Jakob

P.S.: Michael Kaplan's blog also contains a funny article about emoji: http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx
