Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

stuart yeates Sun, 02 May 2010 13:16:12 -0700

Jakob Voss wrote:

Eric Hellman wrote:
May I just add here that of all the things we've talked about in
these threads, perhaps the only thing that will still be in use a
hundred years from now will be Unicode. إن شاء الله
Stuart Yeates wrote:

 > Sadly, yes, I agree with you on this.
 >
 > Do you have any idea how demotivating that is for those of us
 > maintaining collections with works containing characters that don't
 > qualify for inclusion?
May I just add there that Unicode is evolving too and you can help to
get missing characters included. One of the next updates will even
include hundreds of icons such as a slice of pizza, a kissing couple,
and Mount Fuji (See this zipped PDF: http://is.gd/bABl9 and
http://en.wikipedia.org/wiki/Emoji).


Indeed.

These have been included because they are in widespread use in a current
written culture. The problems I personally have are down to characters
used by a single publisher in a handful of books more than a hundred
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.

To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words
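
For readers unfamiliar with the TEI mechanism mentioned above, here is a
minimal sketch in Python. It is not the NZETC pipeline, only an
illustration under assumptions: the text marks the ligature with a
<g ref="..."/> element pointing at a <glyph> declared in <charDecl>, and
that declaration carries a <mapping> (plain-text fallback) and a
<graphic> (image). The id "wh-lig", the file name "wh-lig.png", the
sample document and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample document; the id and image name are invented.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <charDecl>
        <glyph xml:id="wh-lig">
          <mapping type="standard">wh</mapping>
          <graphic url="wh-lig.png"/>
        </glyph>
      </charDecl>
    </encodingDesc>
  </teiHeader>
  <text><body><p><g ref="#wh-lig"/>are</p></body></text>
</TEI>"""


def load_glyphs(root):
    """Collect fallback text and image URL for each declared glyph."""
    glyphs = {}
    for glyph in root.iter(f"{{{TEI_NS}}}glyph"):
        mapping = glyph.find(f"{{{TEI_NS}}}mapping")
        graphic = glyph.find(f"{{{TEI_NS}}}graphic")
        glyphs[glyph.get(XML_ID)] = {
            "text": mapping.text if mapping is not None else "",
            "image": graphic.get("url") if graphic is not None else None,
        }
    return glyphs


def render(elem, glyphs, as_html=False):
    """Flatten an element, replacing each <g> with text or an <img>."""
    fmt = escape if as_html else (lambda s: s)
    parts = [fmt(elem.text or "")]
    for child in elem:
        if child.tag == f"{{{TEI_NS}}}g":
            glyph = glyphs[child.get("ref").lstrip("#")]
            if as_html and glyph["image"]:
                # The alt text carries the plain-text fallback.
                parts.append('<img src="%s" alt="%s">'
                             % (escape(glyph["image"]), escape(glyph["text"])))
            else:
                parts.append(fmt(glyph["text"]))
        else:
            parts.append(render(child, glyphs, as_html))
        parts.append(fmt(child.tail or ""))
    return "".join(parts)


root = ET.fromstring(SAMPLE)
glyphs = load_glyphs(root)
para = root.find(f".//{{{TEI_NS}}}p")
plain = render(para, glyphs)
html_out = render(para, glyphs, as_html=True)
print(plain)
print(html_out)
```

The plain rendering substitutes the two letters "wh" for the ligature,
which suits indexing and search. The HTML rendering keeps the image and
puts the same fallback in the alt attribute. Whether another application
picks up that alt text on paste, or avoids breaking the line at the
image, still depends on the browser, so this does not solve either of
the two problems listed above.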

cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository
