On 06/11/2023 02:19, Rob Arthan wrote:

One bit of burrowing inside instigated by the developers rather than the users 
began a few years ago: an experimental port of ProofPower to use Unicode and 
UTF-8 for mathematical symbols in place of the ad hoc character set that we had 
to invent for ourselves years ago. Roger Jones did some excellent work towards 
this end and made a version of the document processing tools that would compile 
all the ML code from UTF-8 sources using the old character set as an 
intermediary. I ported the user interface program xpp based on his work, but it 
has proved problematic to port all that we did to Linux: we were developing on 
Mac OS and, sadly, the Free BSD wchar_t libraries are much less restrictive 
than the Gnu libraries and it seems to need a lot more work on the document 
processing utilities (that are written in C) to get round this.

I first heard of "Unicode" around 1990. Everybody was talking about "wide characters" (16-bit) and making moves to support them (e.g. the major Unix vendors and Microsoft). That was the time of the now-obsolete UCS-2 encoding: "16 bits should be sufficient for everyone, i.e. for every imaginable character on the planet".

Many years later, everybody started to use UTF-8, and the meaning of "Unicode" changed quite a bit. The "UTF-8 manifesto" at http://utf8everywhere.org helps to understand the situation in retrospect. In particular, "wide characters" are now mostly obsolete, although we still see them as building blocks for UTF-16 (which is not the same as 16-bit Unicode). Programming languages with wide characters carry unnecessary baggage. E.g. see this side remark (concerning Python) in the text:

"""
Furthermore, we would like to suggest that counting or otherwise iterating over Unicode code points should not be seen as a particularly important task in text processing scenarios. Many developers mistakenly see code points as a kind of a successor to ASCII characters. This lead to software design decisions such as Python’s string O(1) code point access. The truth, however, is that Unicode is inherently more complicated and there is no universal definition of such thing as Unicode character. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language for that. On the other hand, seeing UTF-8 code units (bytes) as a basic unit of text seems particularly useful for many tasks, such as parsing commonly used textual data formats. This is due to a particular feature of this encoding. Graphemes, code units, code points and other relevant Unicode terms are explained in Section 5. Operations on encoded text strings are discussed in Section 7.
"""

(Maybe this helps to prove that the mailing list is alive.)


        Makarius


_______________________________________________
Proofpower mailing list
[email protected]
http://lemma-one.com/mailman/listinfo/proofpower_lemma-one.com