Re: Spleen with russian (maybe more) cyrillic symbols

ropers Wed, 06 Oct 2021 09:33:27 -0700

Sorry for the repeated noise, but there's one more thing I stupidly forgot:


Whatever subset of legal Unicode characters might be chosen for
inclusion in minimalistic UTF-8 support, it would be VERY important
for the U+FFFD � REPLACEMENT CHARACTER to be included, because that's
Unicode's way of throwing up its hands and saying "I can't even, like,
deal with you anymore."

<https://en.wikipedia.org/wiki/U+FFFD#Replacement_character>

NB: What Wikipedia says about "[a] poorly implemented text editor"
here also applies --mutatis mutandis-- to the console:
Only substitute U+FFFD replacement characters at the point of actual
console output, i.e. when *displaying* text containing characters not
included in the tiny supported Unicode subset.  Don't actually save
the U+FFFD-substituted text, because that would corrupt otherwise good
data.
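
To make that display-versus-storage distinction concrete, here is a
minimal sketch.  It is not taken from any existing console code: the
helper name render_for_console and the SUPPORTED set are made up for
illustration, and the "supported subset" is arbitrarily assumed to be
printable ASCII plus U+FFFD itself.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def render_for_console(text, supported):
    """Return a display-only copy of text; unsupported characters become U+FFFD."""
    return "".join(ch if ch in supported else REPLACEMENT for ch in text)


# Hypothetical supported subset: printable ASCII plus U+FFFD itself.
SUPPORTED = frozenset(map(chr, range(0x20, 0x7F))) | {REPLACEMENT}

stored = "na\u00efve \u0416"  # the data as saved; never modified
shown = render_for_console(stored, SUPPORTED)  # what would reach the screen
```

Only "shown" would ever be handed to the console; "stored" keeps its
original code points, so nothing is lost if a bigger font turns up later.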

(Obviously it would be impossible to use per-code-point .notdef
glyphs, i.e. "tofu blocks" that spell out the missing code point,
because those would not save any glyphs at all but require one for
each of the 1,112,064 legal code points.  If those tofu blocks could
be generated dynamically, things might be different, but
compatibility-wise that all seems very dubious.)

Speaking of compatibility, it's a valid question to ask what should
happen when glyphs cannot necessarily be redefined, e.g. on a serial
console.  I wonder if it might be an option to produce a reverse-video
"?" as a printable ASCII-safe fallback wherever U+FFFD is needed.  The
VT100 has \33[7m and \33[0m to turn reverse video on and all
attributes off again
<https://vt100.net/docs/vt100-ug/chapter3.html#SGR>, but I'm not sure
how consistently those ESCape sequences would work everywhere.
If an ESC sequence is not understood, is it just ignored?  If yes,
then that would make "\33[7m?\33[0m" neatly fall back to "?", which
might be a substitute of last resort for U+FFFD.
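
As a rough sketch of that fallback (the helper name ascii_fallback and
the choice of printable ASCII as the "supported" set are assumptions
for illustration only), the substitution could look like this.  It
does not answer the question of how terminals treat sequences they do
not understand.

```python
ESC = "\x1b"              # the same byte as octal \33
REVERSE_ON = ESC + "[7m"  # SGR 7: reverse video
ATTRS_OFF = ESC + "[0m"   # SGR 0: all attributes off, not just reverse video


def ascii_fallback(text, supported):
    """Display-only: replace each unsupported character with a reverse-video '?'."""
    marker = REVERSE_ON + "?" + ATTRS_OFF
    return "".join(ch if ch in supported else marker for ch in text)


# Assumed supported set for a plain serial console: printable ASCII only.
PRINTABLE_ASCII = frozenset(map(chr, range(0x20, 0x7F)))

line = ascii_fallback("caf\u00e9", PRINTABLE_ASCII)
```

Because SGR 0 clears every attribute, a real console would have to
restore whatever bold/underline state was active before the marker.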


On 05/10/2021, ropers <rop...@gmail.com> wrote:
> On 05/10/2021, ropers <rop...@gmail.com> wrote:
>> This does relate to a question I've been thinking about for a while,
>> so even if actually offering diffs for that is still way above my pay
>> grade, I will offer these thoughts:
>>
>> * Of ASCII's 128 characters, only 95 are actually printable (ASCII
>> sticks 2 thru 7 minus 0x7F DEL).[0]
>> * In principle, the console is capable of supporting 256 glyphs.
>> * With traditional Extended ASCII (EASCII) character sets, more than
>> 95 characters were (still are) printable, but code assuming the use of
>> ISO 8859-1 is deprecated and no longer portable in this age of UTF-8,
>> and for EASCII sticks 8 thru F, there no longer is a direct
>> correspondence between code points and code units at all.
>> * Even if framebuffer console drivers could hypothetically be altered
>> to allow the use of more than 256 glyphs, I completely agree with Ingo
>> that that would be a fairly terrible idea for various reasons.  While
>> the 256-glyphs limitation does stem from VGA console drivers
>> permitting no more than 256 text mode glyphs (or 512 with hacks), it
>> would be best to not totally break framebuffer and vga console
>> compatibility, but to stay within those limits.
>> * With "extremely minimalistic UTF-8 support", up to 161 "spots" might
>> be available.
>> * There are 1,112,064 legal Unicode character code points (0x11 *
>> 0x10000 - 0x800, i.e. seventeen 65,536-character planes minus the
>> 2,048 code points from U+D800 thru U+DFFF that are reserved for UTF-16
>> surrogates).  Of those, 137,468 are private use, and 66 are
>> non-characters.  If we also subtract the 95 printable ASCII
>> characters, that leaves 974,435 characters that might compete for
>> those 161 spots.
>> * There is an extremely strong argument for accommodating all
>> characters from ISO 8859-1 in any future minimalistic UTF-8 console
>> support.  The non-breaking space and soft hyphen could use the same
>> glyphs as space and hyphen-minus, respectively.  This means that to
>> maintain maximum backwards compatibility and UTF-8
>> forward-portability, 94 of those 161 spots would have to be taken,
>> leaving 67.
>> * There might also be a strong argument for accommodating all the
>> characters from ISO 8859-15 (so an additional 8) and Windows-1252,[1]
>> which despite no Unix pedigree is a common superset of ISO 8859-1,
>> with EASCII sticks A thru F being identical to ISO 8859-1.  ISO
>> 8859-15 differs from ISO 8859-1 in that it includes 8 characters in
>> sticks A/B that Windows-1252 encodes in sticks 8/9.  However, with
>> UTF-8, code units and code points no longer match outside of sticks
>> 0-7, so UTF-8 implementers of ISO 8859-1 and Windows-1252 backwards
>> compatibility get ISO 8859-15 support for free.  Besides those 8,
>> Windows-1252 support would consume an additional 19 characters, so
>> we'd have to subtract 27 from those 67 remaining spots, leaving 40.
>> * 32 of those spots are from the C0 control codes from ASCII sticks
>> 0/1.  While Bemer et al. did originally propose alternatively
>> printable glyphs for those normally unprintable characters, their
>> glyphs were never commonly used.  If "maximum printability" is a
>> criterion, Unicode does define so-called "Control Pictures" for them
>> (U+2400 thru U+241F).[2]  It conceivably could be useful to have e.g.
>> a console-based hex editor render something printable for most code
>> units, however attempts at Control Picture inclusion would bump
>> against the technical limitation that the Control Pictures glyphs are
>> already barely legible in X11/xterm: So could actually useful Control
>> Pictures glyphs even be defined if one has just 8x16, 8x14, 8x10 or
>> 8x8 pixels to play with, as may be the case on the console?  It seems
>> doubtful.  Perhaps those sparse spots and precious pixels are better
>> spent on something else, like Cyrillic for example.
>> * The once-common DOS code page 437 has 31 alternatively printable
>> glyphs for sticks 0/1.  Of these, only the bullet point, section sign
>> and paragraph mark (pilcrow) can be found in the ISO 8859/Windows-1252
>> family.  There is no compelling reason --like what's mentioned in
>> footnote [1]-- that could motivate the inclusion of its stick 0/1
>> glyphs, DOS having largely gone the way of the dodo.  Also, full CP437
>> support would require many more glyphs, so support for just this subset
>> of that old code page, never common in Unix-land, would seem wasteful.
>> That still leaves 40 spots that could potentially be used.
>> * The question is, which of the 974,435 candidates deserve one of
>> those 40 spots.  With a look at a relevant map[3], Arabic, Cyrillic,
>> and Indic abugidas might have particularly strong claims.  Arabic has
>> 28 letters, but many contextual variants (though no case); Cyrillic,
>> or more specifically the Russian alphabet, has 33 letters and it does
>> have case, so 40 spots might limit any support to UPPER CASE ONLY, or
>> should I say ЦРРЕЯ СА5Е ОИГУ.  I do not feel I know enough about Indic
>> abugidas to say something intelligent.
>> * The question of what subset of Unicode to settle on for minimalistic
>> 256-glyphs-only UTF-8 support might be bigger than OpenBSD.  Other
>> Unix-like OSes might ask themselves the same question.  Is this
>> something that ought to be standardised across Unix-land or something
>> OpenBSD would want to decide on its own?
>> * I mentioned "512 with hacks" above, but I do not know enough to say
>> whether it could be viable, clean and VGA-compatible to blow past
>> that 256 boundary.  If yes, then an additional 256 spots might
>> comfortably allow for the inclusion of many more of the above.
>> * Either way, even if no code is created at this time, just having a
>> roadmap and knowing which glyphs ought to make the cut might be
>> desirable.  It would also be possible to already make the font(s) once
>> that is known.  Code that actually uses such a font to implement
>> minimalistic UTF-8 support (for the console) need not arrive at the
>> same time.
>> * On the other hand, if extending our minimum character set to cover
>> Windows-1252 and ISO 8859-15 and especially deciding upon the use of
>> the last 40 spots cannot be settled yet, then it might be fine to
>> leave that for later.  The existing ISO 8859-1 fonts could actually be
>> useable by a minimalistic UTF-8 support implementation, if developed.
>> Again, once such an implementation has code points properly divorced
>> from code units, it could absolutely source its glyphs from those
>> fonts.
>> That would only leave the small issue of UTF-8 compliance by
>> everything else in base and ports...
>>
>> I hope that was useful and worth the verbiage.
>>
>> Thanks for your time,
>> Ian
>>
>> (Ian Ropers)
>>
>> Footnotes:
>> [0] Yes, they're properly called sticks.  8 sticks of 16 characters in
>> ASCII; 16 sticks of 16 characters in EASCII.  See Bob Bemer's Inside
>> ASCII.
>> [1] Per enwp.org/CP1252, Windows-1252 text mislabelled as ISO-8859-1
>> is still very common (online), and "[m]ost modern web browsers and
>> e-mail clients treat the media type charset ISO-8859-1 as Windows-1252
>> to accommodate such mislabeling. This is now standard behavior in the
>> HTML5 specification, which requires that documents advertised as
>> ISO-8859-1 actually be parsed with the Windows-1252 encoding."
>> [2] https://en.wikipedia.org/wiki/Control_Pictures
>> [3] https://en.wikipedia.org/wiki/File:Writing_systems_worldwide.png
>>
>>
>> On 05/10/2021, Ingo Schwarze <schwa...@usta.de> wrote:
>>> Hi Slava,
>>>
>>> Slava Voronzoff wrote on Tue, Oct 05, 2021 at 03:01:26PM +0300:
>>>
>>>> I'm working right now on adding cyrillic to Spleen font. How can I
>>>> later add it to OpenBSD kernel and ports? Pull request to main font
>>>> on github (Hi, Frederic) or patch here?
>>>
>>> You cannot add it to the kernel because the kernel does not support
>>> UTF-8, but only US-ASCII, and US-ASCII contains no code points for
>>> cyrillic letters.
>>>
>>> Full UTF-8 support is definitely not wanted in the kernel.  Adding
>>> extremely minimalistic UTF-8 support to the kernel is not completely
>>> out of the question, but some developers are likely to feel sceptical even
>>> about that.  Consequently, trying to pursue a project of adding anything
>>> related to UTF-8 to the kernel is likely to end in frustration if the
>>> person trying that does not have a significant amount of experience with
>>> getting OpenBSD kernel patches committed.
>>>
>>> I'm sorry that i know absolutely nothing about fonts in ports, maybe
>>> someone else can answer that part of the question.
>>>
>>> Yours,
>>>   Ingo
>>>
>>>
>>
>
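
For anyone who wants to recheck the glyph budget from the quoted
message, the arithmetic can be reproduced roughly as follows.  This is
only a sketch: it relies on Python's bundled latin-1, iso8859-15 and
cp1252 codecs as the definition of each character set, and the helper
name "printable" is made up.  The private-use and non-character counts
are simply the figures quoted above.

```python
# Code point and glyph-slot arithmetic from the quoted message.
legal = 0x11 * 0x10000 - 0x800  # 17 planes minus the UTF-16 surrogates
candidates = legal - 137_468 - 66 - 95  # private use, non-characters, printable ASCII


def printable(codec, lo, hi):
    """Characters that codec assigns to bytes lo..hi-1, skipping undefined bytes."""
    out = set()
    for b in range(lo, hi):
        try:
            out.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            continue  # byte has no assignment in this codec
    return out


latin1_all = printable("latin-1", 0xA0, 0x100)
latin1 = latin1_all - {"\u00a0", "\u00ad"}  # NBSP and SHY reuse existing glyphs
latin9_extra = printable("iso8859-15", 0xA0, 0x100) - latin1_all
cp1252_extra = printable("cp1252", 0x80, 0xA0) - latin9_extra

spots = 256 - 95  # glyph slots left after printable ASCII
remaining = spots - len(latin1) - len(latin9_extra) - len(cp1252_extra)
```

With those codecs, "remaining" comes out at the same 40 spots as in
the quoted bullet list (94 + 8 + 19 subtracted from 161).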
