Re: UTF-8 locale & POSIX text model

k...@keldix.com Sun, 26 Nov 2017 04:44:55 -0800

On Sun, Nov 26, 2017 at 02:09:21AM +0000, Danny Niu wrote:
> 
> 
> On 26 Nov 2017, at 3:53 AM, k...@keldix.com wrote:
> 
> On Wed, Nov 22, 2017 at 05:43:51PM +0000, Stephane Chazelas wrote:
> 2017-11-22 16:27:15 +0100, Martijn Dekker:
> Op 22-11-17 om 16:02 schreef Geoff Clare:
> Danny Niu <danny...@hotmail.com> wrote, on 22 Nov 2017:
> 
> Q1: What is the rationale for not making POSIX an application of ASCII?
> 
> So that systems which use other encodings (specifically EBCDIC) can
> be POSIX-conforming.  IBM z/OS is certified UNIX 95 and uses EBCDIC.
> 
> But then how should I interpret the table in 6.1 Portable Character Set,
> particularly the UCS column?
> [...]
> 
> It just says those characters are the ones constituting the
> portable character set. It doesn't specify the encoding other
> than it mandates the encoding of those characters to be
> invariant in the charsets in the system's supported locales.
> 
> Well, for EBCDIC this does not hold true over different national variants.
> For example dollar is coded x5b in IBM038 and coded x67 in IBM277.
> 
> For a POSIX system to have a locale where UTF-8 is the charset,
> that means that any other locale charset would have to have the
> same encoding for those characters in the portable character
> set, which happens to be the same as ASCII. That doesn't mean
> that the C locale's charset would have to be a superset of
> ASCII, but that it would have to match ASCII on all the
> characters of the portable character set.
> 
> An EBCDIC-based system cannot have a locale where UTF-8 is the
> charset as there are many characters from the portable character
> set that don't have the same encoding in EBCDIC and UTF-8.
> 
> Depends on the specification on conformance. I think we should lift
> the constraint that all charsets in a given implementation need to have the
> same encoding of the invariant set.
> 
> I might be wrong, but I believe the "invariant" requirement is there so that
> the kernel can safely process pathnames containing <dot> and <slash>
> and terminal characters <newline> and <carriage-return>
> while safely ignoring current locale setting of processes.


Well, the pathname processing should be a function of the filesystem. E.g. if
you have a Windows filesystem or an Apple filesystem mounted on a Linux
operating system, then the file names of the foreign system should be
interpreted as for the originating system in question. I am not sure of the
encoding of filenames on Windows and Apple systems, but their modern default
character encoding is UTF-16.
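
To make the pathname point concrete, here is a minimal sketch of what happens when code that ignores the locale splits a pathname on the ASCII <slash> byte only. The path is made up, Python's cp037 codec merely stands in for "an EBCDIC codeset", and the helper name is hypothetical.

```python
# Illustration only: cp037 stands in for an EBCDIC codeset; the path is made up.
PATH = "usr/local/bin"

def split_on_ascii_slash(raw: bytes) -> list:
    """Split the way locale-unaware code might: on byte 0x2F only."""
    return raw.split(b"\x2f")

utf8_parts = split_on_ascii_slash(PATH.encode("utf-8"))
ebcdic_parts = split_on_ascii_slash(PATH.encode("cp037"))

# In UTF-8 the separator is byte 0x2F, so the components are found.
print(len(utf8_parts), utf8_parts)
# In cp037 <slash> has a different code value, so nothing is split.
print(len(ebcdic_parts), ebcdic_parts)
```

Whether the component lookup lives in the kernel or in the filesystem driver, some layer has to know which byte value means <slash> for the names it is handed.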

> I do however question the nativity of EBCDIC on IBM systems.
> 
> Does z/OS support EBCDIC regular expressions, process pathname and
> kernel configuration strings in EBCDIC, and always have a compilation
> environment where char is unsigned?
> (to hold positive character values as required in section 6.1 bullet point 5)
> 
> I've never had a licensed z/OS on hand, therefore I wouldn't know.
> 
> Then we could have conforming EBCDIC
> systems, and also implementations that can conform using UTF-16 and
> different 8-bit codesets. For instance 'A' is coded x0041 (two bytes) in
> UTF-16 and x41 (only one byte) in cp850 and UTF-8.
> 
> The majority of the programs specified in the standard use byte-oriented
> string coding; the type of argv in main is char*[]; the majority of ustar
> and pax files in the wild store strings in byte-oriented encoding.
> 
> UTF-16 as native coding on *nix systems I believe just causes problems.

I don't have Windows or Apple systems, but they run UTF-16 natively, and recent
Windows 10 systems have a full Linux (Ubuntu) subsystem. I could also see
problems with UTF-16 and POSIX, but at least Apple should have solved that
problem with OS X and iOS.
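
One of the problems with UTF-16 behind a byte-oriented char* interface can be sketched as follows. UTF-16 code units for the portable characters contain a zero byte, and a C string ends at the first zero byte. The argument "ls" and the helper name are made up for the illustration; this is not a claim about how any particular system handles it.

```python
# Illustration only: the argument "ls" and the helper name are made up.

def as_c_string(raw: bytes) -> bytes:
    """Return what a C routine would see: everything before the first NUL byte."""
    return raw.split(b"\x00", 1)[0]

ARG = "ls"
for codec in ("utf-8", "cp850", "utf-16-le"):
    raw = ARG.encode(codec)
    # Show the encoded bytes and the part that survives NUL termination.
    print(codec, raw.hex(), as_c_string(raw).hex())
```

With the two single-byte encodings the argument survives intact; with UTF-16 only the first byte is left, which is why such systems typically convert at the boundary or use wide-character interfaces.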

> ISO 30112 - which has enhanced POSIX locales and charmaps in a backward
> compatible way - has lifted the constraint on the coding of the portable
> character set. This is also easy to code, it costs something like one extra
> indexing instruction. This is not rocket science.
> 
> ISO 30112 also has an elaborated locale covering all of UCS/Unicode,
> but in a coding independent way, so it would work both for utf-8 and utf-16.
> You know, there are a few major systems out there that run utf-18, such as at
> least OS X and IOS.
> 
> The 30112 i18n locale is actually proven technology, and has been used in
> Linux implementations for something like 20 years. Many hundreds of locales
> have been built on it.
> 
> I get confused with this part.
> 1) I'm not entirely sure ISO TR 30112 would be within the scope of POSIX.

30112 should be within the scope of POSIX. The POSIX group and SC22 agreed
many years ago that POSIX did not have the expertise to develop POSIX i18n,
and let SC22 go on doing it. This eventually led to 30112.

> 2) What is with that UTF-18 thing? Byte is defined as octet in POSIX!

Sorry, it was a misspelling, I meant UTF-16.

Keld
