It is possible to ensure that an octet represents a graphic of some sort by 
using uselocale() and isprint(), but this does not do any iconv()-style 
conversion that matches up code point names. So what is on the screen will 
still be garbage. It will simply be garbage made of glyphs you are used to 
seeing, rather than some other language's glyphs or the box-drawing glyphs of 
CP437. For non-garbage output the application still needs to set the locale to 
the code page of the data; that code page cannot be assumed.
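
To make the first point concrete, here is a minimal sketch of the 
uselocale()/isprint() filter, assuming a POSIX.1-2008 system. The function 
name mask_nonprintable and the '?' replacement are my own choices for 
illustration. Note that it only masks bytes; it converts nothing, so the 
caller still has to pass the locale that matches the data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: replace every byte that isprint() rejects in the
 * locale named by locale_name with '?'. No iconv()-style conversion is done.
 * Returns 0 on success, -1 if the locale is not available. */
int mask_nonprintable(unsigned char *buf, size_t len, const char *locale_name)
{
    assert(buf != NULL && locale_name != NULL);

    locale_t loc = newlocale(LC_CTYPE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;

    /* Switch this thread's locale, remembering the previous one. */
    locale_t old = uselocale(loc);
    for (size_t i = 0; i < len; i++) {
        if (!isprint(buf[i]))
            buf[i] = '?';
    }
    uselocale(old);   /* restore the caller's locale */
    freelocale(loc);
    return 0;
}
```

Which locale names exist (for example an ISO-8859-1 one) depends on the 
system, so the return value should be checked.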

Also, on a byte-by-byte basis, for a byte-oriented read of a text file, UTF-8 
and ISO-8859-1 are NOT equivalent: UTF-8 has only 95 single-byte graphics, 
with the rest being control codes or unassigned/illegal as standalone bytes, 
versus 191 graphics and 65 control codes in ISO-8859-1. What can be considered 
equivalent is processing a UCS-2 encoded file with a custom getc() that 
ignores all nulls. Then, if the data consists only of UCS-2 code points below 
256, the string read will look as if it were 8859-1 encoded. However, because 
UCS-2 permits arbitrary designation of C0 and C1 sets, some control codes may 
not be the same, and you can therefore still have garbage.
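
The byte counts and the null-skipping read can be sketched as follows. This 
is only an illustration: the function names are mine, the byte ranges are the 
usual ones (0x20-0x7E and 0xA0-0xFF, counting SPACE and NO-BREAK SPACE as 
graphics), and the reader assumes the file really holds UCS-2 code points 
below 256.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if byte b, taken on its own, is a graphic in ISO-8859-1:
 * 0x20-0x7E (95 values) plus 0xA0-0xFF (96 values), 191 in total. */
int is_graphic_latin1(unsigned b)
{
    return (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
}

/* 1 if byte b, taken on its own, is a graphic in UTF-8: only 0x20-0x7E.
 * Bytes 0x80-0xFF are lead bytes, continuation bytes, or never valid. */
int is_graphic_utf8_single(unsigned b)
{
    return b >= 0x20 && b <= 0x7E;
}

/* getc() variant that silently drops NUL bytes. For UCS-2 data whose code
 * points are all below 256, the bytes returned match an ISO-8859-1 read. */
int getc_skip_nul(FILE *fp)
{
    int c;

    assert(fp != NULL);
    do {
        c = getc(fp);
    } while (c == 0);
    return c;   /* a non-NUL byte, or EOF */
}
```

Counting both predicates over 0..255 gives 191 and 95, and the remaining 65 
ISO-8859-1 values are C0, DEL and C1.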

On Tuesday, January 8, 2019 Joerg Schilling 
<joerg.schill...@fokus.fraunhofer.de> wrote:

Robert Elz <k...@munnari.oz.au> wrote:


>    Date:        Tue, 8 Jan 2019 12:51:16 +0100
>    From:        Joerg Schilling <joerg.schill...@fokus.fraunhofer.de>
>    Message-ID:  <5c348eb4.tc7thjo20z6olugw%joerg.schill...@fokus.fraunhofer.de>
>
>  | e.g. because Unicode is "based" on ISO-8859-1 in that the low 256 values
>  | in the UNICODE encoding is identical to the encoding used by ISO-8859-1.
>
> That's not a rational reason for assuming that any data which is not UTF-8
> is 8859-1 (or 10646-1).    If a utf-8 decode fails, the only solution that 
> works reliably is for someone to tell the software which encoding it is
> (well, that's true, even if it is utf-8).


I am talking about text files, and there is a way to verify whether a value is 
a printable ISO-8859-1 character. It is simple to write a program that 
correctly outputs text in either UTF-8 or ISO-8859-1 without being given the 
actual encoding, if you just expect one of these two encodings.
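
One way such a two-encoding program might decide is sketched below, assuming 
the whole text is available in a buffer: accept the data as UTF-8 if it is 
well-formed per RFC 3629, otherwise fall back to ISO-8859-1. The names 
is_valid_utf8 and guess_encoding are illustrative only. It remains a guess, 
since a byte pair such as 0xC3 0xB6 is valid UTF-8 and also two printable 
ISO-8859-1 characters.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s[0..len) is well-formed UTF-8 (RFC 3629), else 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t n;           /* continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (len - i <= n)
            return 0;       /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += n + 1;
    }
    return 1;
}

/* Hypothetical helper: pick between the only two encodings expected. */
const char *guess_encoding(const unsigned char *s, size_t len)
{
    return is_valid_utf8(s, len) ? "UTF-8" : "ISO-8859-1";
}
```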

Jörg

-- 
EMail:jo...@schily.net                    (home) Jörg Schilling D-13353 Berlin
    joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.org/private/ http://sf.net/projects/schilytools/files/


