Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Jörg Knappen
 


From a practical point of view, text files contain text that is broken into lines. And by a long-standing tradition,
line breaks are represented differently on different operating systems. Whenever one transfers a text file between
operating systems, the process behind that transfer takes care of converting the line breaks according to the target OS's conventions.

 

Binary files are much simpler: They can just be transferred without converting anything, even between different operating systems.

 

Of course, this does not mean that an executable for one OS remains a valid executable under another OS, but there are lots of non-executable
binaries that are useful independent of the OS (e.g. images, sound files, video files, lots of other application files).

 

So, for a successful file transfer one needs to know whether it is text or binary, and handle it accordingly.
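
As a rough illustration of the point above, here is a minimal Python sketch of what a text-mode transfer does to line breaks, in contrast to a binary transfer that copies the bytes unchanged. The names `convert_line_breaks` and `transfer` are invented for this example, and the sketch assumes an ASCII-compatible encoding such as UTF-8, where CR and LF are the single bytes 0x0D and 0x0A.

```python
import re

# Matches CRLF (Windows), a lone CR (classic Mac OS) or a lone LF (Unix).
# CRLF is listed first so that it is treated as one line break, not two.
_ANY_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def convert_line_breaks(data: bytes, target: bytes) -> bytes:
    """Rewrite every line break in `data` to the `target` convention."""
    # A function is used as the replacement so `target` is inserted literally.
    return _ANY_LINE_BREAK.sub(lambda _match: target, data)


def transfer(data: bytes, mode: str, target: bytes = b"\n") -> bytes:
    """Hypothetical transfer step: 'text' converts line breaks, 'binary' does not."""
    if mode == "text":
        return convert_line_breaks(data, target)
    if mode == "binary":
        return data  # byte-for-byte copy
    raise ValueError(f"unknown mode: {mode!r}")
```

Running a non-text file through the text branch is likely to damage it, because any byte that happens to equal 0x0D or 0x0A gets rewritten. The PNG signature `b"\x89PNG\r\n\x1a\n"`, for instance, does not survive `transfer(..., "text")` intact.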

 

--Jörg Knappen

 

Sent: Friday, 21 February 2020 at 13:21
From: "Costello, Roger L. via Unicode" 
To: "unicode@unicode.org" 
Subject: Why do binary files contain text but text files don't contain binary?




Hi Folks,

 

There are binary files and there are text files.

 

Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ.

 

To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. (Of course, text files may contain a text-encoding of binary, such as base64-encoded text.)

 

Why the asymmetry?

 

/Roger








Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Hans Åberg via Unicode


> On 21 Feb 2020, at 13:21, Costello, Roger L. via Unicode 
>  wrote:
> 
> There are binary files and there are text files.

In C, when opening a file as binary with the function fopen, the newlines are 
untranslated [1]. If not using this option, the file is informally text, which 
means that internally in the program, one can assume that the newline [2] is 
the character U+000A LINE FEED (LF).

1. https://en.cppreference.com/w/cpp/io/c/fopen
2. https://en.wikipedia.org/wiki/Newline
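
A similar distinction can be observed in Python, whose built-in `open` also has a binary and a text mode. The following is only a sketch in Python rather than C, and the helper name `read_both_ways` is made up for the illustration. Note that Python's default text mode is not identical to C's: it translates CRLF and CR to LF on input on every platform, whereas what a C text stream translates depends on the platform.

```python
import os
import tempfile


def read_both_ways(raw: bytes) -> tuple:
    """Write `raw` to a temporary file, then read it back in binary and in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, "rb") as f:
            as_binary = f.read()   # bytes come back untranslated
        with open(path, "r", encoding="utf-8") as f:
            as_text = f.read()     # line breaks are translated to "\n"
    finally:
        os.remove(path)
    return as_binary, as_text


as_binary, as_text = read_both_ways(b"one\r\ntwo\r\n")
print(as_binary)   # the CR LF pairs are still present
print(as_text)     # the program only ever sees LF as the newline
```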





RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Doug Ewell via Unicode
Costello, Roger L. wrote:

> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain newlines,
> tabs, and some other invisible things.
>
> Question: "characters" are defined as only the visible things, right?

In addition to this being incorrect, as Ken and Richard (so far) have pointed out, this isn't the distinction you are looking for.

All file formats contain data which is relevant to that file format. Zip files, executables, JPEGs, MP4s, all contain specific data structured in a specific way. If any of them has that structure interrupted by random bytes, the format has been broken and the file is corrupt.

It is no different for text data, which is expected to contain certain bytes and is normally not expected to be interrupted by a series of ranëH‰UÀHƒÈÿH

Does that help?

--Doug Ewell | Thornton, CO, US | ewellic.org


Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Ken Whistler via Unicode


On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:


> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain newlines,
> tabs, and some other invisible things.
>
> Question: "characters" are defined as only the visible things, right?

No. You've gone astray right there. Please read Chapter 2 of the Unicode 
Standard, and in particular, Section 2.4, Code Points and Characters:


https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564

All of those types of characters can occur in Unicode plain text. (With 
the exception of surrogate code points.)
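
One way to see the surrogate exception in practice: a strict UTF-8 encoder accepts control, format and other invisible characters, but refuses surrogate code points. A small Python sketch follows; the helper name `encodes_as_utf8` is invented here, and Python's built-in strict `utf-8` codec is used as a stand-in for "can occur in Unicode plain text", which is a simplification.

```python
def encodes_as_utf8(ch: str) -> bool:
    """Return True if the single code point `ch` can be encoded as strict UTF-8."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for label, ch in [
    ("TAB (control)", "\u0009"),
    ("ZERO WIDTH SPACE (format)", "\u200B"),
    ("private use U+E000", "\uE000"),
    ("surrogate U+D800", "\ud800"),
]:
    print(f"{label}: {encodes_as_utf8(ch)}")
```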



> I conclude:
>
> Binary files may contain arbitrary text.


Binary files can contain *whatever*, including text.


> Text files may contain binary, but only a restricted set of binary.

The distinction is definitional. A text file contains *only* characters, 
interpretable by a specific character encoding (usually Unicode, these 
days).
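
As a sketch of this definition, one can test whether a byte sequence is interpretable under a given character encoding by attempting a strict decode. The function name `is_text` is an assumption of this example, and the check is only a necessary condition: it says the bytes can be read as characters in that encoding, not that the file was meant as text.

```python
def is_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every byte of `data` decodes under `encoding`."""
    try:
        data.decode(encoding)  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


print(is_text(b"MZ"))                      # the two bytes are the characters M and Z
print(is_text(b"line one\r\n\tline two"))  # newlines and tabs are characters too
print(is_text(b"MZ\x90\x00\x03"))          # 0x90 is a stray continuation byte in UTF-8
```

The answer depends on the encoding named: in an 8-bit encoding such as Latin-1, where every byte value maps to some character, the same check accepts any byte sequence at all.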


But a text file need not be "plain text". An HTML file is an example of 
a text file (it contains only a sequence of characters, whose identity 
and interpretation is all clearly specified by looking them up in the 
Unicode Standard), but it is not *plain* text. It is *rich* text, 
consisting of markup tags interspersed with runs of plain text.


Another distinction that may be leading you astray is the distinction 
between binary file transfer and text file transfer. If you are using 
ftp, for example, you can specify use of binary file transfer, *even if* 
the file you are transferring is actually a text file. That simply means 
that the file transfer will agree to treat the entire file as a binary 
blob and transfer it byte-for-byte intact. A text file transfer, on the 
other hand, may look for "lines" in a text file and may adjust line 
endings to suit the receiving platform conventions.



> Do you agree?


No.

--Ken



Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Richard Wordingham via Unicode
On Fri, 21 Feb 2020 15:53:52 +
"Costello, Roger L. via Unicode"  wrote:

> Based on a private correspondence, I now realize that this statement:
> 
> 
> 
> > Text files do not contain binary  
> 
> 
> 
> is  not correct.
> 
> 
> 
> Text files may indeed contain binary (i.e., bytes that are not
> interpretable as characters). Namely, text files may contain
> newlines, tabs, and some other invisible things.
> 
> 
> 
> Question: "characters" are defined as only the visible things, right?

No, white space (e.g. spaces, tabs and newlines) is normally considered
to be composed of characters.  And then there are much harder to discern
things, such as zero-width spaces, line-break suppressors such as
U+2060 WORD JOINER, and soft hyphens (interpreted as line-break
opportunities).
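
For illustration, the Unicode general category of each of these can be looked up with Python's standard `unicodedata` module. This is only a sketch; the `samples` table and the `describe` helper are made up for the example.

```python
import unicodedata

# Invisible or hard-to-see characters mentioned above.
samples = {
    "space": "\u0020",
    "tab": "\u0009",
    "line feed": "\u000A",
    "zero width space": "\u200B",
    "word joiner": "\u2060",
    "soft hyphen": "\u00AD",
}


def describe(ch: str) -> str:
    """Return the code point and its general category, e.g. 'U+0020 Zs'."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


for label, ch in samples.items():
    print(f"{label}: {describe(ch)}")
```

Every entry has an assigned category (Zs for the space, Cc for the controls, Cf for the format characters), which is to say that each one is an encoded character even though nothing is drawn for it.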

Richard.


RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Based on a private correspondence, I now realize that this statement:



> Text files do not contain binary



is  not correct.



Text files may indeed contain binary (i.e., bytes that are not interpretable as 
characters). Namely, text files may contain newlines, tabs, and some other 
invisible things.



Question: "characters" are defined as only the visible things, right?



I conclude:



Binary files may contain arbitrary text.

Text files may contain binary, but only a restricted set of binary.



Do you agree?



/Roger


From: Costello, Roger L. 
Sent: Friday, February 21, 2020 7:22 AM
To: unicode@unicode.org
Subject: Why do binary files contain text but text files don't contain binary?

Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger


Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread via Unicode

Dear Roger,

because when Unicode is used in real life (UTF-8 etc.), then

  text ⊂ binary

John Knightley

On 2020-02-21 20:21, Costello, Roger L. via Unicode wrote:

Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the
start of Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e.,
bytes that cannot be interpreted as characters. (Of course, text files
may contain a text-encoding of binary, such as base64-encoded text.)

Why the asymmetry?

/Roger




Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Hi Folks,

There are binary files and there are text files.

Binary files often contain portions that are text. For example, the start of 
Windows executable files is the text MZ.

To the best of my knowledge, text files never contain binary, i.e., bytes that 
cannot be interpreted as characters. (Of course, text files may contain a 
text-encoding of binary, such as base64-encoded text.)
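
As an example of the parenthetical remark, here is a minimal Python sketch of carrying arbitrary bytes inside a text file by base64-encoding them. The helper names `to_text` and `from_text` are invented for this illustration; the `base64` module is part of the standard library.

```python
import base64


def to_text(raw: bytes) -> str:
    """Encode arbitrary bytes as a string that uses only ASCII characters."""
    return base64.b64encode(raw).decode("ascii")


def from_text(text: str) -> bytes:
    """Recover the original bytes from their base64 text form."""
    return base64.b64decode(text)


raw = bytes([0x4D, 0x5A, 0x90, 0x00, 0xFF])  # 'MZ' followed by non-text bytes
text = to_text(raw)
print(text)                    # safe to store in a text file
print(from_text(text) == raw)  # the round trip restores the bytes
```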

Why the asymmetry?

/Roger