Is the binaryness/textness of a data format a property?

2020-03-20 Thread Costello, Roger L. via Unicode
Hello Data Format Experts! [Definition] Property: an attribute, quality, or characteristic of something. JPEG is a binary data format. CSV is a text data format. Question #1: Is the binaryness/textness of a data format a property? Question #2: If the answer to Question #1 is yes, then what is

RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Based on a private correspondence, I now realize that this statement: "Text files do not contain binary" is not correct. Text files may indeed contain binary (i.e., bytes that are not interpretable as characters). Namely, text files may contain newlines, tabs, and some other invisible

Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Hi Folks, There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters.
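
A small sketch of the distinction raised in this post, assuming that "text" means "decodes cleanly as UTF-8" (one possible working definition, not one given in the thread). The sample bytes and the helper name looks_like_utf8_text are illustrative, not taken from any real file or library.

```python
def looks_like_utf8_text(data: bytes) -> bool:
    """Return True if every byte sequence in data is valid UTF-8.

    Assumption: "text" is defined here as "decodable as UTF-8".
    Other definitions (ASCII only, printable only) give other answers.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Illustrative bytes that start with the two ASCII letters "MZ",
# followed by bytes that are not valid UTF-8 in this position.
exe_prefix = b"MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff"

# The first two bytes are interpretable as characters.
magic = exe_prefix[:2].decode("ascii")

# The whole prefix is not interpretable as UTF-8 text.
whole_is_text = looks_like_utf8_text(exe_prefix)

# A string with a tab and a newline still decodes cleanly.
plain_is_text = looks_like_utf8_text("naïve\ttext\n".encode("utf-8"))

print(magic, whole_is_text, plain_is_text)
```

Whether a given byte counts as "a character" depends on the encoding the reader assumes, so the result changes if a different definition of text is plugged into the helper.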

A neat description of encoding characters

2019-12-02 Thread Costello, Roger L. via Unicode
From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum, p.74-75 Suppose that the alphabet with which we wish to concern ourselves consists of 256 distinct symbols. Imagine that we have a deck of 256 cards, each of which has a distinct symbol of our alphabet printed on

Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Costello, Roger L. via Unicode
Hi Folks, Today I received an email from the Unicode organization. The email said this: (italics and yellow highlighting are mine) The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and

Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Costello, Roger L. via Unicode
Hello Unicode experts! Which is correct: (a) The input file contains a string. The string is encoded using UTF-8. (b) The input file contains a string. The string is encoded with UTF-8. (c) The input file contains a string. The string is encoded in UTF-8. (d) Something else (what?) /Roger

Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Costello, Roger L. via Unicode
Hello Unicode Experts! As I understand it, endian-ness applies to multi-byte words. Endian-ness does not apply to ASCII characters because each character is a single byte. Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character
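
To make the question concrete, here is a minimal sketch that encodes one character (U+20AC EURO SIGN, chosen only as an example) in several encoding schemes and prints the resulting bytes. It relies on the fact that UTF-8 is defined as a sequence of single bytes in a fixed order, whereas UTF-16 and UTF-32 are built from 16-bit and 32-bit code units that can be serialized in either byte order.

```python
ch = "\u20ac"  # EURO SIGN, used only as an example character

utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
utf32_be = ch.encode("utf-32-be")
utf32_le = ch.encode("utf-32-le")

# UTF-8: three bytes, and there is only one defined order for them.
print("UTF-8    ", utf8.hex(" "))

# UTF-16 and UTF-32: the same code unit, serialized in two byte orders.
print("UTF-16BE ", utf16_be.hex(" "))
print("UTF-16LE ", utf16_le.hex(" "))
print("UTF-32BE ", utf32_be.hex(" "))
print("UTF-32LE ", utf32_le.hex(" "))

# The LE forms are the byte-reversed BE forms for this single code unit.
assert utf16_le == utf16_be[::-1]
assert utf32_le == utf32_be[::-1]
```

There is no "utf-8-be" or "utf-8-le" variant to choose from: a multi-byte UTF-8 sequence is always written lead byte first, followed by its continuation bytes.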

RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Costello, Roger L. via Unicode
Hi Folks, Thank you for your outstanding responses! Below is a summary of what I learned. Are there any errors in the summary? Is there anything you would add? Please let me know of anything that is not clear. /Roger 1. While base64 encoding is usually applied to binary, it is also

Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Costello, Roger L. via Unicode
Hi Unicode Experts, Suppose base64 encoding is applied to m to yield base64 text t. Next, suppose base64 encoding is applied to m' to yield base64 text t'. If m is not equal to m', then t will not equal t'. In other words, given different inputs, base64 encoding always yields different
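
One way to probe the claim is a round-trip check: if base64 decoding always recovers the input bytes, two different byte sequences cannot share one base64 text. The sketch below assumes Python's standard base64 module and assumes both texts are turned into bytes with the same character encoding (UTF-8 here); the helper name b64_of_text and the sample strings are made up for illustration.

```python
import base64


def b64_of_text(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes, then base64-encode those bytes."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


m = "caf\u00e9"        # precomposed e-acute
m_prime = "cafe\u0301"  # e followed by a combining acute accent

t = b64_of_text(m)
t_prime = b64_of_text(m_prime)

# Different code point sequences give different bytes, hence different base64.
assert m != m_prime
assert t != t_prime

# Decoding recovers exactly the original text, which is why two distinct
# inputs cannot map to the same output.
assert base64.b64decode(t).decode("utf-8") == m
assert base64.b64decode(t_prime).decode("utf-8") == m_prime

# Caveat: base64 works on bytes, so the same text under another character
# encoding yields a different base64 text.
assert b64_of_text(m, "utf-16-le") != t
```

Note that base64 is applied to bytes, not to characters, so "different Unicode texts" only has a definite answer once the character encoding is fixed.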

RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hi Folks, Thank you very much for your fantastic comments! Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files. Some questions: - Have I captured all your comments? Any more comments? -

Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hello Unicode Experts! Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence. Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot
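
A minimal sketch of the scenario, assuming the two pieces arrive in order and unmodified. It contrasts decoding each piece on its own (which fails or produces replacement characters) with feeding the pieces to an incremental decoder, using Python's standard codecs module. The split point and the example character are arbitrary choices for illustration.

```python
import codecs

original = "\u20ac"                # EURO SIGN: three bytes in UTF-8
data = original.encode("utf-8")
first, second = data[:1], data[1:]  # split in the middle of the sequence

# Decoding a piece on its own fails, because it is not a complete sequence.
try:
    first.decode("utf-8")
    first_alone_ok = True
except UnicodeDecodeError:
    first_alone_ok = False

# With errors="replace" the information is lost to U+FFFD instead.
lossy = first.decode("utf-8", errors="replace")

# An incremental decoder buffers the incomplete lead byte and finishes
# the character once the remaining bytes arrive.
decoder = codecs.getincrementaldecoder("utf-8")()
restored = decoder.decode(first)
restored += decoder.decode(second, final=True)

print(first_alone_ok, repr(lossy), restored == original)
```

Simply concatenating the pieces before decoding gives the same result as the incremental decoder; the sketch does not cover pieces that are reordered, dropped, or rewritten in transit.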