Is the binaryness/textness of a data format a property?

2020-03-20 Thread Costello, Roger L. via Unicode
Hello Data Format Experts! [Definition] Property: an attribute, quality, or characteristic of something. JPEG is a binary data format. CSV is a text data format. Question #1: Is the binaryness/textness of a data format a property? Question #2: If the answer to Question #1 is yes, then what is

RE: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
ble things. Question: "characters" are defined as only the visible things, right? I conclude: Binary files may contain arbitrary text. Text files may contain binary, but only a restricted set of binary. Do you agree? /Roger From: Costello, Roger L. Sent: Friday, February

Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Costello, Roger L. via Unicode
Hi Folks, There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters.

A neat description of encoding characters

2019-12-02 Thread Costello, Roger L. via Unicode
From the book titled "Computer Power and Human Reason" by Joseph Weizenbaum, p. 74-75: Suppose that the alphabet with which we wish to concern ourselves consists of 256 distinct symbols. Imagine that we have a deck of 256 cards, each of which has a distinct symbol of our alphabet printed on

Is the Unicode Standard "The foundation for all modern software and communications around the world"?

2019-11-19 Thread Costello, Roger L. via Unicode
Hi Folks, Today I received an email from the Unicode organization. The email said this: (italics and yellow highlighting are mine) The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and

Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?

2019-05-15 Thread Costello, Roger L. via Unicode
Hello Unicode experts! Which is correct: (a) The input file contains a string. The string is encoded using UTF-8. (b) The input file contains a string. The string is encoded with UTF-8. (c) The input file contains a string. The string is encoded in UTF-8. (d) Something else (what?) /Roger

Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Costello, Roger L. via Unicode
Hello Unicode Experts! As I understand it, endian-ness applies to multi-byte words. Endian-ness does not apply to ASCII characters because each character is a single byte. Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), UTF-32BE and UTF-32LE because each character
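A minimal Python sketch of the byte-order question, assuming the euro sign (U+20AC) as a sample character; the variable names are illustrative only. UTF-8 is defined as a sequence of single-byte code units in a fixed order, so a multi-byte character has exactly one serialization, whereas each 16-bit UTF-16 code unit can be written in either byte order.

```python
text = "\u20ac"  # EURO SIGN, chosen here only as a sample multi-byte character

# UTF-8: three one-byte code units, always in this order.
utf8_bytes = text.encode("utf-8")

# UTF-16: one 16-bit code unit, serialized in two possible byte orders.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

print(utf8_bytes.hex(" "))
print(utf16_be.hex(" "))
print(utf16_le.hex(" "))
```

Reversing the UTF-16BE bytes gives the UTF-16LE bytes for this character; there is no corresponding second form of the UTF-8 bytes.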

RE: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Costello, Roger L. via Unicode
Hi Folks, Thank you for your outstanding responses! Below is a summary of what I learned. Are there any errors in the summary? Is there anything you would add? Please let me know of anything that is not clear. /Roger 1. While base64 encoding is usually applied to binary, it is also

Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-12 Thread Costello, Roger L. via Unicode
Hi Unicode Experts, Suppose base64 encoding is applied to m to yield base64 text t. Next, suppose base64 encoding is applied to m' to yield base64 text t'. If m is not equal to m', then t will not equal t'. In other words, given different inputs, base64 encoding always yields different
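A small Python sketch of the claim, under the assumption that the text is first turned into bytes with a stated character encoding (base64 itself operates on bytes, not on characters). The helper name `b64_of_text` and the sample strings are hypothetical. Because base64 can be decoded back to the exact input bytes, different byte sequences cannot produce the same base64 text; note, though, that the same text under two different encodings also produces different base64 text.

```python
import base64


def b64_of_text(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes with the given encoding, then base64 those bytes."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


m = "resume"
m_prime = "r\u00e9sum\u00e9"  # same word with two accented letters

t = b64_of_text(m)
t_prime = b64_of_text(m_prime)

# Decoding recovers the original bytes, which is why distinct inputs stay distinct.
recovered = base64.b64decode(t_prime).decode("utf-8")

# Same text, different character encodings: the base64 texts differ.
same_text_utf8 = b64_of_text(m, "utf-8")
same_text_utf16 = b64_of_text(m, "utf-16-le")
```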

RE: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hi Folks, Thank you very much for your fantastic comments! Below I summarized the issue and your comments. At the bottom is a set of proposed requirements (for my clients) on applications that receive iCalendar files. Some questions: - Have I captured all your comments? Any more comments? -

Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Costello, Roger L. via Unicode
Hello Unicode Experts! Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence. Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot
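One way to look at the receiving side, sketched in Python with the standard `codecs` incremental decoder. The assumptions are loud ones: the chunks arrive in order, nothing is lost or inserted between them, and the split point (after the first octet of the euro sign) is just an example.

```python
import codecs

encoded = "\u20ac".encode("utf-8")  # three octets: E2 82 AC
first_chunk, second_chunk = encoded[:1], encoded[1:]  # split mid-sequence

decoder = codecs.getincrementaldecoder("utf-8")()

# The incomplete leading octet is buffered; no character is produced yet.
part_one = decoder.decode(first_chunk)

# The remaining octets complete the sequence.
part_two = decoder.decode(second_chunk, final=True)

restored = part_one + part_two
```

Under those assumptions the split is recoverable simply by concatenating the octets before decoding; the sketch says nothing about chunks that are reordered, dropped, or decoded independently.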

Default character encoding for each operating system?

2016-09-15 Thread Costello, Roger L.
Hi Folks, In a book that I am reading [1] the author mentions "the default character encoding for the operating system." What is the default character encoding of: - Windows 10 - Mac OS - Linux /Roger [1] Practical Common Lisp by Peter Seibel, p. 165 (footnote 2).

RE: less-than or equal to with dot in the less-than part?

2016-08-10 Thread Costello, Roger L.
) /Roger -Original Message- From: Andrew West [mailto:andrewcw...@gmail.com] Sent: Wednesday, August 10, 2016 5:08 AM To: Costello, Roger L. <coste...@mitre.org> Cc: unicode@unicode.org Subject: Re: less-than or equal to with dot in the less-than part? On 10 August 2016 at 09:45, Co

less-than or equal to with dot in the less-than part?

2016-08-10 Thread Costello, Roger L.
Hi Folks, Here is the "less-than with dot" symbol: ⋖ Here is the "less-than or equal to" symbol: ≤ I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of

Are there Unicode symbols for parenthesis generator symbols?

2016-06-26 Thread Costello, Roger L.
Hi Folks, In the book Parsing Techniques the authors use a less than symbol with a dot tucked inside for the open parenthesis and a greater than symbol with a dot tucked inside for the close parenthesis. Also, they use an equal sign with a dot over it. You can see the 3 symbols here:

Symbol for an upside down capital L, pointing to the right?

2015-12-25 Thread Costello, Roger L.
Hi Folks, Here is the upside down capital L, pointing to the left: ⅂ - TURNED SANS-SERIF CAPITAL L (U+2142) Is there a symbol for an upside down capital L, pointing to the right? /Roger

Applying Postel's Law to XML, from a Unicode perspective?

2015-06-28 Thread Costello, Roger L.
Hi Folks, Postel's Law says: Be liberal in what you accept, and conservative in what you send. How might Postel's Law be applied to web services that receive XML and send out XML? Here's one idea: a web service is willing to receive UTF-8 XML documents containing a

Unicode Expert's way of Writing Data Specifications?

2015-06-10 Thread Costello, Roger L.
Hi Folks, I seek recommendations from the Unicode experts on how to write data specifications that are precise, from a Unicode perspective. Let's take an example. A (fictitious) data specification says this: The name of the airplane's flight path must take this form: FLTPATH

RE: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)

2015-05-09 Thread Costello, Roger L.
Hi Folks, Just want you to know, this discussion is EXCELLENT. I am learning a lot. Thank you! /Roger

RE: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-08 Thread Costello, Roger L.
? /Roger From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy Sent: Thursday, May 07, 2015 11:08 PM To: Daniel Bünzli Cc: Unicode@unicode.org; Costello, Roger L.; Markus Scherer Subject: Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character

Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-07 Thread Costello, Roger L.
Hi Folks, The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits) However, not every four hex digits corresponds to a Unicode character. Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not
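A rough Python sketch of such a scan, not a reference to any existing tool. The function name `find_suspect_escapes` and the choice of what counts as suspect (unpaired surrogates and the BMP noncharacters U+FDD0..U+FDEF, U+FFFE, U+FFFF) are assumptions made for illustration. It does not check for unassigned code points or for noncharacters reached through a surrogate pair.

```python
import re

# Match any backslash escape so that an escaped backslash (\\) is consumed
# as a unit and a following "uXXXX" is not mistaken for a \u escape.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def find_suspect_escapes(json_text):
    """Return (offset, reason) pairs for \\uXXXX escapes that look suspect."""
    findings = []
    pending = None  # offset of a high surrogate still waiting for its partner
    for match in _ESCAPE.finditer(json_text):
        if match.group(1) is None:
            continue  # some other escape, e.g. \\ or \n
        pos, unit = match.start(), int(match.group(1), 16)
        is_low = 0xDC00 <= unit <= 0xDFFF
        if pending is not None:
            # A \uXXXX escape is 6 characters, so the partner must start there.
            if pending + 6 == pos and is_low:
                pending = None
                continue  # well-formed surrogate pair
            findings.append((pending, "lone high surrogate"))
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = pos
        elif is_low:
            findings.append((pos, "lone low surrogate"))
        elif 0xFDD0 <= unit <= 0xFDEF or unit >= 0xFFFE:
            findings.append((pos, "noncharacter"))
    if pending is not None:
        findings.append((pending, "lone high surrogate"))
    return findings
```

For example, `find_suspect_escapes(r'{"a": "\uD83D"}')` reports the unpaired high surrogate, while the same escape followed directly by `\uDE00` is accepted as a pair.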

Can a single text document use multiple character encodings?

2013-08-28 Thread Costello, Roger L.
Hi Folks, Can a single text document use multiple character encodings? For example, can some text be encoded as UTF-8 while other text is encoded as UTF-16 - within the same document? /Roger

What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Costello, Roger L.
Hi Folks, Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is. Perhaps it is 8859-1, in which case the message consists of four 1-byte characters: C3 = Ã 83 = the “no break here” character C2 = Â B1 = ± Perhaps it
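The competing readings of those four bytes can be tried out directly. This is a small Python sketch; the UTF-8 reading and the "encoded twice" reading are assumptions about where the truncated message is heading, not something the snippet states.

```python
data = bytes.fromhex("C383C2B1")

# ISO-8859-1 maps every byte to exactly one code point, so this never fails.
as_latin1 = data.decode("iso-8859-1")

# UTF-8 reads the same bytes as two 2-byte sequences.
as_utf8 = data.decode("utf-8")

# If the UTF-8 result is itself text that was mistakenly encoded twice,
# undoing one layer (as ISO-8859-1) and decoding as UTF-8 again gives a third reading.
undone_once = as_utf8.encode("iso-8859-1").decode("utf-8")

for label, value in [("iso-8859-1", as_latin1), ("utf-8", as_utf8), ("double-encoded", undone_once)]:
    print(label, [f"U+{ord(ch):04X}" for ch in value])
```

All three decodings succeed, which is the heart of the problem: the bytes alone do not say which reading was intended.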

Are there any pre-Unicode 5.2 applications still in existence?

2013-03-08 Thread Costello, Roger L.
Hi Folks, I have learned that: In some versions prior to Unicode 5.2, conformance clause C7 allowed the deletion of noncharacter code points [1] Are there still in existence applications which delete noncharacter code points from strings? Are there any pre-Unicode 5.2 applications

Can the combining diacritical marks combine with any base character?

2013-02-10 Thread Costello, Roger L.
Hi Folks, Can the combining diacritical marks combine with any base character? For example, consider this character sequence: '' followed immediately by the combining tilde character (U+0303) Is that legal? If yes, wouldn't normalizing this: comment(U+0303) to NFC

RE: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-03 Thread Costello, Roger L.
Hi Folks, Thank you for your excellent responses. Based on your responses, I now wonder why the W3C recommends NFC be used for text exchanges over the Internet. Aside from the size advantage of NFC, there seem to be tremendous advantages to using NFD: - It’s easier to do searches and other

Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread Costello, Roger L.
Hi Folks, The W3C recommends [1] text sent out over the Internet be in Normalized Form C (NFC): This document therefore chooses NFC as the base for Web-related early normalization. So why would one ever generate text in decomposed form (NFD)? Do any programming languages output text
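For readers who want to see the two forms side by side, here is a minimal Python sketch using the standard `unicodedata` module. The sample character (e with acute accent) is an assumption chosen for illustration.

```python
import unicodedata

sample = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

nfc = unicodedata.normalize("NFC", sample)  # composed: one code point
nfd = unicodedata.normalize("NFD", sample)  # decomposed: base letter + combining mark

print([f"U+{ord(ch):04X}" for ch in nfc])
print([f"U+{ord(ch):04X}" for ch in nfd])

# The two forms are canonically equivalent but compare unequal code point by code point.
equal_as_code_points = nfc == nfd
equal_after_normalizing = unicodedata.normalize("NFC", nfd) == nfc
```

The composed form is shorter, while the decomposed form exposes the base letter separately, which is the trade-off these threads are asking about.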

In UTF-16 no codepoints are assigned to D800 - DFFF ... is that range also reserved in UTF-8 and UTF-32?

2013-01-25 Thread Costello, Roger L.
Hi Folks, I am learning how to create variable-length UTF-16 strings using surrogate pairs. Neat stuff. I learned that the range from D800 to DFFF is reserved because it is used to create variable-length UTF-16 strings. Thus, there are no codepoints assigned to the range D800 to DFFF in
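The surrogate arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the second helper only reports how Python's own codecs treat a lone surrogate code point; it is offered as an observation about one implementation, not as a statement of the standard's wording.

```python
def to_surrogate_pair(code_point: int):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def surrogate_rejected(encoding: str) -> bool:
    """Report whether Python's encoder refuses a lone surrogate code point."""
    try:
        "\ud800".encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(surrogate_rejected("utf-8"), surrogate_rejected("utf-32"))
```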

Why are the low surrogates numerically larger than the high surrogates?

2013-01-23 Thread Costello, Roger L.
Hi Folks, The book Unicode Demystified says this (page 190, first paragraph): The surrogate range is divided in half. The range from U+D800 to U+DBFF contains the high surrogates, and the range from U+DC00 to U+DFFF contains the low surrogates. Why are the low surrogates

Are there Unicode processors?

2013-01-07 Thread Costello, Roger L.
Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part such as this part is a start tag and this part is element content) available to XML

If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?

2013-01-06 Thread Costello, Roger L.
Hi Folks, In the book, Unicode Demystified (p. xxii) it says: An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, that if X sorts before Y, then XZ sorts before YZ. This works for English, but fails for many languages.

Why is endianness relevant when storing data on disks but not when in memory?

2013-01-05 Thread Costello, Roger L.
Hi Folks, In the book Fonts & Encodings it says (I think) that endianness is relevant only when storing data on disks. Why is endianness not relevant when data is in memory? On page 62 it says: ... when we store ... data on disk, we write not 32-bit (or 16-bit) numbers but series

What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Costello, Roger L.
Hi Folks, In the book, Fonts & Encodings (p. 61, first paragraph) it says: ... we select a substring that begins with a combining character, this new string will not be a valid string in Unicode. What does it mean to not be a valid string in Unicode? /Roger

Terminology: does the term codepoint apply to non-Unicode character sets?

2013-01-01 Thread Costello, Roger L.
Hi Folks, Does the term codepoint apply to non-Unicode character sets? For example, are there codepoints in iso-8859-1? In Windows-1252? /Roger

Interoperability is getting better ... What does that mean?

2012-12-30 Thread Costello, Roger L.
Hi Folks, I have heard it stated that, in the context of character encoding and decoding: Interoperability is getting better. Do you have data to back up the assertion that interoperability is getting better? Below is a summary of my understanding of interoperability. Would you inform me

When the reader enters the digital space for writing, he participates in the unending ballet between characters and glyphs

2012-12-22 Thread Costello, Roger L.
Hi Folks, I figure the people on this list can truly appreciate this: Homo Sapiens is a species that writes. And among the large number of tools used for writing, the most recent and the most complex is the computer -- a tool for reading and writing, a medium for storage, and a means of

A few questions about encoding discovery, copying text, and pasting text in one encoding into text in another encoding

2012-12-19 Thread Costello, Roger L.
Hi Folks, Newbie here. I have a few questions about encoding discovery, copying text, and pasting text in one encoding into text in another encoding. 1. I open a text editor and then input a text document. How does the text editor discover what the document's encoding is? Is its encoding