On Fri, Jun 6, 2014 at 2:54 AM, Rustom Mody <rustompm...@gmail.com> wrote:
> On Thursday, June 5, 2014 9:42:28 PM UTC+5:30, Chris Angelico wrote:
>> On Fri, Jun 6, 2014 at 1:33 AM, Steven D'Aprano wrote:
>> > In the Unix world, text formats and text
>> > processing are much more common in user-space apps than binary processing.
>> > Perhaps the definitive explanation and celebration of the Unix way is
>> > Eric Raymond's "The Art Of Unix Programming":
>> > http://www.catb.org/esr/writings/taoup/html/ch05s01.html
>> Specifically, this from the opening paragraph:
>> Text streams are a valuable universal format because they're easy for
>> human beings to read, write, and edit without specialized tools. These
>> formats are (or can be designed to be) transparent.
> A fact that stops being true when you tie up text with encodings.
> For two reasons:
> 1. The encode/decode pair that maps between byte strings and text
> cannot be a bijection because the byte-string set is larger than the text
> set. This is the error that Armin was hit by.
> 2. Since there is not one but a zillion possible encodings, we are not
> talking about one (possibly universal) data structure but a zillion
> of them: "Text streams are a universal format" - in which encoded
> form of text?
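
For concreteness, here is a minimal Python 3 sketch of the two points quoted above. The sample string and byte values are illustrative assumptions, not taken from the thread, and the "surrogateescape" handler is shown only as one way Python 3 can round-trip undecodable bytes, not as the fix anyone here proposed.

```python
# Illustrative values only; nothing here comes from the original thread.
text = "naïve café"

# Point 2: the same text yields different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
different = utf8_bytes != latin1_bytes

# Point 1: some byte strings are not valid UTF-8, so a strict decode
# cannot map every byte string to text.
bad = b"\xff\xfe\x80"
try:
    bad.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The "surrogateescape" error handler maps each undecodable byte to a
# lone surrogate, so the original bytes can be recovered on re-encoding.
escaped = bad.decode("utf-8", "surrogateescape")
round_tripped = escaped.encode("utf-8", "surrogateescape")

print(different, strict_ok, round_tripped == bad)
```

The strict decode fails on the invalid bytes, while the escaped round trip returns them unchanged, which is the asymmetry point 1 is describing.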
As soon as you store or transmit ANY form of information, you need to
worry about encodings. Ever heard of this thing called "network byte
order"? It's part of taming the wilds of integer encodings. The theory
is that the LC_* environment variables will carry all that crucial
out-of-band information about encodings, and while the practice isn't
perfect, it does still mean that there is such a thing as a text