On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
> On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <st...@pearwood.info> wrote:

> > If you consider PDF as binary with occasional pieces of ASCII text, then
> > working with bytes makes sense. But I wonder whether it might be better
> > to consider PDF as mostly text with some binary bytes. Even though the
> > bulk of the PDF will be binary, the interesting bits are text. E.g. your
> > example:

    10 0 obj
      << /Type /XObject
         /Width 100
         /Height 100
         /Alternates 15 0 R
         /Length 2167
      >>
    stream
    ...binary image data...
    endstream
    endobj


> > Even though the binary image data is probably much, much larger in
> > length than the text shown above, it's (probably) trivial to deal with:
> > convert your image data into bytes, decode those bytes into Latin-1,
> > then concatenate the Latin-1 string into the text above.
> 
> This is similar to what Chris Barker suggested. I also don't try to be
> difficult here but please explain to me one thing. To treat bytes as if
> they were Latin-1 is bad idea, 

Correct. Bytes are not Latin-1. Here are some bytes which represent a 
word I extracted from a text file on my computer: 

    b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word 
is a C1 control character ("VTS", or Vertical Tabulation Set) followed 
by "u§erst", but it is not. It is actually the German word "äußerst" 
("extremely"), and the text file was generated on a 1990s vintage 
Macintosh using the MacRoman "extended ASCII" code page.
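To make that concrete, here is a minimal sketch that decodes the same seven bytes two ways. It assumes Python 3 and its standard 'mac_roman' codec; the variable names are only for illustration.

```python
# The same seven bytes, interpreted under two different encodings.
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

# Latin-1 never fails to decode, but here it gives the wrong meaning:
# a C1 control character followed by "u§erst".
as_latin1 = raw.decode('latin-1')
assert as_latin1 == '\x8au§erst'

# MacRoman is the encoding the file was actually written in.
as_macroman = raw.decode('mac_roman')
assert as_macroman == 'äußerst'
```

The bytes alone carry no record of which interpretation is right; that knowledge has to come from outside.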


> that's why "%f" got dropped in the first
> place, right? How is it then alright to put an image inside an Unicode
> string?

The point that I am making is that many people want to add formatting 
operations to bytes so they can put ASCII strings inside bytes. But (as 
far as I can tell) they don't need to do this, because they can treat 
Unicode strings containing code points U+0000 through U+00FF (i.e. the 
same range as handled by Latin-1) as if they were bytes. This gives you:

- convenient syntax, no need to prefix strings with b;

- mostly avoid needing to decode and encode strings, except at a 
  few points in your code;

- the full set of string methods;

- can easily include arbitrary octal or hex byte values, using \ooo and
  \xhh escapes;

- error checking: when you finally encode the text to bytes before 
  writing to a file, or sending over a wire, any code-point greater 
  than U+00FF will give you an exception unless explicitly silenced.

No need to wait for Python 3.5 to come out, you can do this *right now*.
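As a rough sketch of the first few points, assuming Python 3 (the "Key:" record layout is made up purely for illustration):

```python
# Text standing in for bytes: every code point stays within U+0000..U+00FF.
# \x00 and \xff are hex escapes; \101 is the octal escape for "A".
record = "Key:\x00\xff\101"

# Ordinary str methods work, and no b prefix is needed.
tag, _, payload = record.partition(":")
assert tag == "Key"
assert payload.endswith("A")

# Encode once, at the boundary where real bytes are required.
wire = record.encode('latin-1')
assert wire == b"Key:\x00\xffA"
```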

Of course, this is a little bit "unclean", it breaks the separation of 
text and bytes by treating bytes *as if* they were Unicode code points, 
which they are not, but I believe that this is a practical technique 
which is not too hard to deal with. For instance, suppose I have a 
mixed format which consists of an ASCII tag, a number written in ASCII, 
a NUL byte as a separator, and some binary data:

# Using bytes
import struct

values = [29460, 29145, 31098, 27123]
# Pack each value as a big-endian signed 16-bit integer.
blob = b"".join(struct.pack(">h", n) for n in values)
data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

=> gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'


That's a bit ugly, but not too ugly. I could write code like that. But 
if bytes had % formatting, I might write this instead:

data = b"Tag:%d\0%s" % (len(values), blob)


This is a small improvement, but I can't use it until Python 3.5 comes 
out. Or I could do this right now:


# Using text
import struct

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
# Decode the packed bytes as Latin-1 so they can be formatted as text.
data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

=> gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I 
encode, and get:

data.encode('latin-1')
=> b'Tag:4\x00s\x14q\xd9yzi\xf3'


which is exactly the same as I got in the first place. In this case, I'm 
not using Latin-1 for its mapping of bytes to characters (e.g. byte 
\xf3 = char ó), but for the useful property that all 256 distinct byte 
values are valid in Latin-1 and survive a decode/encode round trip 
unchanged. Any other encoding with the same property will do.
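That round-trip property can be checked directly. The following is a small sketch, assuming Python 3; it also shows that UTF-8 lacks the property, so it could not be substituted here.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte n to code point n, so nothing is lost or rejected.
as_text = all_bytes.decode('latin-1')
assert [ord(c) for c in as_text] == list(range(256))
assert as_text.encode('latin-1') == all_bytes

# UTF-8, by contrast, rejects many byte sequences outright.
try:
    all_bytes.decode('utf-8')
except UnicodeDecodeError:
    utf8_accepts_everything = False
else:
    utf8_accepts_everything = True
assert not utf8_accepts_everything
```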

It is a little unfortunate that struct gives bytes rather than a str, 
but you can hide that with a simple helper function:

def b2s(raw):
    # Parameter is named "raw" so it doesn't shadow the builtin bytes type.
    return raw.decode('latin-1')

data = "Tag:%d\0%s" % (len(values), b2s(blob))
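Applied to the PDF object quoted earlier, the same approach might look like the sketch below. The 256-byte "image" and the 16x16 dimensions are stand-ins invented for illustration, and the dictionary is trimmed to a few keys; this is not a complete or validated PDF writer.

```python
def b2s(raw):
    """Decode bytes to str so that byte value n becomes code point n."""
    return raw.decode('latin-1')

# Stand-in for real image data; any byte values would do.
image = bytes(range(256))

# Build the whole object as text, with the binary payload spliced in.
obj = (
    "10 0 obj\n"
    "  << /Type /XObject\n"
    "     /Width %d\n"
    "     /Height %d\n"
    "     /Length %d\n"
    "  >>\n"
    "stream\n"
    "%s\n"
    "endstream\n"
    "endobj\n"
) % (16, 16, len(image), b2s(image))

# Encode once, just before writing to disk.
encoded = obj.encode('latin-1')

# One code point becomes exactly one byte, so lengths computed on the
# text are also correct for the encoded output.
assert len(encoded) == len(obj)
assert image in encoded
```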



> Also, apart from the in/out conversions, do any other difficulties come to
> your mind?

No. If you accidentally introduce a non-Latin-1 code point, you'll get 
an exception when you encode. 
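A small sketch of that failure, assuming Python 3: a code point above U+00FF raises UnicodeEncodeError when the text is converted to bytes with the Latin-1 codec. The sample strings are illustrative only.

```python
good = "Tag:4\x00s\x14q\xd9"
bad = good + "\u20ac"   # EURO SIGN, U+20AC, is outside the Latin-1 range

# Text confined to U+0000..U+00FF encodes cleanly.
assert good.encode('latin-1') == b"Tag:4\x00s\x14q\xd9"

# A stray code point is reported, along with its position in the string.
try:
    bad.encode('latin-1')
except UnicodeEncodeError as err:
    caught = err
else:
    caught = None

assert caught is not None
assert bad[caught.start] == "\u20ac"

# The error only goes away if it is explicitly silenced.
assert bad.encode('latin-1', errors='replace').endswith(b"?")
```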


-- 
Steven