On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
> On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> > If you consider PDF as binary with occasional pieces of ASCII text, then
> > working with bytes makes sense. But I wonder whether it might be better
> > to consider PDF as mostly text with some binary bytes. Even though the
> > bulk of the PDF will be binary, the interesting bits are text. E.g. your
> > example:
> >
> >     10 0 obj
> >       << /Type /XObject
> >          /Width 100
> >          /Height 100
> >          /Alternates 15 0 R
> >          /Length 2167
> >       >>
> >     stream
> >     ...binary image data...
> >     endstream
> >     endobj
> >
> > Even though the binary image data is probably much, much larger in
> > length than the text shown above, it's (probably) trivial to deal with:
> > convert your image data into bytes, decode those bytes into Latin-1,
> > then concatenate the Latin-1 string into the text above.
>
> This is similar to what Chris Barker suggested. I also don't try to be
> difficult here but please explain to me one thing. To treat bytes as if
> they were Latin-1 is bad idea,

Correct. Bytes are not Latin-1. Here are some bytes which represent a
word I extracted from a text file on my computer:

    b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word is a
C1 control character ("VTS", or Vertical Tabulation Set) followed by
"u§erst", but it is not. It is actually the German word "äußerst"
("extremely"), and the text file was generated on a 1990s vintage
Macintosh using the MacRoman "extended ASCII" code page.

> that's why "%f" got dropped in the first
> place, right? How is it then alright to put an image inside an Unicode
> string?

The point that I am making is that many people want to add formatting
operations to bytes so they can put ASCII strings inside bytes. But (as
far as I can tell) they don't need to do this, because they can treat
Unicode strings containing code points U+0000 through U+00FF (i.e. the
same range as handled by Latin-1) as if they were bytes.
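The two points above can be checked with a small sketch. It assumes
Python 3 and its bundled 'mac_roman' codec; the variable names are only
illustrative. The first half shows that a Latin-1 decode always succeeds
but does not recover the meaning of the MacRoman word, and the second
half shows that every byte value survives a Latin-1 round trip unchanged.

```python
# Seven bytes written on a Macintosh using the MacRoman code page.
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

# Decoding as Latin-1 never fails, but the result is not the real word:
# it starts with the C1 control U+008A, followed by "u§erst".
as_latin1 = raw.decode('latin-1')

# Decoding with the encoding the file was actually written in
# gives the German word "äußerst".
as_macroman = raw.decode('mac_roman')

# Latin-1 maps byte N to code point U+00NN, so any byte string can be
# carried around as a str and turned back into the same bytes later.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')
assert [ord(c) for c in text] == list(range(256))
assert text.encode('latin-1') == all_bytes
```

Here the Latin-1 string is only a container for byte values; nothing in
it says which characters the bytes were meant to represent.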
This gives you:

- convenient syntax, no need to prefix strings with b;

- mostly avoid needing to decode and encode strings, except at a few
  points in your code;

- the full set of string methods;

- can easily include arbitrary octal or hex byte values, using \ooo
  and \xhh escapes;

- error checking: when you finally encode the text to bytes before
  writing to a file, or sending over a wire, any code point greater
  than U+00FF will give you an exception unless explicitly silenced.

No need to wait for Python 3.5 to come out, you can do this *right now*.

Of course, this is a little bit "unclean", it breaks the separation of
text and bytes by treating bytes *as if* they were Unicode code points,
which they are not, but I believe that this is a practical technique
which is not too hard to deal with. For instance, suppose I have a mixed
format which consists of an ASCII tag, a number written in ASCII, a NUL
separator, and some binary data:

    # Using bytes
    import struct
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

    => gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'

That's a bit ugly, but not too ugly. I could write code like that. But
if bytes had % formatting, I might write this instead:

    data = b"Tag:%d\0%s" % (len(values), blob)

This is a small improvement, but I can't use it until Python 3.5 comes
out. Or I could do this right now:

    # Using text
    values = [29460, 29145, 31098, 27123]
    blob = b"".join(struct.pack(">h", n) for n in values)
    data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

    => gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I
encode, and get:

    data.encode('latin-1')
    => b'Tag:4\x00s\x14q\xd9yzi\xf3'

which is exactly the same as I got in the first place. In this case, I'm
not using Latin-1 for the semantics of bytes to characters (e.g.
byte \xf3 = char ó), but for the useful property that all 256 distinct
bytes are valid in Latin-1. Any other ASCII-compatible encoding with the
same property will do.

It is a little unfortunate that struct gives bytes rather than a str, but
you can hide that with a simple helper function:

    def b2s(raw):
        return raw.decode('latin-1')

    data = "Tag:%d\0%s" % (len(values), b2s(blob))

> Also, apart from the in/out conversions, do any other difficulties come to
> your mind?

No. If you accidentally introduce a non-Latin-1 code point, you'll get an
exception when you encode.

-- 
Steven