On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> I'm sorry, I don't understand what you mean here. I'm honestly not
> trying to be difficult, but you sound confident that you understand what
> you are doing, but your description doesn't make sense to me. To me, it
> looks like you are conflating bytes and ASCII characters, that is,
> assuming that characters "are" in some sense identical to their ASCII
> representation. Let me explain:
>
> The integer that in English is written as 100 is represented in memory
> as bytes 0x0064 (assuming a big-endian C short), so when you say "an
> integer is written down AS-IS" (emphasis added), to me that says that
> the PDF file includes the bytes 0x0064. But then you go on to write the
> three character string "100", which (assuming ASCII) is the bytes
> 0x313030. Going from the C short to the ASCII representation 0x313030 is
> nothing like inserting the int "as-is". To put it another way, the
> Python 2 '%d' format code does not just copy bytes.
>

Sorry, I should've included an example: when I said "as-is" I meant "1",
"0", "0", so that would be your "0x313030".

> If you consider PDF as binary with occasional pieces of ASCII text, then
> working with bytes makes sense. But I wonder whether it might be better
> to consider PDF as mostly text with some binary bytes. Even though the
> bulk of the PDF will be binary, the interesting bits are text. E.g. your
> example:
>
> Even though the binary image data is probably much, much larger in
> length than the text shown above, it's (probably) trivial to deal with:
> convert your image data into bytes, decode those bytes into Latin-1,
> then concatenate the Latin-1 string into the text above.
>

This is similar to what Chris Barker suggested. I'm not trying to be
difficult here either, but please explain one thing to me. Treating bytes
as if they were Latin-1 is a bad idea; that's why "%f" got dropped in the
first place, right? How is it then all right to put an image inside a
Unicode string?
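
To make the two readings of "as-is" concrete, here is a minimal Python 3 sketch. It assumes a big-endian C short for the in-memory form, as in the quoted text, and uses a made-up 256-byte sequence as a stand-in for real image data. It only illustrates the byte values under discussion and the Latin-1 round trip; it is not a recommendation for either approach.

```python
import struct

n = 100

# In-memory form of a big-endian C short: the two bytes 0x00 0x64.
raw = struct.pack(">h", n)

# Textual form, as '%d' produces it: the ASCII characters "1", "0", "0".
text = ("%d" % n).encode("ascii")

print(raw.hex())   # 0064
print(text.hex())  # 313030

# Latin-1 maps every byte value 0..255 to the code point with the same
# number, so arbitrary bytes survive a decode/encode round trip unchanged.
image_data = bytes(range(256))          # hypothetical stand-in for image bytes
as_str = image_data.decode("latin-1")   # a str of 256 code points
round_tripped = as_str.encode("latin-1")

print(round_tripped == image_data)  # True
```

The round trip is lossless, but the resulting str holds code points that merely share numbers with the image bytes, not meaningful text, which is the concern raised in the question above.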
Also, apart from the in/out conversions, do any other difficulties come to
mind?

> Please also take note that in Python 3.3 and better, the internal
> representation of Unicode strings containing only code points up to 255
> (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
> per character.
>

I guess you meant [C]Python...

In any case, thanks for the detailed reply.
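
For reference, the storage claim in the quoted paragraph can be checked with a small sketch. It assumes CPython 3.3 or later (the flexible string representation is a CPython implementation detail, so other interpreters may report different numbers), and `bytes_per_char` is a hypothetical helper name introduced only for this illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string lengths.

    Subtracting the two sizes cancels the fixed object header, leaving
    only the cost of the n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

latin1_cost = bytes_per_char("\xe9")        # U+00E9, fits in Latin-1
bmp_cost = bytes_per_char("\u20ac")         # U+20AC, outside Latin-1
astral_cost = bytes_per_char("\U0001f600")  # outside the BMP

print(latin1_cost, bmp_cost, astral_cost)
```

On CPython this should show the Latin-1-only string costing one byte per character, with wider storage used once any code point above 255 is present.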