On Sun, 21 Feb 2021, Johnny Billquist wrote:
> pdftotext is really ugly if it converts text and creates a stream of bytes, but if it's a Unicode character, it just creates all the bytes required to encode the character. How can you in that case even differentiate between U+6161 and "aa" for example?
I presume, in such a case, that pdftotext will choose the non-surprising behaviour of printing "aa" as "aa" rather than \x{61}\x{61} ;-)
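To make the ambiguity question concrete, here is a small Python sketch (not anything pdftotext itself does; the variable names are only illustrative). It compares the bytes for U+6161 with the two bytes 0x61 0x61, which are the ASCII string "aa". Under UTF-8 the two cannot be confused; under UTF-16BE the raw bytes happen to coincide, so a reader has to know which encoding the stream uses.

```python
# U+6161 is one code point; "aa" is two code points, U+0061 U+0061.
cjk = "\u6161"
aa = "aa"

# UTF-8: U+6161 needs three bytes, all with the high bit set,
# so it can never collide with ASCII text.
cjk_utf8 = cjk.encode("utf-8")
aa_utf8 = aa.encode("utf-8")
print(cjk_utf8.hex(), "vs", aa_utf8.hex())

# UTF-16BE: U+6161 is the two bytes 0x61 0x61, which are also
# the bytes of ASCII "aa". Only the declared encoding tells them apart.
cjk_utf16be = cjk.encode("utf-16-be")
print(cjk_utf16be == aa.encode("ascii"))
```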
> ''Converting Unicode to UTF-8 in a "lossless" manner'' makes no sense. UTF-8 already is Unicode characters.
Well, they're separate things, actually (code points vs. an encoding format)--better discussed in its own thread. You can use other encoding formats for the same Unicode code points: UCS-2, UTF-16, UTF-32, UTF-7, UCS-4, ...
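As a rough illustration of code points versus encoding forms, the following Python sketch encodes one short string with several of Python's built-in codecs and checks that each one round-trips. The sample string and the helper name `encoded_forms` are made up for this example, and UCS-2 is left out because Python ships no codec under that name.

```python
# Codec names as spelled by Python's standard codec registry.
CODECS = ("utf-8", "utf-16-be", "utf-32-be", "utf-7")

def encoded_forms(text: str) -> dict:
    """Return the byte sequence each codec produces for the same code points."""
    return {codec: text.encode(codec) for codec in CODECS}

sample = "A\u00e9\u6161"  # 'A', e-acute, and U+6161
forms = encoded_forms(sample)

for codec, data in forms.items():
    # Same code points in, different bytes out; decoding restores the string.
    assert data.decode(codec) == sample
    print(f"{codec:10} {data.hex()}")
```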
> It's always lossless. You can convert back and forth all day long.
Again, not quite, which is why I put quotes around my "lossless". (Also, my peculiar sense of humour getting in the way of good explanations.) For example, here are 3 different UTF-8 encodings of the same Unicode code point for the ASCII character 'A':

A = 0x41
A = 0xC1 0x81
A = 0xE0 0x81 0x81

Proper implementations of UTF-8 are supposed to handle all 3 (or more!) consistently (the current standard, RFC 3629, requires rejecting the overlong forms), but roll-your-own implementations generally don't, which leads to black-hats cracking your website... (Also an interesting topic better discussed elsewhere.)

-RVP
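The three byte sequences above can be checked with a short Python sketch. `naive_decode_one` is a hypothetical roll-your-own decoder written only for this illustration: it does the bit-shuffling but skips the shortest-form check, so it maps all three sequences to U+0041. Python's own strict decoder accepts only the one-byte form and rejects the two overlong ones.

```python
CANDIDATES = [b"\x41", b"\xc1\x81", b"\xe0\x81\x81"]

def naive_decode_one(seq: bytes) -> int:
    """Hypothetical permissive decoder: no overlong (shortest-form) check."""
    first = seq[0]
    if first < 0x80:                 # 0xxxxxxx: plain ASCII
        cp, extra = first, 0
    elif first >> 5 == 0b110:        # 110xxxxx: one continuation byte
        cp, extra = first & 0x1F, 1
    elif first >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
        cp, extra = first & 0x0F, 2
    elif first >> 3 == 0b11110:      # 11110xxx: three continuation bytes
        cp, extra = first & 0x07, 3
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != extra + 1:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b >> 6 != 0b10:           # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp

def strict_ok(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

for seq in CANDIDATES:
    print(seq.hex(), hex(naive_decode_one(seq)), strict_ok(seq))
```

A filter that inspects the raw bytes for 0x41 before handing them to a decoder like the naive one would miss the two longer forms, which is the kind of gap alluded to above.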