Re: Unicode to ASCII

RVP Sun, 21 Feb 2021 19:47:43 -0800

On Sun, 21 Feb 2021, Johnny Billquist wrote:

pdftotext is really ugly if it converts text and creates a stream of bytes, but if it's a Unicode character, it just creates all the bytes required to encode the character. How can you in that case even differentiate between U+6161 and "AA" for example?


I presume, in such a case, that pdftotext will choose the non-surprising
behaviour of printing "AA" as "AA" rather than \x{61}\x{61} ;-)
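
As a rough illustration (a sketch in Python, not anything pdftotext
itself does), the two cases cannot collide in UTF-8 output. Note that
0x61 is lowercase 'a', so the sketch uses "aa" for the two-byte ASCII
case; the variable names are made up for the example.

```python
# U+6161 is one CJK code point; "aa" is two ASCII code points (0x61 0x61).
cjk = "\u6161".encode("utf-8")
ascii_pair = "aa".encode("utf-8")

def hexbytes(raw):
    """Format bytes as space-separated hex for display."""
    return " ".join(f"{b:02x}" for b in raw)

print(hexbytes(cjk))         # e6 85 a1
print(hexbytes(ascii_pair))  # 61 61

# Every byte of a multi-byte UTF-8 sequence has its high bit set,
# so it can never be mistaken for an ASCII byte.
assert all(b >= 0x80 for b in cjk)
assert all(b < 0x80 for b in ascii_pair)
```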

''Converting Unicode to UTF-8 in a "lossless" manner'' makes no sense.

UTF-8 already is Unicode characters.


Well, they're separate things, actually (code points vs. an encoding
format)--better discussed in its own thread. You can use other
encoding formats for the same Unicode code points: UCS-2, UTF-16, UTF-32,
UTF-7, UCS-4, ...
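
A small sketch of that distinction, using Python's built-in codecs (the
choice of sample string and of codecs is mine): the same two code
points come out as different byte sequences under each encoding format,
and each one decodes back to the same text.

```python
text = "A\u6161"  # two code points: U+0041 and U+6161

# Encode the same code points with several encoding formats.
encodings = {
    codec: text.encode(codec)
    for codec in ("utf-8", "utf-16-be", "utf-32-be", "utf-7")
}

for codec, raw in encodings.items():
    # Decoding with the matching codec gives back the original code points.
    assert raw.decode(codec) == text
    print(f"{codec:10} {raw.hex()}")
```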

It's always lossless. You can convert back and forth all day long.


Again, not quite, which is why I put quotes around my lossless. (Also,
my peculiar sense of humour getting in the way of good explanations.)

For example, here are 3 different byte sequences that a careless UTF-8
decoder will turn into the same Unicode code point, U+0041 (ASCII 'A'):

A = 0x41
A = 0xC1 0x81
A = 0xE0 0x81 0x81

Only the first is valid UTF-8. The other two are "overlong" forms, and
proper implementations of UTF-8 are supposed to reject them, but
roll-your-own implementations generally don't--which leads to black-hats
cracking your website... (Also an interesting topic better discussed
elsewhere.)
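
To make the three byte sequences above concrete, here is a minimal
sketch in Python. naive_decode() and strict_ok() are hypothetical
helpers written for this mail, not taken from any real library; the
first handles only 1- to 3-byte sequences and deliberately skips the
overlong check. Python's own strict UTF-8 decoder, as far as I know
following RFC 3629, rejects the two longer forms as ill-formed.

```python
def naive_decode(data):
    """Hypothetical roll-your-own decoder: no overlong-form check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxxxxxx: single byte
            cp, n = b, 0
        elif b >> 5 == 0b110:     # 110xxxxx: one continuation byte
            cp, n = b & 0x1F, 1
        elif b >> 4 == 0b1110:    # 1110xxxx: two continuation bytes
            cp, n = b & 0x0F, 2
        else:
            raise ValueError("lead byte not handled in this sketch")
        for c in data[i + 1 : i + 1 + n]:
            cp = (cp << 6) | (c & 0x3F)   # append 6 payload bits
        out.append(cp)
        i += 1 + n
    return out

def strict_ok(raw):
    """True if Python's built-in strict UTF-8 decoder accepts raw."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

forms = [b"\x41", b"\xc1\x81", b"\xe0\x81\x81"]
for raw in forms:
    print(raw.hex(), [hex(cp) for cp in naive_decode(raw)], strict_ok(raw))
```

The naive decoder reports 0x41 for all three inputs, while the strict
decoder accepts only the first. A filter that looks for a literal byte
(say '/' or '<') before handing the data to a naive decoder can be
bypassed in exactly this way.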

-RVP
