David

Thanks for responding. And sorry to all - I should have reported that Javier solved my problem.

Windows 10 has some sort of gotcha with utf16 which it automatically imposes if you don't watch closely. Just transferring a utf8 file using ftp does it. The solution was to edit the file after transfer to Linux and save again as utf8.

Microsoft is the pain, not superscripts.

Cheers

Mike

PS. This Android phone top-posts automatically and I can't be bothered figuring out how to defeat that.

-------- Original message --------
From: David Micallef <d...@montagesoftware.com.au>
Date: 8/3/20 19:31 (GMT+10:00)
To: Melbourne Python Users Group <melbourne-pug@python.org>
Subject: Re: [melbourne-pug] Superscript chars are a pain

Hi Mike

I could be missing something, but is there an opportunity to set the encoding when you're reading the file? The default is utf-8, though you can set the encoding to the actual encoding of the file you are reading. The file could be ISO-8859-1 or another variant.

Cheers

Dave

On Sun, 8 Mar 2020 at 17:45, Mike Dewhirst <mi...@dewhirst.com.au> wrote:

Oh well ... maybe it isn't Python's fault. I just looked at the data input file and found the ³ character in all places had been turned into a box. When I edited the boxes back into ³ it all went well.

I used Filezilla to get the input files across so I'll focus on that next.

Sorry to interrupt your long weekend.

Cheers

Mike

On 8/03/2020 5:30 pm, Mike Dewhirst wrote:

I'm now exclusively Python 3.6+ thank heavens but ...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 6500: invalid start byte

It just so happens that is the superscript 3 character. It also happens that superscript 3 displays correctly and works properly on Windows 10 but causes the above error on Ubuntu 18.04. I'm not paid enough to understand why - hence this email if anyone can help.

My current pain is because I'm pumping data into a database (PostgreSQL) which needs such measures as 5µg/m³ and Python hates me.
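For anyone hitting the same traceback, here is a minimal sketch of what the error is reporting and of the "set the encoding when reading" suggestion. It assumes the input file was saved in a single-byte Windows encoding such as cp1252 (the thread never confirms the actual encoding), and `decode_with_fallback` is a hypothetical helper name, not something from the thread or the standard library. Note also that `open()` without an `encoding` argument uses the locale's preferred encoding, which is typically cp1252 on Windows and UTF-8 on Ubuntu, so that may be why the same file behaved differently on the two systems.

```python
# In Latin-1 / cp1252 the superscript three is the single byte 0xB3.
# In UTF-8 the same character is two bytes, and a lone 0xB3 is invalid.
raw = "5µg/m³".encode("cp1252")          # assumed encoding of the input file
utf8_superscript_three = "³".encode("utf-8")


def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; return (text, encoding_used).

    Hypothetical helper for illustration only. The candidate list is an
    assumption and should be adjusted to the encodings you actually expect.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(raw)
```

When the encoding is known in advance, passing it directly, e.g. `open(path, encoding="cp1252")`, is simpler than guessing. The fallback only helps when files of mixed origin arrive in the same directory.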
I think there is a valid argument for the Python utf-8 codec to "special-case" subscript and superscript numeral unicode collisions with ASCII or whatever Windows 10 uses. That would cover maths and chemistry both. And save me a lot of pain.

Thanks for any sympathy and many, many thanks for help on getting past this.

Cheers

Mike

PS: I use superscript and subscript numbers all the time because I'm involved with chemical data. Here is how I usually deal with it ...

# smart_str replaces smart_text, which was removed in Django 4.0
from django.utils.encoding import smart_str
from django.utils.safestring import mark_safe


def subscript_to_ascii(raw=None):
    """Swap subscript unicode chars into ordinary numbers for synonym
    searches.
    """
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "\u2082":
                    char = "2"
                elif char == "\u2083":
                    char = "3"
                elif char == "\u2084":
                    char = "4"
                elif char == "\u2085":
                    char = "5"
                elif char == "\u2086":
                    char = "6"
                elif char == "\u2087":
                    char = "7"
                elif char == "\u2088":
                    char = "8"
                elif char == "\u2089":
                    char = "9"
                elif char == "\u2081":
                    char = "1"
                elif char == "\u2080":
                    char = "0"
            formula += char
    return smart_str(formula)


def subscript(raw=None):
    """Swap ordinary numbers for subscript unicode chars."""
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "2":
                    char = "\u2082"
                elif char == "3":
                    char = "\u2083"
                elif char == "4":
                    char = "\u2084"
                elif char == "5":
                    char = "\u2085"
                elif char == "6":
                    char = "\u2086"
                elif char == "7":
                    char = "\u2087"
                elif char == "8":
                    char = "\u2088"
                elif char == "9":
                    char = "\u2089"
                elif char == "1":
                    char = "\u2081"
                elif char == "0":
                    char = "\u2080"
            formula += char
    # formula is already a str, so no encode/decode round trip is needed
    return smart_str(formula)


# string literals here; the bare names LC50 and LD50 are not defined
lc50 = subscript("LC50")
ld50 = subscript("LD50")


def safesubscript(raw=None, ascii=False):
    """Uses mark_safe with <sub> tags to subscript instead of unicode chars.

    This looks better on screen but cannot be used in some places.
    """
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                # don't process any more digits, just add to formula
                clear = False  # permits [1] footnote references
            elif char == "]":
                # start processing again
                clear = True
            if clear:
                if char == "2" or char == "\u2082":
                    char = "<sub>2</sub>"
                elif char == "3" or char == "\u2083":
                    char = "<sub>3</sub>"
                elif char == "4" or char == "\u2084":
                    char = "<sub>4</sub>"
                elif char == "5" or char == "\u2085":
                    char = "<sub>5</sub>"
                elif char == "6" or char == "\u2086":
                    char = "<sub>6</sub>"
                elif char == "7" or char == "\u2087":
                    char = "<sub>7</sub>"
                elif char == "8" or char == "\u2088":
                    char = "<sub>8</sub>"
                elif char == "9" or char == "\u2089":
                    char = "<sub>9</sub>"
                elif char == "1" or char == "\u2081":
                    char = "<sub>1</sub>"
                elif char == "0" or char == "\u2080":
                    char = "<sub>0</sub>"
            formula += char
    if ascii:
        # strip the tags again; the closing tag must include its ">"
        formula = formula.replace("<sub>", "").replace("</sub>", "")
    return mark_safe(smart_str(formula))

_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug