Hi Mike,

I could be missing something, but is there an opportunity to set the encoding when you're reading the file? The default on Ubuntu is normally utf-8, but you can set encoding to the actual encoding of the file you are reading. This file could be ISO-8859-1 or another variant.
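Something along these lines, as a rough sketch only. The file name and the sample string are made up for illustration, and I'm assuming the file really is ISO-8859-1, where ³ is the single byte 0xb3 from your traceback:

```python
import os
import tempfile

# Hypothetical sample: "5µg/m³" written with escapes so this snippet does not
# depend on how the source file itself is saved.
SAMPLE = "5\u00b5g/m\u00b3"


def read_text(path, encoding):
    """Read a whole text file, decoding it with the given encoding."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def demo():
    """Return (utf8_failed, latin1_text) for a Latin-1 encoded temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "measures.txt")  # hypothetical file name
        # Write the bytes the way an ISO-8859-1 tool would: b"5\xb5g/m\xb3".
        with open(path, "wb") as handle:
            handle.write(SAMPLE.encode("iso-8859-1"))

        try:
            read_text(path, "utf-8")
            utf8_failed = False
        except UnicodeDecodeError:
            # 0xb5 and 0xb3 are not valid UTF-8 start bytes, so decoding fails.
            utf8_failed = True

        # Naming the real encoding decodes the same bytes cleanly.
        latin1_text = read_text(path, "iso-8859-1")
    return utf8_failed, latin1_text


if __name__ == "__main__":
    print(demo())
```

If the file came off a Windows box, "cp1252" is another candidate to try. Byte 0xb3 is ³ in both, so either should get you past this particular error.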
Cheers
Dave

On Sun, 8 Mar 2020 at 17:45, Mike Dewhirst <mi...@dewhirst.com.au> wrote:

> Oh well ... maybe it isn't Python's fault. I just looked at the data input
> file and found the ³ character in all places had been turned into a box.
> When I edited the boxes back into ³ it all went well.
>
> I used Filezilla to get the input files across so I'll focus on that next.
>
> Sorry to interrupt your long weekend.
>
> Cheers
>
> Mike
>
>
> On 8/03/2020 5:30 pm, Mike Dewhirst wrote:
>
> I'm now exclusively Python 3.6+ thank heavens but ...
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 6500:
> invalid start byte
>
> It just so happens that is the superscript 3 character. It also happens
> that superscript 3 displays correctly and works properly on Windows 10 but
> causes the above error on Ubuntu 18.04. I'm not paid enough to understand
> why - hence this email if anyone can help.
>
> My current pain is because I'm pumping data into a database (PostgreSQL)
> which needs such measures as 5µg/m³ and Python hates me.
>
> I think there is a valid argument for the Python utf-8 codec to
> "special-case" subscript and superscript numeral unicode collisions with
> ASCII or whatever Windows 10 uses. That would cover maths and chemistry
> both. And save me a lot of pain.
>
> Thanks for any sympathy and many, many thanks for help on getting past
> this.
>
> Cheers
>
> Mike
>
> PS: I use superscript and subscript numbers all the time because I'm
> involved with chemical data. Here is how I usually deal with it ...
>
>
> from django.utils.encoding import smart_text
> from django.utils.safestring import mark_safe
>
>
> def subscript_to_ascii(raw=None):
>     """Swap subscript unicode chars into ordinary numbers for
>     synonym searches.
>     """
>     formula = ""
>     clear = True
>     if raw is not None:
>         # for char in str(raw):
>         for char in raw:
>             if char == "[":
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 clear = True
>             if clear:
>                 if char == "\u2082":
>                     char = "2"
>                 elif char == "\u2083":
>                     char = "3"
>                 elif char == "\u2084":
>                     char = "4"
>                 elif char == "\u2085":
>                     char = "5"
>                 elif char == "\u2086":
>                     char = "6"
>                 elif char == "\u2087":
>                     char = "7"
>                 elif char == "\u2088":
>                     char = "8"
>                 elif char == "\u2089":
>                     char = "9"
>                 elif char == "\u2081":
>                     char = "1"
>                 elif char == "\u2080":
>                     char = "0"
>             formula += char
>     return smart_text(formula)
>
>
> def subscript(raw=None):
>     """Swap ordinary numbers for subscript unicode chars."""
>     formula = ""
>     clear = True
>     if raw is not None:
>         for char in raw:
>             if char == "[":
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 clear = True
>             if clear:
>                 if char == "2":
>                     char = "\u2082"
>                 elif char == "3":
>                     char = "\u2083"
>                 elif char == "4":
>                     char = "\u2084"
>                 elif char == "5":
>                     char = "\u2085"
>                 elif char == "6":
>                     char = "\u2086"
>                 elif char == "7":
>                     char = "\u2087"
>                 elif char == "8":
>                     char = "\u2088"
>                 elif char == "9":
>                     char = "\u2089"
>                 elif char == "1":
>                     char = "\u2081"
>                 elif char == "0":
>                     char = "\u2080"
>             formula += char
>     return smart_text(formula.encode("utf8"))
>
>
> lc50 = subscript(LC50)
> ld50 = subscript(LD50)
>
>
> def safesubscript(raw=None, ascii=False):
>     """Uses marksafe to subscript instead of unicode chars. This looks
>     better on screen but cannot be used in places.
>     """
>     formula = ""
>     clear = True
>     if raw is not None:
>         for char in raw:
>             if char == "[":
>                 # don't process any more digits just add to formula
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 # start processing again
>                 clear = True
>             if clear:
>                 if char == "2" or char == "\u2082":
>                     char = "<sub>2</sub>"
>                 elif char == "3" or char == "\u2083":
>                     char = "<sub>3</sub>"
>                 elif char == "4" or char == "\u2084":
>                     char = "<sub>4</sub>"
>                 elif char == "5" or char == "\u2085":
>                     char = "<sub>5</sub>"
>                 elif char == "6" or char == "\u2086":
>                     char = "<sub>6</sub>"
>                 elif char == "7" or char == "\u2087":
>                     char = "<sub>7</sub>"
>                 elif char == "8" or char == "\u2088":
>                     char = "<sub>8</sub>"
>                 elif char == "9" or char == "\u2089":
>                     char = "<sub>9</sub>"
>                 elif char == "1" or char == "\u2081":
>                     char = "<sub>1</sub>"
>                 elif char == "0" or char == "\u2080":
>                     char = "<sub>0</sub>"
>             formula += char
>     if ascii:
>         # strip both the opening and the closing tag in full
>         formula = formula.replace("<sub>", "").replace("</sub>", "")
>     return mark_safe(smart_text(formula))
_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug