David

Thanks for responding. And sorry to all - I should have reported that
Javier solved my problem.

Windows 10 has some sort of gotcha with utf16 which it automatically imposes
if you don't watch closely. Just transferring a utf8 file using ftp does it.
The solution was to edit the file after transfer to Linux and save again as
utf8.

Microsoft is the pain, not superscripts.

Cheers

Mike

PS. This android phone top-posts automatically and I can't be bothered
figuring out how to defeat that.
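The "save again as utf8" step can also be scripted rather than done in an editor. What follows is only a minimal sketch: the function name `resave_as_utf8` and the file name in the usage line are hypothetical, and the thread never establishes which encoding the transferred file really had, so `source_encoding` is an assumption the caller has to supply.

```python
from pathlib import Path


def resave_as_utf8(path, source_encoding):
    """Rewrite the file at `path` in place as UTF-8 and return its text.

    `source_encoding` is an assumption supplied by the caller; this
    function does not try to detect the encoding.
    """
    file_path = Path(path)
    # Decode the raw bytes with the encoding the file is believed to use ...
    text = file_path.read_bytes().decode(source_encoding)
    # ... then write the same text back out encoded as UTF-8.
    file_path.write_bytes(text.encode("utf-8"))
    return text
```

Usage would look something like `resave_as_utf8("measures.txt", "cp1252")`. Working on bytes rather than text-mode files leaves the line endings exactly as they were.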
-------- Original message --------
From: David Micallef <[email protected]>
Date: 8/3/20 19:31 (GMT+10:00)
To: Melbourne Python Users Group <[email protected]>
Subject: Re: [melbourne-pug] Superscript chars are a pain

Hi Mike

I could be missing something, but is there an opportunity to set the encoding
when you're reading the file? The default is utf-8, though you can set the
encoding to be the actual encoding of the file that you are reading. These
files could be ISO-8859-1 or another variant.

Cheers

Dave

On Sun, 8 Mar 2020 at 17:45, Mike Dewhirst <[email protected]> wrote:
Oh well ... maybe it isn't Python's fault. I just looked at the data input
file and found the ³ character in all places had been turned into a box.
When I edited the boxes back into ³ it all went well.

I used Filezilla to get the input files across so I'll focus on that next.

Sorry to interrupt your long weekend.

Cheers

Mike

On 8/03/2020 5:30 pm, Mike Dewhirst wrote:

I'm now exclusively Python 3.6+ thank heavens but ...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 6500: invalid start byte

It just so happens that is the superscript 3 character. It also happens that
superscript 3 displays correctly and works properly on Windows 10 but causes
the above error on Ubuntu 18.04. I'm not paid enough to understand why -
hence this email if anyone can help.
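One plausible explanation, offered as an assumption because the thread never confirms the file's real encoding: in single-byte encodings such as cp1252 and ISO-8859-1, ³ is stored as the lone byte 0xb3, whereas UTF-8 stores it as the two bytes 0xc2 0xb3. When no `encoding` argument is given, `open()` uses the locale's preferred encoding, which is typically cp1252 on a Western Windows 10 install and UTF-8 on Ubuntu, so the same file can read cleanly on one and fail on the other. The short sketch below reproduces the behaviour with a made-up two-byte sample.

```python
# "m³" as a single-byte encoding such as cp1252 would store it (sample data).
raw = b"m\xb3"

# UTF-8 cannot start a character with 0xb3, hence the reported error.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    decoded_as_utf8 = False
    reason = exc.reason  # "invalid start byte"

# The same bytes decode cleanly once the single-byte encoding is named.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# For comparison, the UTF-8 form of the same text uses two bytes for ³.
as_utf8_bytes = "m\u00b3".encode("utf-8")
```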

My current pain is because I'm pumping data into a database (PostgreSQL)
which needs such measures as 5µg/m³ and Python hates me.

I think there is a valid argument for the Python utf-8 codec to
"special-case" subscript and superscript numeral unicode collisions with
ASCII or whatever Windows 10 uses. That would cover maths and chemistry
both. And save me a lot of pain.

Thanks for any sympathy and many, many thanks for help on getting past
this.

Cheers

Mike

PS: I use superscript and subscript numbers all the time because I'm
involved with chemical data. Here is how I usually deal with it ...

from django.utils.encoding import smart_text
from django.utils.safestring import mark_safe


def subscript_to_ascii(raw=None):
    """Swap subscript unicode chars into ordinary numbers for
    synonym searches.
    """
    formula = ""
    clear = True
    if raw is not None:
        # for char in str(raw):
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "\u2082":
                    char = "2"
                elif char == "\u2083":
                    char = "3"
                elif char == "\u2084":
                    char = "4"
                elif char == "\u2085":
                    char = "5"
                elif char == "\u2086":
                    char = "6"
                elif char == "\u2087":
                    char = "7"
                elif char == "\u2088":
                    char = "8"
                elif char == "\u2089":
                    char = "9"
                elif char == "\u2081":
                    char = "1"
                elif char == "\u2080":
                    char = "0"
            formula += char
    return smart_text(formula)


def subscript(raw=None):
    """Swap ordinary numbers for subscript unicode chars."""
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "2":
                    char = "\u2082"
                elif char == "3":
                    char = "\u2083"
                elif char == "4":
                    char = "\u2084"
                elif char == "5":
                    char = "\u2085"
                elif char == "6":
                    char = "\u2086"
                elif char == "7":
                    char = "\u2087"
                elif char == "8":
                    char = "\u2088"
                elif char == "9":
                    char = "\u2089"
                elif char == "1":
                    char = "\u2081"
                elif char == "0":
                    char = "\u2080"
            formula += char
    # formula is already a str on Python 3, so no encode() round trip is needed
    return smart_text(formula)


lc50 = subscript("LC50")
ld50 = subscript("LD50")


def safesubscript(raw=None, ascii=False):
    """Uses mark_safe to subscript instead of unicode chars.

    This looks better on screen but cannot be used in some places.
    """
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                # don't process any more digits, just add to formula
                clear = False  # permits [1] footnote references
            elif char == "]":
                # start processing again
                clear = True
            if clear:
                if char == "2" or char == "\u2082":
                    char = "<sub>2</sub>"
                elif char == "3" or char == "\u2083":
                    char = "<sub>3</sub>"
                elif char == "4" or char == "\u2084":
                    char = "<sub>4</sub>"
                elif char == "5" or char == "\u2085":
                    char = "<sub>5</sub>"
                elif char == "6" or char == "\u2086":
                    char = "<sub>6</sub>"
                elif char == "7" or char == "\u2087":
                    char = "<sub>7</sub>"
                elif char == "8" or char == "\u2088":
                    char = "<sub>8</sub>"
                elif char == "9" or char == "\u2089":
                    char = "<sub>9</sub>"
                elif char == "1" or char == "\u2081":
                    char = "<sub>1</sub>"
                elif char == "0" or char == "\u2080":
                    char = "<sub>0</sub>"
            formula += char
    if ascii:
        # strip the tags again; the closing tag needs its ">" to match
        formula = formula.replace("<sub>", "").replace("</sub>", "")
    return mark_safe(smart_text(formula))
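
The two plain-text conversions above follow the same pattern, so they could share one lookup-table helper. This is only a sketch of that idea without the Django wrappers: the names `SUB_TO_ASCII`, `ASCII_TO_SUB` and `translate_outside_brackets` are hypothetical, and it assumes the only rule to preserve is "leave anything inside [ ] alone".

```python
# Hypothetical lookup tables in the form str.translate expects
# (code point -> replacement string). U+2080..U+2089 are subscript 0..9.
SUB_TO_ASCII = {0x2080 + n: str(n) for n in range(10)}
ASCII_TO_SUB = {ord(str(n)): chr(0x2080 + n) for n in range(10)}


def translate_outside_brackets(raw, table):
    """Apply `table` to each character that is not inside [...].

    Bracketed text is copied unchanged so that footnote references
    such as [1] keep their ordinary digits.
    """
    if raw is None:
        return ""
    pieces = []
    clear = True
    for char in raw:
        if char == "[":
            clear = False  # stop converting until the closing bracket
        elif char == "]":
            clear = True
        pieces.append(char.translate(table) if clear else char)
    return "".join(pieces)
```

With these tables, `translate_outside_brackets("LC50", ASCII_TO_SUB)` plays the role of `subscript("LC50")`, and passing `SUB_TO_ASCII` gives the reverse direction.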
_______________________________________________
melbourne-pug mailing list
[email protected]
https://mail.python.org/mailman/listinfo/melbourne-pug