Re: [melbourne-pug] Superscript chars are a pain

David Micallef Fri, 13 Mar 2020 05:17:51 -0700

Hi Mike

I could be missing something, but is there an opportunity to set the
encoding when you're reading the file? The default on Ubuntu is normally
utf-8 (open() falls back to the locale's preferred encoding, which is why
Windows can behave differently), though you can set encoding to the actual
encoding of the file that you are reading. These files could be ISO-8859-1
or another variant.
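
As a rough sketch of what I mean (not tested against your data): byte 0xb3
is the superscript 3 in ISO-8859-1 and also in Windows-1252, so I am
assuming the input file is in one of those. The helper name read_text and
its default encoding are just made up for illustration.

```python
# The measure from the original post, encoded the way a Latin-1 file
# would store it. Superscript three becomes the single byte 0xb3.
raw = "5µg/m³".encode("iso-8859-1")  # b'5\xb5g/m\xb3'

# Decoding those bytes as UTF-8 fails, because 0xb5 and 0xb3 are not
# valid UTF-8 start bytes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc)

# Decoding with the encoding the bytes were written in works.
text = raw.decode("iso-8859-1")
print(text)


def read_text(path, encoding="iso-8859-1"):
    """Read a whole text file using an explicit encoding.

    The default here is an assumption about the input files; pass
    "cp1252" or "utf-8" instead if that is what they really are.
    """
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

Passing encoding explicitly means the result no longer depends on the
locale of the machine the script runs on, which might explain the
Windows 10 versus Ubuntu difference.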


Cheers

Dave

On Sun, 8 Mar 2020 at 17:45, Mike Dewhirst <mi...@dewhirst.com.au> wrote:

> Oh well ... maybe it isn't Python's fault. I just looked at the data input
> file and found the ³ character in all places had been turned into a box.
> When I edited the boxes back into ³ it all went well.
>
> I used Filezilla to get the input files across so I'll focus on that next.
>
> Sorry to interrupt your long weekend.
>
> Cheers
>
> Mike
>
>
> On 8/03/2020 5:30 pm, Mike Dewhirst wrote:
>
> I'm now exclusively Python 3.6+ thank heavens but ...
>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 6500:
> invalid start byte
>
> It just so happens that is the superscript 3 character.  It also happens
> that superscript 3 displays correctly and works properly on Windows 10 but
> causes the above error on Ubuntu 18.04. I'm not paid enough to understand
> why - hence this email if anyone can help.
>
> My current pain is because I'm pumping data into a database (PostgreSQL)
> which needs such measures as 5µg/m³ and Python hates me.
>
> I think there is a valid argument for the Python utf-8 codec to
> "special-case" subscript and superscript numeral unicode collisions with
> ASCII or whatever Windows 10 uses. That would cover maths and chemistry
> both. And save me a lot of pain.
>
> Thanks for any sympathy and many, many thanks for help on getting past
> this.
>
> Cheers
>
> Mike
>
> PS: I use superscript and subscript numbers all the time because I'm
> involved with chemical data. Here is how I usually deal with it ...
>
>
>
> from django.utils.encoding import smart_text
> from django.utils.safestring import mark_safe
>
>
> def subscript_to_ascii(raw=None):
>     """Swap subscript unicode chars into ordinary numbers for
>     synonym searches.
>     """
>     formula = ""
>     clear = True
>     if raw is not None:
>         # for char in str(raw):
>         for char in raw:
>             if char == "[":
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 clear = True
>             if clear:
>                 if char == "\u2082":
>                     char = "2"
>                 elif char == "\u2083":
>                     char = "3"
>                 elif char == "\u2084":
>                     char = "4"
>                 elif char == "\u2085":
>                     char = "5"
>                 elif char == "\u2086":
>                     char = "6"
>                 elif char == "\u2087":
>                     char = "7"
>                 elif char == "\u2088":
>                     char = "8"
>                 elif char == "\u2089":
>                     char = "9"
>                 elif char == "\u2081":
>                     char = "1"
>                 elif char == "\u2080":
>                     char = "0"
>             formula += char
>     return smart_text(formula)
>
>
> def subscript(raw=None):
>     """Swap ordinary numbers for subscript unicode chars."""
>     formula = ""
>     clear = True
>     if raw is not None:
>         for char in raw:
>             if char == "[":
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 clear = True
>             if clear:
>                 if char == "2":
>                     char = "\u2082"
>                 elif char == "3":
>                     char = "\u2083"
>                 elif char == "4":
>                     char = "\u2084"
>                 elif char == "5":
>                     char = "\u2085"
>                 elif char == "6":
>                     char = "\u2086"
>                 elif char == "7":
>                     char = "\u2087"
>                 elif char == "8":
>                     char = "\u2088"
>                 elif char == "9":
>                     char = "\u2089"
>                 elif char == "1":
>                     char = "\u2081"
>                 elif char == "0":
>                     char = "\u2080"
>             formula += char
>     return smart_text(formula.encode("utf8"))
>
>
> lc50 = subscript("LC50")
> ld50 = subscript("LD50")
>
>
> def safesubscript(raw=None, ascii=False):
>     """Uses mark_safe with <sub> tags instead of unicode subscript
>     chars. This looks better on screen but cannot be used everywhere.
>     """
>     formula = ""
>     clear = True
>     if raw is not None:
>         for char in raw:
>             if char == "[":
>                 # don't process any more digits, just add to formula
>                 clear = False  # permits [1] footnote references
>             elif char == "]":
>                 # start processing again
>                 clear = True
>             if clear:
>                 if char == "2" or char == "\u2082":
>                     char = "<sub>2</sub>"
>                 elif char == "3" or char == "\u2083":
>                     char = "<sub>3</sub>"
>                 elif char == "4" or char == "\u2084":
>                     char = "<sub>4</sub>"
>                 elif char == "5" or char == "\u2085":
>                     char = "<sub>5</sub>"
>                 elif char == "6" or char == "\u2086":
>                     char = "<sub>6</sub>"
>                 elif char == "7" or char == "\u2087":
>                     char = "<sub>7</sub>"
>                 elif char == "8" or char == "\u2088":
>                     char = "<sub>8</sub>"
>                 elif char == "9" or char == "\u2089":
>                     char = "<sub>9</sub>"
>                 elif char == "1" or char == "\u2081":
>                     char = "<sub>1</sub>"
>                 elif char == "0" or char == "\u2080":
>                     char = "<sub>0</sub>"
>             formula += char
>     if ascii:
>         formula = formula.replace("<sub>", "").replace("</sub>", "")
>     return mark_safe(smart_text(formula))
>
>
> _______________________________________________
> melbourne-pug mailing list
> melbourne-pug@python.org
> https://mail.python.org/mailman/listinfo/melbourne-pug
>

