Oh well ... maybe it isn't Python's fault. I just looked at the data input file and found the ³ character in all places had been turned into a box. When I edited the boxes back into ³ it all went well.

I used Filezilla to get the input files across so I'll focus on that next.

Sorry to interrupt your long weekend.

Cheers

Mike


On 8/03/2020 5:30 pm, Mike Dewhirst wrote:
I'm now exclusively Python 3.6+ thank heavens but ...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 6500: invalid start byte

It just so happens that is the superscript 3 character.  It also happens that superscript 3 displays correctly and works properly on Windows 10 but causes the above error on Ubuntu 18.04. I'm not paid enough to understand why - hence this email if anyone can help.
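
A minimal sketch of why that byte trips the codec, assuming the input file was written on Windows in a single-byte encoding such as cp1252 (an assumption; nothing above says which encoding the file actually uses). In cp1252 and Latin-1 the ³ character is the single byte 0xb3, while UTF-8 encodes it as the two bytes 0xc2 0xb3, so a lone 0xb3 is not valid UTF-8:

```python
# Assumption: the data file was saved as cp1252, a common Windows default.
raw = "m³".encode("cp1252")     # b'm\xb3'      one byte for the superscript 3
utf8 = "m³".encode("utf-8")     # b'm\xc2\xb3'  two bytes for the same character

try:
    raw.decode("utf-8")         # 0xb3 on its own is a continuation byte
    failed = False
except UnicodeDecodeError:
    failed = True               # same class of error as in the traceback above

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode("cp1252")
```

If that assumption holds, opening the file with an explicit encoding, for example open(path, encoding="cp1252") where path is a placeholder, should read it the same way on Windows and Ubuntu.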

My current pain is because I'm pumping data into a database (PostgreSQL) which needs such measures as 5µg/m³ and Python hates me.

I think there is a valid argument for the Python utf-8 codec to "special-case" subscript and superscript numeral unicode collisions with ASCII or whatever Windows 10 uses. That would cover maths and chemistry both. And save me a lot of pain.

Thanks for any sympathy and many, many thanks for help on getting past this.

Cheers

Mike

PS: I use superscript and subscript numbers all the time because I'm involved with chemical data. Here is how I usually deal with it ...



from django.utils.encoding import smart_text
from django.utils.safestring import mark_safe


def subscript_to_ascii(raw=None):
    """Swap subscript unicode chars into ordinary numbers for
    synonym searches.
    """
    formula = ""
    clear = True
    if raw is not None:
        # for char in str(raw):
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "\u2082":
                    char = "2"
                elif char == "\u2083":
                    char = "3"
                elif char == "\u2084":
                    char = "4"
                elif char == "\u2085":
                    char = "5"
                elif char == "\u2086":
                    char = "6"
                elif char == "\u2087":
                    char = "7"
                elif char == "\u2088":
                    char = "8"
                elif char == "\u2089":
                    char = "9"
                elif char == "\u2081":
                    char = "1"
                elif char == "\u2080":
                    char = "0"
            formula += char
    return smart_text(formula)


def subscript(raw=None):
    """Swap ordinary numbers for subscript unicode chars."""
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                clear = False  # permits [1] footnote references
            elif char == "]":
                clear = True
            if clear:
                if char == "2":
                    char = "\u2082"
                elif char == "3":
                    char = "\u2083"
                elif char == "4":
                    char = "\u2084"
                elif char == "5":
                    char = "\u2085"
                elif char == "6":
                    char = "\u2086"
                elif char == "7":
                    char = "\u2087"
                elif char == "8":
                    char = "\u2088"
                elif char == "9":
                    char = "\u2089"
                elif char == "1":
                    char = "\u2081"
                elif char == "0":
                    char = "\u2080"
            formula += char
    # formula is already a str on Python 3, so no encode/decode round trip
    return smart_text(formula)


lc50 = subscript("LC50")
ld50 = subscript("LD50")


def safesubscript(raw=None, ascii=False):
    """Uses mark_safe and <sub> tags to subscript instead of unicode
    chars. This looks better on screen but cannot be used everywhere.
    """
    formula = ""
    clear = True
    if raw is not None:
        for char in raw:
            if char == "[":
                # don't process any more digits, just add to formula
                clear = False  # permits [1] footnote references
            elif char == "]":
                # start processing again
                clear = True
            if clear:
                if char == "2" or char == "\u2082":
                    char = "<sub>2</sub>"
                elif char == "3" or char == "\u2083":
                    char = "<sub>3</sub>"
                elif char == "4" or char == "\u2084":
                    char = "<sub>4</sub>"
                elif char == "5" or char == "\u2085":
                    char = "<sub>5</sub>"
                elif char == "6" or char == "\u2086":
                    char = "<sub>6</sub>"
                elif char == "7" or char == "\u2087":
                    char = "<sub>7</sub>"
                elif char == "8" or char == "\u2088":
                    char = "<sub>8</sub>"
                elif char == "9" or char == "\u2089":
                    char = "<sub>9</sub>"
                elif char == "1" or char == "\u2081":
                    char = "<sub>1</sub>"
                elif char == "0" or char == "\u2080":
                    char = "<sub>0</sub>"
            formula += char
    if ascii:
        formula = formula.replace("<sub>", "").replace("</sub>", "")
    return mark_safe(smart_text(formula))
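
The long if/elif chains above could probably be replaced by lookup tables. This is only a sketch of that idea, not a drop-in replacement: translate_outside_brackets, TO_SUBSCRIPT and TO_ASCII are hypothetical names, and it returns a plain str rather than going through smart_text or mark_safe.

```python
# Translation tables built once: "0".."9" <-> U+2080..U+2089.
DIGITS = "0123456789"
SUBSCRIPTS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
TO_SUBSCRIPT = str.maketrans(DIGITS, SUBSCRIPTS)
TO_ASCII = str.maketrans(SUBSCRIPTS, DIGITS)


def translate_outside_brackets(raw, table):
    """Apply table to each char, skipping [1]-style footnote references."""
    out = []
    clear = True
    for char in raw or "":      # None is treated as an empty string
        if char == "[":
            clear = False       # stop translating inside the brackets
        elif char == "]":
            clear = True        # resume translating after the brackets
        out.append(char.translate(table) if clear else char)
    return "".join(out)


acid = translate_outside_brackets("H2SO4[1]", TO_SUBSCRIPT)
plain = translate_outside_brackets("H\u2082O", TO_ASCII)
```

The bracket handling mirrors the functions above, so a footnote marker such as [1] keeps its ordinary digit.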










_______________________________________________
melbourne-pug mailing list
melbourne-pug@python.org
https://mail.python.org/mailman/listinfo/melbourne-pug
