On 05/09/2020 17.33, sebb wrote:
On Sat, 5 Sep 2020 at 08:54, <[email protected]> wrote:

This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
      new fafc765  Refactor, drop the double decode attempt.
fafc765 is described below

commit fafc7651d9d02dfde727bd1f0da13722de8b3c38
Author: Daniel Gruno <[email protected]>
AuthorDate: Sat Sep 5 09:54:03 2020 +0200

     Refactor, drop the double decode attempt.

     We should assume US-ASCII, but if it's not, it's quicker,
     processing-wise, to immediately fall back to utf-8 instead of trying to
     first determine if it is indeed UTF-8-worthy. Either it'll work as
     US-ASCII, or it will work with the UTF-8 with 'replace'.

This info belongs in the code.


And it was put in the code as well. If you don't like me doing long git comments, just say that.

---
  tools/archiver.py | 13 ++++++++-----
  1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/tools/archiver.py b/tools/archiver.py
index f875cc3..c52207b 100755
--- a/tools/archiver.py
+++ b/tools/archiver.py
@@ -192,12 +192,15 @@ class Body:
                          break
                      except UnicodeDecodeError:
                          pass
+            # If no character set was defined, the email MUST be US-ASCII by RFC822 defaults
+            # This isn't always the case, as we're about to discover.
              if not self.string:
-                self.string = self.bytes.decode("us-ascii", errors="replace")
-                if valid_encodings:
-                    self.character_set = "us-ascii"
-                # If no character encoding, but we find non-ASCII chars, assume bytes were UTF-8
-                elif len(self.bytes) != len(self.bytes.decode("us-ascii", "ignore")):
+                try:
+                    self.string = self.bytes.decode("us-ascii", errors="strict")
+                    if valid_encodings:
+                        self.character_set = "us-ascii"
+                except UnicodeDecodeError:
+                    # If us-ascii strict fails, it's probably undeclared UTF-8.
                      # Set the .string, but not a character set, as we don't know it for sure.
                      # This is mainly so the older generators won't barf.
                      self.string = self.bytes.decode("utf-8", "replace")
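
For anyone reading along without the full file, the fallback described in the commit message can be sketched as a standalone function. This is only an illustration, not the code in archiver.py: the name decode_undeclared is made up here, and the valid_encodings check from the diff is left out for brevity.

```python
from typing import Optional, Tuple


def decode_undeclared(raw: bytes) -> Tuple[str, Optional[str]]:
    """Decode a body that declared no character set.

    Hypothetical helper for illustration only; it is not part of archiver.py.
    Returns (text, charset), where charset is None if we had to guess.
    """
    try:
        # One strict pass: pure US-ASCII input succeeds immediately.
        return raw.decode("us-ascii", errors="strict"), "us-ascii"
    except UnicodeDecodeError:
        # Probably undeclared UTF-8. Decode with 'replace' so invalid bytes
        # become U+FFFD, and report no charset since we cannot be sure.
        return raw.decode("utf-8", errors="replace"), None


if __name__ == "__main__":
    print(decode_undeclared(b"plain ascii"))
    print(decode_undeclared("h\u00e9llo".encode("utf-8")))
    print(decode_undeclared(b"caf\xe9"))  # Latin-1 byte, not valid UTF-8
```

The ASCII case costs a single decode, and the non-ASCII case costs one failed decode plus one UTF-8 decode, which is the trade-off the commit message is arguing for.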

