On 25/08/2020 19.56, sebb wrote:
On Tue, 25 Aug 2020 at 18:49, Daniel Gruno <[email protected]> wrote:

On 25/08/2020 19.43, sebb wrote:
On Tue, 25 Aug 2020 at 17:47, Daniel Gruno <[email protected]> wrote:

On 25/08/2020 18.37, sebb wrote:
-1

This will likely change the generated ID for emails that already have
archived-at headers

No, it will not change that, as they already will have it. This is
fixing an issue with randomness in the archiving.

See
https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816

Yes, that adds an archived-at header if the mail does not already have one.
Since this is added before the generator is called, it may cause
generators to produce different output.
This particularly affects the full generator.

Right, if I have an email without the archived-at header, then reloading
that will produce a different result every time. That's what I was
hoping to avoid.

Which is easy to do - don't add the archived-at header, at least not
until the generator has been called.

Right, which is what I am planning to do. I don't think it'll change anything (in foal at least), as the addition doesn't play into the source at all there.


This will also fix the issue for other generators that may make use of
the header.


If the message already has an archived-at header, then the full
generator output should be stable, provided it is given the same
input.

But if the input is switched to raw bytes from the reconstructed
message, this will most likely change it, unnecessarily changing at
least some ids.

As I just have discovered, this is in fact happening, sort of, because
.as_bytes does (unintended) normalization of certain headers :\

I guess we'll have to rely on .as_bytes to never change that behavior
instead, for the full generator. Unrelated to your -1, but still
annoying as heck.

Yes, that's a separate issue.
Future versions may or may not be different.
But at least if users are happy to stay on the current version the
output will be the same.

I've also noticed that the unit tests are spitting out a toooon of error
messages that I can't quite understand. unrelated to the test results.



What happens is, an archived-at with *the current timestamp* will get
added to all messages that do not have it, so it's not really helping at
all with anything when you use the .as_bytes, as that data will always
be different unless there already is such a header.

The raw_msg will deliver a much more reliable result going forward, as
it doesn't add "random" data to the mix on each reload (because it is
the original raw data without headers appended on the fly).

I hope this clears matters up.


Far better to ensure the archived-at header is not added to the parsed
mail before the generator is used.

On Tue, 25 Aug 2020 at 17:32, <[email protected]> wrote:

This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
        new 6dfb9d8  Use raw_msg instead of as_bytes, as the latter has a new 
archived-at appended sometimes (and we don't want that)
6dfb9d8 is described below

commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c
Author: Daniel Gruno <[email protected]>
AuthorDate: Tue Aug 25 18:32:33 2020 +0200

       Use raw_msg instead of as_bytes, as the latter has a new archived-at 
appended sometimes (and we don't want that)
---
    tools/plugins/generators.py | 8 ++++----
    1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py
index 949a3b6..344f115 100644
--- a/tools/plugins/generators.py
+++ b/tools/plugins/generators.py
@@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg):
    # Full generator: uses the entire email (including server-dependent data)
    # Used by default until August 2020.
    # See 'dkim' for recommended generation.
-def full(msg, _body, lid, _attachments, _raw_msg):
+def full(_msg, _body, lid, _attachments, raw_msg):
        """
        Full generator: uses the entire email
        (including server-dependent data)
@@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg):
        but different copies of the message are likely to have different 
headers, thus ids

        Parameters:
-    msg - the parsed message
+    msg - the parsed message (not used)
        _body - the parsed text content (not used)
        lid - list id
        _attachments - list of attachments (not used)
-    _raw_msg - the original message bytes (not used)
+    raw_msg - the original message bytes

        Returns: "<hash>@<lid>" where hash is sha224 of message bytes
        """
-    mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid)
+    mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid)
        return mid






Reply via email to