Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

sebb Thu, 07 Sep 2017 03:01:48 -0700

On 7 September 2017 at 06:06, Daniel Gruno <[email protected]> wrote:
> On 09/07/2017 12:24 AM, sebb wrote:
>> On 6 September 2017 at 07:32, Daniel Gruno <[email protected]> wrote:
>>> On 09/06/2017 12:09 AM, sebb wrote:
>>>> On 2 September 2017 at 09:02,  <[email protected]> wrote:
>>>>> Repository: incubator-ponymail
>>>>> Updated Branches:
>>>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>>>
>>>>>
>>>>> crop out trailing whitespace for redundant archiver
>>>>>
>>>>> This deals with spurious whitespace that can exist on
>>>>> clustered setups due to corrections inside the MTAs.
>>>>> This only deals with trailing whitespace, everything else
>>>>> is preserved.
>>>>
>>>> -1
>>>>
>>>> I don't think this is a good idea.
>>>
>>> Your -1 is noted, but I don't consider the reasoning valid for a veto,
>>> so I'll interpret this as just a plain -1.
>>
>> AIUI, that's not your call.
>
> It's not my call to determine whether technical merit is sound (that
> would be for the PPMC in such cases), but there has to be technical
> merit in -1 in the first place. Saying "this is not a good idea" does
> not convey technical reason. You've since elaborated on that in your
> reply, and _that_ I believe constitutes a technical reason.


Ah ok.

I guess I was too terse, I should have linked to the previous mails I
sent about the same issue.

>>
>>> I think it's a good idea, I think it solves some real problems that have
>>> been spotted in clustered setup. It could also solve problems where one
>>> archives as mbox with an extra newline by mistake. It's also an optional
>>> generator, not the default. Could you elaborate on why trailing
>>> whitespace would matter?
>>
>> I already wrote that ignoring whitespace causes a problem because it
>> means two different inputs end up with the same database id.
>> There's no way of knowing which one was correct; the wrong one may end
>> up being stored.
>
> But they would both have the same sender, date, list, message,
> attachments etc filed under the same ID - is that not what we want? What
> we _don't_ want is for trailing whitespace to cause duplicates. Put in
> other words: Why would we at all care whether one has the added newline
> or two and the other one doesn't? We're dealing with showing people
> emails, but bit-perfect of what was sent (including duplicates as a
> result of bit-diversion), but rather of what was intended.

I disagree; I think it's important to show the input email as exactly
as possible.
Whitespace trimming could damage some emails.

> If we wanted
> a perfect copy, we'd use the full digest and skip clustered setups all
> together, hoping machines don't die on us.

Not so, it must be possible to have perfect copies in clustered setups.
Otherwise clustered backup systems would be impossible.
It's just that the current design may make this tricky.

> This is for those rare
> occasions where something _does_ go wrong, and as seen, sometimes
> postfix will add some extra newlines - I still don't know why it does
> that in every case, I only know that it does, and likely other MTAs do
> as well.

That's largely my point.
The cause needs to be determined otherwise the generator is being used
to ignore what may be a bug.

Besides, in the cases I have seen (and noted on this list), it is not
only a difference in trailing whitespace.
The archived-at header is missing in one of the copies.
As I have written already, that points to non-identical treatment by
the different cluster members.

>
>>
>> If messages are being treated differently by different parts of the
>> cluster then that needs to be addressed.
>>
>>>>
>>>> Please raise an issue for significant changes such as this.
>>>
>>> I don't see it as significant changes - it's an optional generator that
>>> we haven't released, it's not gonna break any API/ABI we currently have.
>>
>> Once a generator has been released, it effectively becomes part of the
>> public API.
>> It's then not possible to fix it without breaking the API.
>> So any problems with the code must be fixed before release.
>>
>>> If I am to raise a ticket, it would be for the full generator to be
>>> admitted as an option.
>>
>> That is definitely a separate ticket.
>>
>>>>
>>>>> Also add some spacing in the code.
>>>>>
>>>>>
>>>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>>>>> Commit: 
>>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
>>>>> Tree: 
>>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
>>>>> Diff: 
>>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>>>>>
>>>>> Branch: refs/heads/master
>>>>> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
>>>>> Parents: c8f4d3b
>>>>> Author: Daniel Gruno <[email protected]>
>>>>> Authored: Sat Sep 2 10:02:10 2017 +0200
>>>>> Committer: Daniel Gruno <[email protected]>
>>>>> Committed: Sat Sep 2 10:02:10 2017 +0200
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>  tools/generators.py | 9 +++++++++
>>>>>  1 file changed, 9 insertions(+)
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>>
>>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
>>>>> ----------------------------------------------------------------------
>>>>> diff --git a/tools/generators.py b/tools/generators.py
>>>>> index e320a06..b37fa8b 100644
>>>>> --- a/tools/generators.py
>>>>> +++ b/tools/generators.py
>>>>> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony 
>>>>> Mail's archivers.
>>>>>  import hashlib
>>>>>  import email.utils
>>>>>  import time
>>>>> +import re
>>>>>
>>>>>  # Full generator: uses the entire email (including server-dependent data)
>>>>>  # This is the recommended generator for single-node setups.
>>>>> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>>>>>      if not body: # Make sure body is not None, which will fail.
>>>>>          body = ""
>>>>>      xbody = body if type(body) is bytes else body.encode('ascii', 
>>>>> 'ignore')
>>>>> +
>>>>> +    # Crop out any trailing whitespace in body
>>>>> +    xbody = re.sub(b"\s+$", b"", xbody)
>>>>> +
>>>>>      # Use List ID
>>>>>      xbody += bytes(lid, encoding='ascii')
>>>>> +
>>>>>      # Use Date header. Don't use archived-at, as the archiver sets this 
>>>>> if not present.
>>>>>      mdate = None
>>>>>      mdatestring = "(null)" # Default to null, ONLY changed if replicable 
>>>>> across imports
>>>>> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>>>>>      except:
>>>>>          pass
>>>>>      xbody += bytes(mdatestring, encoding='ascii')
>>>>> +
>>>>>      # Use sender
>>>>>      sender = msg.get('from', None)
>>>>>      if sender:
>>>>>          xbody += bytes(sender, encoding = 'ascii')
>>>>> +
>>>>>      # Use subject
>>>>>      subject = msg.get('subject', None)
>>>>>      if subject:
>>>>>          xbody += bytes(subject, encoding = 'ascii')
>>>>> +
>>>>>      # Use attachment hashes if present
>>>>>      if attachments:
>>>>>          for a in attachments:
>>>>>
>>>
>

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Reply via email to