On 25/08/2020 20.15, sebb wrote:
AFAICT this will generate different hashes for the same message if
they are loaded from a different source.

Yes, it will; at present that is on purpose. We could look at running the message through Sean's DKIM parser and hashing only its output, with x-archived-list-id taken from the command-line --lid argument when that differs from the canonical list id.
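
To illustrate the behaviour under discussion, here is a minimal sketch, not the importer's actual code path. The helper name source_hash and the sample messages are made up for illustration; only the hashlib.sha3_256 call mirrors the commit. It shows that two copies of one message that differ only in a transport header get different hashes when the raw bytes are hashed:

```python
import hashlib


def source_hash(raw: bytes) -> str:
    """SHA3-256 hex digest of the raw message bytes (same call as in the commit)."""
    return hashlib.sha3_256(raw).hexdigest()


# Hypothetical sample: one message that arrived via two different routes.
common = b"Message-ID: <abc@example.org>\r\nSubject: hello\r\n\r\nHi there\r\n"
copy_a = b"Received: from mx1.example.org\r\n" + common
copy_b = b"Received: from mx2.example.org\r\n" + common

hash_a = source_hash(copy_a)
hash_b = source_hash(copy_b)

# Same logical message, different raw bytes, so the digests differ.
print(hash_a == hash_b)
```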


Whilst it should ensure that distinct messages don't clash, it won't
weed out actual duplicates.

Right, aware of that. In most cases, if you are reloading, you are doing so with a fresh DB, and it won't matter much. In cases where you are "cascading" mbox files, it would create duplicates, but for now that is only a question of disk space. Duplicate source files won't cause malfunctions; they just use a few more bytes and provide alternative sources.


Ideally the parts that vary between duplicates from different sources
should be omitted from the hash.
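
One possible shape for this, as a sketch only: parse the message, drop headers that are assumed to vary by delivery path, and hash what remains. The VOLATILE_HEADERS tuple and the normalized_hash name are hypothetical; which headers actually vary between sources would need to be established, and this is not the DKIM-based approach mentioned above.

```python
import email
import hashlib

# Hypothetical list of headers assumed to differ between delivery paths.
VOLATILE_HEADERS = ("Received", "Return-Path", "Delivered-To")


def normalized_hash(raw: bytes) -> str:
    """Hash the message after removing headers assumed to vary by source."""
    msg = email.message_from_bytes(raw)
    for name in VOLATILE_HEADERS:
        del msg[name]  # deletes every occurrence; silently ignores absent headers
    return hashlib.sha3_256(msg.as_bytes()).hexdigest()


# Hypothetical sample: the same message received via two routes.
common = b"Message-ID: <abc@example.org>\r\nSubject: hello\r\n\r\nHi there\r\n"
copy_a = b"Received: from mx1.example.org\r\n" + common
copy_b = b"Received: from mx2.example.org\r\n" + common
other = common.replace(b"Hi there", b"Something else")

print(normalized_hash(copy_a) == normalized_hash(copy_b))
```

With this kind of normalization, duplicates from different sources collapse to one hash, while a message with different content still hashes differently.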

However it is better than the current situation.

On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:

This is an automated email from the ASF dual-hosted git repository.

humbedooh pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git


The following commit(s) were added to refs/heads/master by this push:
      new 8a1baf1  store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
8a1baf1 is described below

commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
Author: Daniel Gruno <[email protected]>
AuthorDate: Tue Aug 25 20:02:44 2020 +0200

     store sources as sha3-256 of themselves, add a permalink reference to the digested doc.

     Conform to what archiver.py does.

     This allows us to store multiple copies of the same digested email, if
     received more than once (and if the raw data differs).
---
  tools/import-mbox.py | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/import-mbox.py b/tools/import-mbox.py
index 482a2b8..663e6da 100755
--- a/tools/import-mbox.py
+++ b/tools/import-mbox.py
@@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
  from threading import Lock, Thread
  from urllib.request import urlopen

+
  import archiver
  from plugins.elastic import Elastic

@@ -208,6 +209,7 @@ class SlurpThread(Thread):
              for key in messages.iterkeys():
                  message = messages.get(key)
                  message_raw = messages.get_bytes(key)
+                sha3 = hashlib.sha3_256(message_raw).hexdigest()
                  # If --filter is set, discard any messages not matching by continuing to next email
                  if (
                      fromFilter
@@ -313,9 +315,8 @@ class SlurpThread(Thread):
                      try:  # temporary hack to try and find an encoding issue
                          # needs to be replaced by proper exception handling
                          json_source = {
-                            "mid": json[
-                                "mid"
-                            ],  # needed for bulk-insert only, not needed in database
+                            "permalink": json["mid"],
+                            "mid": sha3,
                              "message-id": json["message-id"],
                              "source": archiver.mbox_source(raw_msg),
                          }
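
As a rough illustration of what the change above means for storage, the following sketch builds source documents keyed by the SHA3-256 of the raw bytes, each pointing back at the digested doc through "permalink". The function build_source_doc, the sample id "digest-id-123" and the plain decode used in place of archiver.mbox_source are assumptions for illustration only.

```python
import hashlib


def build_source_doc(raw_msg: bytes, digest_mid: str, message_id: str) -> dict:
    """Build a source document shaped like json_source in the diff."""
    return {
        "permalink": digest_mid,  # id of the digested doc
        "mid": hashlib.sha3_256(raw_msg).hexdigest(),  # id of this source copy
        "message-id": message_id,
        # Stand-in for archiver.mbox_source(raw_msg); the real encoding may differ.
        "source": raw_msg.decode("utf-8", errors="replace"),
    }


# Hypothetical sample: two raw variants of one email, one of them seen twice.
common = b"Message-ID: <abc@example.org>\r\n\r\nHi there\r\n"
raw_one = b"Received: from mx1.example.org\r\n" + common
raw_two = b"Received: from mx2.example.org\r\n" + common

docs = {}
for raw in (raw_one, raw_two, raw_one):
    doc = build_source_doc(raw, "digest-id-123", "<abc@example.org>")
    docs[doc["mid"]] = doc  # identical raw bytes overwrite the same entry

print(len(docs))
```

Byte-identical copies map to the same "mid" and are stored once, while copies whose raw data differs are kept side by side under the same permalink.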

