AFAICT this will generate different hashes for the same message if
it is loaded from a different source, since the hash is taken over
the raw bytes.

Whilst it should ensure that distinct messages don't clash, it won't
weed out actual duplicates.

Ideally the parts that vary between duplicates from different sources
should be omitted from the hash.
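
A rough sketch of what I mean, assuming the differences between copies
are confined to a handful of transport headers. The VOLATILE_HEADERS set
and the stable_digest name are made up for illustration; they are not
taken from import-mbox.py or archiver.py, and the real list of varying
fields would need to be worked out from actual duplicates.

```python
import hashlib

# Hypothetical set of headers assumed to differ between copies of the
# same email received via different routes. Not an authoritative list.
VOLATILE_HEADERS = {b"received", b"delivered-to", b"return-path"}


def raw_digest(message_raw: bytes) -> str:
    """Hash the raw bytes as-is, as the commit does."""
    return hashlib.sha3_256(message_raw).hexdigest()


def stable_digest(message_raw: bytes) -> str:
    """Hash the message with the volatile headers left out."""
    # Normalise line endings so CRLF and LF copies compare equal.
    data = message_raw.replace(b"\r\n", b"\n")
    head, _, body = data.partition(b"\n\n")
    kept = []
    skip = False
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            # Folded continuation line: shares the fate of its header.
            if not skip:
                kept.append(line)
            continue
        name = line.split(b":", 1)[0].strip().lower()
        skip = name in VOLATILE_HEADERS
        if not skip:
            kept.append(line)
    return hashlib.sha3_256(b"\n".join(kept) + b"\n\n" + body).hexdigest()


# Two copies of one message that took different delivery paths.
copy_a = (b"Received: from mx1.example.org\r\n\tby a.example.org\r\n"
          b"Message-ID: <1@example.org>\r\nSubject: hi\r\n\r\nbody\r\n")
copy_b = (b"Received: from mx2.example.org\n\tby b.example.org\n"
          b"Message-ID: <1@example.org>\nSubject: hi\n\nbody\n")

print(raw_digest(copy_a) == raw_digest(copy_b))        # False
print(stable_digest(copy_a) == stable_digest(copy_b))  # True
```

Messages that differ in any retained header or in the body still get
distinct digests, so distinct messages should not clash.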

However, it is better than the current situation.

On Tue, 25 Aug 2020 at 19:02, <[email protected]> wrote:
>
> This is an automated email from the ASF dual-hosted git repository.
>
> humbedooh pushed a commit to branch master
> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
>
>
> The following commit(s) were added to refs/heads/master by this push:
>      new 8a1baf1  store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
> 8a1baf1 is described below
>
> commit 8a1baf16c64933f1fb7a9d281df8ebec418cf771
> Author: Daniel Gruno <[email protected]>
> AuthorDate: Tue Aug 25 20:02:44 2020 +0200
>
>     store sources as sha3-256 of themselves, add a permalink reference to the digested doc.
>
>     Conform to what archiver.py does.
>
>     This allows us to store multiple copies of the same digested email, if
>     received more than once (and if the raw data differs).
> ---
>  tools/import-mbox.py | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/tools/import-mbox.py b/tools/import-mbox.py
> index 482a2b8..663e6da 100755
> --- a/tools/import-mbox.py
> +++ b/tools/import-mbox.py
> @@ -38,6 +38,7 @@ from os.path import isdir, isfile, join
>  from threading import Lock, Thread
>  from urllib.request import urlopen
>
> +
>  import archiver
>  from plugins.elastic import Elastic
>
> @@ -208,6 +209,7 @@ class SlurpThread(Thread):
>              for key in messages.iterkeys():
>                  message = messages.get(key)
>                  message_raw = messages.get_bytes(key)
> +                sha3 = hashlib.sha3_256(message_raw).hexdigest()
>                  # If --filter is set, discard any messages not matching by continuing to next email
>                  if (
>                      fromFilter
> @@ -313,9 +315,8 @@ class SlurpThread(Thread):
>                      try:  # temporary hack to try and find an encoding issue
>                          # needs to be replaced by proper exception handling
>                          json_source = {
> -                            "mid": json[
> -                                "mid"
> -                            ],  # needed for bulk-insert only, not needed in database
> +                            "permalink": json["mid"],
> +                            "mid": sha3,
>                              "message-id": json["message-id"],
>                              "source": archiver.mbox_source(raw_msg),
>                          }
>
