[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

GitBox Mon, 14 Sep 2020 03:04:19 -0700


sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691953754

Imagine we have a small mailing list archive in an mbox file called
`ancient.mbox`. This mbox archive contains three emails, none of which contain
a List-ID header. We import it using the manual command line List-ID
`alt.small.archive`, and use the DKIM-ID generator that appends the manual
List-ID. The permalinks are like this:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj_alt.small.archive
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct_alt.small.archive
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v_alt.small.archive
```

Now the catastrophic scenario happens, and the Ponymail database is lost!
But we still have `ancient.mbox`, and we want to restore its three emails back
into the `lists.example.org` Ponymail instance. How do we know what manual
List-ID to use? The three emails do not include a List-ID. We made up
`alt.small.archive`, but we did not record this fact and we no longer remember
it. How do we get the List-ID?

The idea behind putting `_alt.small.archive` in the permalinks is that now
we can send a plea to our users asking: "does anybody have any links to an
email that was in the database I just lost?", or use a search engine to try to
find such permalinks.

There are a couple of problems with this:

* If the archive is small, what if *nobody, including search engines*, ever
recorded those links? Our mailing list archives may be unpopular, private, or
hidden from search engines using `robots.txt`.
* Whom do we consult to find those permalinks? In other words, how do we
even know who our users are? For sites that have a community around them, there
may be a straightforward answer to this. But there is not a *general answer* to
this.

Therefore, the only reason to put a manual List-ID in permalinks is to
support an **unreliable backup strategy**. The strategy is unreliable because
it depends on arbitrary users to retain copies of the permalinks that can then
be consulted to restore the data in the case of catastrophic database loss. And
an unreliable backup strategy is made **unacceptable** when there is a reliable
alternative.

One alternative is to rename `ancient.mbox` to
`alt.small.archive.mbox` when we import it. Or we can record the hash of
`ancient.mbox` in a file called `alt.small.archive.mbox-sha3` and keep it
alongside `ancient.mbox`. But those approaches have drawbacks too, e.g. if we
later obtain an mbox file with the same emails in a different order.
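
The reordering drawback can be sketched in a few lines of Python. This is only an illustration: the two messages are made up, and the per-message digest is a stand-in for comparison purposes, not the DKIM-ID generator.

```python
import hashlib

# Two hypothetical messages; the same pair stored in two different orders.
msg_a = b"From a@example.org\nSubject: one\n\nfirst\n\n"
msg_b = b"From b@example.org\nSubject: two\n\nsecond\n\n"
original = msg_a + msg_b
reordered = msg_b + msg_a


def file_hash(data: bytes) -> str:
    """SHA3-256 over the whole file, as in the `.mbox-sha3` idea."""
    return hashlib.sha3_256(data).hexdigest()


def per_message_digests(messages) -> set:
    """Order-independent view: a set of truncated per-message digests."""
    return {hashlib.sha3_256(m).digest()[:16] for m in messages}


# The whole-file hash changes when the messages are reordered...
same_file = file_hash(original) == file_hash(reordered)
# ...but the set of per-message digests does not.
same_set = per_message_digests([msg_a, msg_b]) == per_message_digests([msg_b, msg_a])
print(same_file, same_set)
```

A whole-file hash identifies one particular serialisation of the archive, not the set of emails in it.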

Instead, here is a reliable alternative:

Imagine we import our three emails from `ancient.mbox`, but this time
*without* a manual List-ID in the permalinks. The permalinks are like this:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v
```

When we performed this import, we generated the three DKIM-IDs
`bnbqz6hb4gplpvtz7zmlhymj`, `jnb2msdg3j6of3dco2bw5fct`, and
`oxbrv2q2wbd23vfz5ttfcs7v`. These are each encodings of 16 bytes, for a total
of 48 bytes. In general this is `16 * n` bytes, where `n` is the number of
emails imported. We store these bytes in a file called
`alt.small.archive.dkim-ids`.

We now perform *standard backup procedures* for
`alt.small.archive.dkim-ids`. We replicate it across environments, storing as
many copies as possible in different geographic locations using different
setups. This is easy to do because the file is only *48 bytes* long. We only
need to store *48 bytes*, several times, to have a reliable backup of our
manual List-ID. We can even include the manual List-ID plus line feed at the
start of the file, so that we're not relying on the filename itself.
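
As a rough sketch of what such a file could look like, assuming the layout described above (the List-ID, a line feed, then the raw 16-byte IDs back to back): the function names are hypothetical, and `fake_dkim_id` is a placeholder hash, not the real DKIM-ID generator.

```python
import hashlib
import tempfile
from pathlib import Path

ID_LEN = 16  # bytes per DKIM-ID, as stated in the text


def fake_dkim_id(raw_message: bytes) -> bytes:
    """Placeholder for the real generator; NOT the DKIM-ID algorithm."""
    return hashlib.sha256(raw_message).digest()[:ID_LEN]


def write_backup(path: Path, list_id: str, ids: list) -> None:
    """Write the List-ID, a line feed, then the concatenated raw IDs."""
    path.write_bytes(list_id.encode("utf-8") + b"\n" + b"".join(ids))


def read_backup(path: Path) -> tuple:
    """Return (list_id, set of raw IDs) from a backup file."""
    header, _, body = path.read_bytes().partition(b"\n")
    ids = {body[i:i + ID_LEN] for i in range(0, len(body), ID_LEN)}
    return header.decode("utf-8"), ids


def recover_list_id(backups, messages):
    """Find the List-ID whose backup contains any ID of these messages."""
    wanted = {fake_dkim_id(m) for m in messages}
    for path in backups:
        list_id, ids = read_backup(path)
        if wanted & ids:
            return list_id
    return None


# Hypothetical three-email archive, backed up and then "restored".
messages = [b"email one", b"email two", b"email three"]
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "alt.small.archive.dkim-ids"
    write_backup(backup, "alt.small.archive", [fake_dkim_id(m) for m in messages])
    size = backup.stat().st_size
    recovered = recover_list_id([backup], messages)
print(size, recovered)
```

After a database loss, regenerating the IDs from the mbox file and looking them up in the backup files is enough to recover the manual List-ID, regardless of the order of the emails.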

Does this strategy scale? Consider a very large mailing list that has a
million emails in it. The manual List-ID `.dkim-ids` backup file for such a
list would be `16 * n` bytes long, that is `16 * 1,000,000` or `16,000,000`
bytes. This is only about `15.3 MiB`. As of 2020 it is trivial to widely and
reliably replicate fifteen mebibytes for backup purposes.
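
The arithmetic can be checked directly. This assumes, as above, 16 bytes per DKIM-ID and ignores the optional List-ID line at the start of the file; the function names are only illustrative.

```python
ID_LEN = 16  # bytes per DKIM-ID, as stated in the text


def backup_size_bytes(n_emails: int, id_len: int = ID_LEN) -> int:
    """Size of the raw ID section of a `.dkim-ids` file."""
    return n_emails * id_len


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


small = backup_size_bytes(3)
large = backup_size_bytes(1_000_000)
print(small, large, round(to_mib(large), 2))  # 48 16000000 15.26
```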

What are the problems with this strategy? Unlike the unreliable and
unacceptable backup strategy described above, it does *not* rely on arbitrary
users or search engines to back up our data for us. It does *not* lead to the
problem of wondering whom to consult to restore that data. It follows
established, standard industry practices for backing up our manual List-IDs,
instead of the existing *ad hoc* and *idiosyncratic* method.

For that reason, I could never recommend the strategy where manual List-IDs
are part of the permalinks. I could never recommend that people use it as their
backup strategy, because this superior strategy is available instead and it
ticks all the boxes.

It is, however, sometimes necessary to include the manual List-ID in the URL
somewhere for UI purposes. Consider the email `bnbqz6hb4gplpvtz7zmlhymj` above.
Let's say that as well as appearing in our `ancient.mbox` archive it was also
sent to five other mailing lists, for a total of six, all of which are in the
`lists.example.org` Ponymail instance. What should the UI say if a user visits
the following address?

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
```

The UI doesn't know which List-ID to present to the user. In fact, neither
Ponymail nor Foal even *retains the information* that this message was sent to
six mailing lists if the DKIM-ID is the primary permalink. Making the DKIM-ID
the primary permalink is necessary in Ponymail if DKIM-IDs are used at all; in
Foal it is not necessary, but it is still possible.

This would not be a problem if the manual List-ID were part of the
permalink, but it solves one problem and causes another. DKIM-IDs were designed
to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this
means we would have to store *six* copies of the `bnbqz6hb4gplpvtz7zmlhymj`
metadata, and *six* copies of its source too. But DKIM-IDs were explicitly
designed to prevent this. Therefore we should solve the problem of retaining
manual List-IDs another way. If we added them to the hash input of DKIM-IDs
then we would lose our reliable backup strategy presented above.

Thankfully there is a simple solution.

In Foal commit
<code>[178b729](https://github.com/apache/incubator-ponymail-foal/commit/178b729b9084a83034c0a87f150f23fd2ca48291)</code>,
the multiple ID generators feature was added with the field `permalinks`. To
support multiple manual List-IDs for a DKIM-ID identified message, all that
would be required is to have an analogous field called `lids` for an array of
List-IDs, just like `permalinks` is an array of generated permalink IDs.
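
A minimal sketch of how a `lids` field might behave on import. The plain dict stands in for the real document store, the document shape beyond `permalinks` and `lids` is invented for illustration, and `dev.example.org` is a hypothetical second list.

```python
def import_message(index: dict, dkim_id: str, list_id: str, source: str) -> dict:
    """Store one document per DKIM-ID and accumulate List-IDs in `lids`."""
    doc = index.setdefault(
        dkim_id,
        {"permalinks": [dkim_id], "lids": [], "source": source},
    )
    # Record the list only once, even if the same archive is imported again.
    if list_id not in doc["lids"]:
        doc["lids"].append(list_id)
    return doc


index = {}
for lid in ["alt.small.archive", "dev.example.org", "alt.small.archive"]:
    import_message(index, "bnbqz6hb4gplpvtz7zmlhymj", lid, "raw email source")

# One stored document, two distinct List-IDs.
print(len(index), index["bnbqz6hb4gplpvtz7zmlhymj"]["lids"])
```

The message metadata and source are stored once, while `lids` keeps track of every list the message was seen on.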

Then, if the user browses the URL above:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
```

They can be presented with a list showing all List-IDs that this message
belongs to, and the option to display the message *in its context in those
lists*, specialising its UI. Or, they can still browse a version that contains
the List-ID:

```
https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj
```

But, importantly, `alt.small.archive` is not part of the DKIM-ID here. This
means that messages are still deduplicated even when they appear in multiple
mailing lists.
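
To make the distinction concrete, here is a hypothetical routing helper based only on the two URL shapes shown above. It is not taken from Ponymail or Foal; it just shows that the List-ID is display context while the DKIM-ID alone identifies the message.

```python
from urllib.parse import urlparse


def parse_permalink(url: str) -> tuple:
    """Return (list_id or None, dkim_id) for the two URL shapes above."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "t":
        return None, parts[1]        # /t/<dkim-id>
    if len(parts) == 3 and parts[1] == "t":
        return parts[0], parts[2]    # /<list-id>/t/<dkim-id>
    raise ValueError(f"unrecognised permalink: {url}")


bare = parse_permalink("https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj")
scoped = parse_permalink(
    "https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj"
)
# Both URLs resolve to the same stored message; only the UI context differs.
print(bare, scoped, bare[1] == scoped[1])
```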

To argue that manual List-IDs should be part of DKIM-IDs would remove *all*
of the above. In particular:

* Metadata and sources would be duplicated across mailing lists
* Showing what lists a DKIM-ID appears in would require a prefix search
incompatible with Elasticsearch keyword arguments
* Ponymail administrators would be induced to just rely on their users to
back up the permalinks in case of catastrophic data loss instead of performing
the reliable backup method described above

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
