sbp commented on pull request #517:
URL: 
https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691953754


   Imagine we have a small mailing list archive in an mbox file called 
`ancient.mbox`. This mbox archive contains three emails, none of which contains 
a List-ID header. We import it with a manual List-ID, `alt.small.archive`, 
given on the command line, and use the DKIM-ID generator that appends the 
manual List-ID. The permalinks are like this:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj_alt.small.archive
   https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct_alt.small.archive
   https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v_alt.small.archive
   ```
   
   Now the catastrophic scenario happens, and the Ponymail database is lost! 
But we still have `ancient.mbox`, and we want to restore its three emails back 
into the `lists.example.org` Ponymail instance. How do we know what manual 
List-ID to use? The three emails do not include a List-ID. We made up 
`alt.small.archive`, but we did not record this fact and we no longer remember 
it. How do we get the List-ID?
   
   The idea behind putting `_alt.small.archive` in the permalinks is that now 
we can send a plea to our users asking: "does anybody have any links to an 
email that was in the database I just lost?", or use a search engine to try to 
find such permalinks.
   
   There are a couple of problems with this:
   
   * If the archive is small, what if *nobody, including search engines*, ever 
recorded those links? Our mailing list archives may be unpopular, private, or 
hidden from search engines using `robots.txt`.
   * Whom do we consult to find those permalinks? In other words, how do we 
even know who our users are? For sites that have a community around them, there 
may be a straightforward answer to this. But there is not a *general answer* to 
this.
   
   Therefore, the only reason to put a manual List-ID in permalinks is to 
support an **unreliable backup strategy**. The strategy is unreliable because 
it depends on arbitrary users to retain copies of the permalinks that can then 
be consulted to restore the data in the case of catastrophic database loss. And 
an unreliable backup strategy is made **unacceptable** when there is a reliable 
alternative.
   
   One alternative is that we can just rename `ancient.mbox` to 
`alt.small.archive.mbox` when we import it. Or we can record the hash of 
`ancient.mbox` into a file called `alt.small.archive.mbox-sha3` and keep it 
alongside `ancient.mbox`. But those approaches have drawbacks too, e.g. if we 
later obtain an mbox file which is differently ordered, its name or hash no 
longer tells us anything.
   
   Instead, here is a reliable alternative:
   
   Imagine we import our three emails from `ancient.mbox`, but this time 
*without* a manual List-ID in the permalinks. The permalinks are like this:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct
   https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v
   ```
   
   When we performed this import, we generated the three DKIM-IDs 
`bnbqz6hb4gplpvtz7zmlhymj`, `jnb2msdg3j6of3dco2bw5fct`, and 
`oxbrv2q2wbd23vfz5ttfcs7v`. These are each encodings of 16 bytes, for a total 
of 48 bytes. In general this is `16 * n` bytes, where `n` is the number of 
emails imported. We store these bytes in a file called 
`alt.small.archive.dkim-ids`.
   
   We now perform *standard backup procedures* for 
`alt.small.archive.dkim-ids`. We replicate it across environments, storing as 
many copies as possible in different geographic locations using different 
setups. This is easy to do because the file is only *48 bytes* long. We only 
need to store *48 bytes*, several times, to have a reliable backup of our 
manual List-ID. We can even include the manual List-ID plus line feed at the 
start of the file, so that we're not relying on the filename itself.
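
   As a rough illustration of the file described above, here is a minimal 
Python sketch. It assumes each DKIM-ID is already available as 16 raw bytes. 
The helper names (`stand_in_id`, `write_backup`, `read_backup`, 
`recover_list_id`) are hypothetical and not part of Ponymail or Foal, and 
`stand_in_id` is only a placeholder for the real DKIM-ID generator. Sorting the 
IDs and matching by subset are my own assumptions, chosen so that a differently 
ordered mbox still matches.

```python
import hashlib
from pathlib import Path
from typing import List, Optional, Tuple

ID_SIZE = 16  # bytes per DKIM-ID, as stated in the comment


def stand_in_id(message: bytes) -> bytes:
    """Placeholder only: NOT the real DKIM-ID algorithm, just 16 stable bytes."""
    return hashlib.sha256(message).digest()[:ID_SIZE]


def write_backup(path: Path, list_id: str, raw_ids: List[bytes]) -> None:
    """Write the manual List-ID, a line feed, then the 16-byte IDs."""
    if any(len(raw) != ID_SIZE for raw in raw_ids):
        raise ValueError("every DKIM-ID must be exactly 16 bytes")
    # Assumption: sort the IDs so the file does not depend on mbox order.
    body = b"".join(sorted(raw_ids))
    path.write_bytes(list_id.encode("utf-8") + b"\n" + body)


def read_backup(path: Path) -> Tuple[str, List[bytes]]:
    """Return (list_id, ids) from a file written by write_backup."""
    # Split at the first line feed only; the ID bytes may contain 0x0A too.
    header, _, body = path.read_bytes().partition(b"\n")
    if len(body) % ID_SIZE != 0:
        raise ValueError("truncated or corrupt backup file")
    ids = [body[i:i + ID_SIZE] for i in range(0, len(body), ID_SIZE)]
    return header.decode("utf-8"), ids


def recover_list_id(backup_dir: Path, regenerated: List[bytes]) -> Optional[str]:
    """Find the List-ID whose backup contains every regenerated ID."""
    wanted = set(regenerated)
    for path in sorted(backup_dir.glob("*.dkim-ids")):
        list_id, ids = read_backup(path)
        if wanted and wanted <= set(ids):
            return list_id
    return None
```

   After a database loss, the idea would be to regenerate the DKIM-IDs from 
`ancient.mbox`, pass them to `recover_list_id` along with the directory holding 
the replicated `.dkim-ids` files, and read the manual List-ID from the result.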
   
   Does this strategy scale? Consider a very large mailing list that has a 
million emails in it. The manual List-ID `.dkim-ids` backup file for such a 
list would be `16 * n` or `16 * 1,000,000` or `16,000,000` bytes long. This is 
only about `15.26 MiB`. As of 2020 it is trivial to widely and reliably 
replicate fifteen mebibytes for backup purposes.
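
   The arithmetic can be checked with a few lines of Python. The function 
names here are made up for illustration, and the sizes ignore the optional 
List-ID header line.

```python
ID_SIZE = 16  # bytes per DKIM-ID


def backup_size_bytes(n_emails: int) -> int:
    """Size of the raw ID payload: 16 * n bytes."""
    return ID_SIZE * n_emails


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


for n in (3, 1_000_000):
    size = backup_size_bytes(n)
    print(f"{n:,} emails: {size:,} bytes ({to_mib(size):.2f} MiB)")
```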
   
   What are the problems with this strategy? Unlike the unreliable and 
unacceptable backup strategy described above, it does *not* rely on arbitrary 
users or search engines to back up our data for us. It does *not* lead to the 
problem of wondering whom to consult to restore that data. It follows 
established, standard industry practices for backing up our manual List-IDs, 
instead of the existing *ad hoc* and *idiosyncratic* method.
   
   For that reason, I could never recommend the strategy where manual List-IDs 
are part of the permalinks. I could never recommend that people use it as their 
backup strategy, because this superior strategy is available instead and it 
ticks all the boxes.
   
   It is, however, sometimes necessary to include the manual List-ID in the URL 
somewhere for UI purposes. Consider the email `bnbqz6hb4gplpvtz7zmlhymj` above. 
Let's say that as well as appearing in our `ancient.mbox` archive it was also 
sent to five other mailing lists, for a total of six, all of which are in the 
`lists.example.org` Ponymail instance. What should the UI say if a user visits 
the following address?
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   It doesn't know which List-ID to present to the user. In fact, neither 
Ponymail nor Foal even *retains the information* that this message was sent to 
six mailing lists if the DKIM-ID is the primary permalink. Using the DKIM-ID as 
the primary permalink is necessary in Ponymail if DKIM-ID is used at all; in 
Foal it is not necessary, but is still possible.
   
   This would not be a problem if the manual List-ID were part of the 
permalink, but that solves one problem and causes another. DKIM-IDs were designed 
to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this 
means we would have to store *six* copies of the `bnbqz6hb4gplpvtz7zmlhymj` 
metadata, and *six* copies of its source too. But DKIM-IDs were explicitly 
designed to prevent this. Therefore we should solve the problem of retaining 
manual List-IDs another way. If we added them to the hash input of DKIM-IDs 
then we would lose our reliable backup strategy presented above.
   
   Thankfully there is a simple solution.
   
   In Foal commit 
<code>[178b729](https://github.com/apache/incubator-ponymail-foal/commit/178b729b9084a83034c0a87f150f23fd2ca48291)</code>,
 the multiple ID generators feature was added with the field `permalinks`. To 
support multiple manual List-IDs for a DKIM-ID identified message, all that 
would be required is to have an analogous field called `lids` for an array of 
List-IDs, just like `permalinks` is an array of generated permalink IDs.
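
   To make the shape of that proposal concrete, here is a minimal Python 
sketch. Only the field names `permalinks` and `lids` come from the text above. 
The rest of the document layout, the helper names `add_list_id` and 
`lists_query`, and the second List-ID `alt.other.list` are hypothetical. The 
query assumes `lids` is mapped as an Elasticsearch keyword field, which is an 
assumption rather than something Foal is known to do.

```python
from typing import Any, Dict, List


def add_list_id(doc: Dict[str, Any], lid: str) -> Dict[str, Any]:
    """Record that the single stored message was also seen on list `lid`."""
    lids: List[str] = doc.setdefault("lids", [])
    if lid not in lids:  # keep one entry per list, one document per message
        lids.append(lid)
    return doc


def lists_query(lid: str) -> Dict[str, Any]:
    """Exact-match term query body: messages whose `lids` array contains `lid`."""
    return {"query": {"term": {"lids": lid}}}


# One document for the message, however many lists it was sent to.
doc = {
    "permalinks": ["bnbqz6hb4gplpvtz7zmlhymj"],
    "lids": ["alt.small.archive"],
}
add_list_id(doc, "alt.other.list")     # hypothetical second list
add_list_id(doc, "alt.small.archive")  # already present, so nothing changes
```

   The point of the sketch is that importing the same message from another 
list appends to `lids` instead of creating a second copy of the metadata and 
source.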
   
   Then, if the user browses the URL above:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   They can be presented with a list showing all List-IDs that this message 
belongs to, and the option to display the message *in its context in those 
lists*, specialising its UI. Or, they can still browse a version that contains 
the List-ID:
   
   ```
   https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   But, importantly, `alt.small.archive` is not part of the DKIM-ID here. This 
means that messages are still deduplicated even when they appear in multiple 
mailing lists.
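
   A rough sketch of how a front end might tell the two URL forms apart, 
assuming exactly the path layouts shown above. `parse_permalink` is a 
hypothetical helper, not an existing Ponymail or Foal function.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def parse_permalink(url: str) -> Tuple[Optional[str], str]:
    """Return (list_id or None, dkim_id) for the two URL forms shown above."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) == 2 and parts[0] == "t":
        # /t/<dkim-id>: no list context, so the UI can offer every list.
        return None, parts[1]
    if len(parts) == 3 and parts[1] == "t":
        # /<list-id>/t/<dkim-id>: the List-ID is context, not part of the ID.
        return parts[0], parts[2]
    raise ValueError(f"not a recognised permalink: {url}")
```

   In both cases the lookup key is the same DKIM-ID, so the message is stored 
once and the List-ID only selects how it is displayed.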
   
   Making manual List-IDs part of DKIM-IDs would lose *all* of the above. In 
particular:
   
   * Metadata and sources would be duplicated across mailing lists
   * Showing what lists a DKIM-ID appears in would require a prefix search 
incompatible with Elasticsearch keyword arguments
   * Ponymail administrators would be induced to just rely on their users to 
back up the permalinks in case of catastrophic data loss instead of performing 
the reliable backup method described above
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

