sbp commented on pull request #517: URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691953754
Imagine we have a small mailing list archive in an mbox file called `ancient.mbox`. This mbox archive contains three emails, none of which contain a List-ID header. We import it using the manual command line List-ID `alt.small.archive`, and use the DKIM-ID generator that appends the manual List-ID. The permalinks are like this:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj_alt.small.archive
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct_alt.small.archive
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v_alt.small.archive
```

Now the catastrophic scenario happens, and the Ponymail database is lost! But we still have `ancient.mbox`, and we want to restore its three emails back into the `lists.example.org` Ponymail instance.

How do we know what manual List-ID to use? The three emails do not include a List-ID. We made up `alt.small.archive`, but we did not record this fact and we no longer remember it. How do we get the List-ID?

The idea behind putting `_alt.small.archive` in the permalinks is that now we can send a plea to our users asking: "does anybody have any links to an email that was in the database I just lost?", or use a search engine to try to find such permalinks. There are a couple of problems with this:

* If the archive is small, what if *nobody, including search engines*, ever recorded those links? Our mailing list archives may be unpopular, private, or hidden from search engines using `robots.txt`.
* Whom do we consult to find those permalinks? In other words, how do we even know who our users are? For sites that have a community around them, there may be a straightforward answer to this. But there is not a *general answer* to this.

Therefore, the only reason to put a manual List-ID in permalinks is to support an **unreliable backup strategy**. The strategy is unreliable because it depends on arbitrary users to retain copies of the permalinks that can then be consulted to restore the data in the case of catastrophic database loss.
And an unreliable backup strategy is made **unacceptable** when there is a reliable alternative.

One alternative is that we can just rename `ancient.mbox` to `alt.small.archive.mbox` when we import it. Or we can record the hash of `ancient.mbox` into a file called `alt.small.archive.mbox-sha3` and keep it alongside `ancient.mbox`. But those approaches have drawbacks too, e.g. if we later obtain an mbox file which is differently ordered, its hash will no longer match.

Instead, here is a reliable alternative:

Imagine we import our three emails from `ancient.mbox`, but this time *without* a manual List-ID in the permalinks. The permalinks are like this:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct
https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v
```

When we performed this import, we generated the three DKIM-IDs `bnbqz6hb4gplpvtz7zmlhymj`, `jnb2msdg3j6of3dco2bw5fct`, and `oxbrv2q2wbd23vfz5ttfcs7v`. These are each encodings of 16 bytes, for a total of 48 bytes. In general this is `16 * n` bytes, where `n` is the number of emails imported. We store these bytes in a file called `alt.small.archive.dkim-ids`.

We now perform *standard backup procedures* for `alt.small.archive.dkim-ids`. We replicate it across environments, storing as many copies as possible in different geographic locations using different setups. This is easy to do because the file is only *48 bytes* long. We only need to store *48 bytes*, several times, to have a reliable backup of our manual List-ID. We can even include the manual List-ID plus line feed at the start of the file, so that we're not relying on the filename itself.

Does this strategy scale? Consider a very large mailing list that has a million emails in it. The manual List-ID `.dkim-ids` backup file for such a list would be `16 * n` or `16 * 1,000,000` or `16,000,000` bytes long. This is only about `15.26 MiB`. As of 2020 it is trivial to widely and reliably replicate fifteen mebibytes for backup purposes.
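
To make the `.dkim-ids` idea concrete, here is a minimal sketch of one possible layout: the manual List-ID, a line feed, then the raw 16-byte IDs concatenated. The function names, the exact layout, and the stand-in ID values are assumptions for illustration only; nothing here is taken from Ponymail or Foal, and real IDs would come from the archiver's DKIM-ID generator.

```python
import hashlib
from typing import Iterable, List, Optional, Tuple

ID_SIZE = 16  # bytes per DKIM-ID, as stated in the comment above


def pack_dkim_ids(list_id: str, ids: List[bytes]) -> bytes:
    """Serialise a manual List-ID and raw DKIM-ID bytes into one blob."""
    if "\n" in list_id:
        raise ValueError("List-ID must not contain a line feed")
    for raw in ids:
        if len(raw) != ID_SIZE:
            raise ValueError(f"expected {ID_SIZE} bytes, got {len(raw)}")
    # Hypothetical layout: List-ID, line feed, then 16 * n bytes of IDs.
    return list_id.encode("utf-8") + b"\n" + b"".join(ids)


def unpack_dkim_ids(blob: bytes) -> Tuple[str, List[bytes]]:
    """Split a blob back into its List-ID and its list of raw IDs."""
    header, sep, body = blob.partition(b"\n")
    if not sep:
        raise ValueError("missing List-ID header line")
    if len(body) % ID_SIZE:
        raise ValueError("ID section is not a multiple of 16 bytes")
    ids = [body[i:i + ID_SIZE] for i in range(0, len(body), ID_SIZE)]
    return header.decode("utf-8"), ids


def recover_list_id(blob: bytes, regenerated: Iterable[bytes]) -> Optional[str]:
    """Return the List-ID if every regenerated ID appears in the backup.

    Comparing sets means the order of emails in the mbox does not matter.
    """
    list_id, stored = unpack_dkim_ids(blob)
    wanted = set(regenerated)
    return list_id if wanted and wanted <= set(stored) else None


# Stand-in 16-byte values; these are NOT real DKIM-IDs.
ids = [hashlib.sha3_256(b"email %d" % i).digest()[:ID_SIZE] for i in range(3)]
blob = pack_dkim_ids("alt.small.archive", ids)
print(len(blob))  # 66: 18 header bytes + 3 * 16 ID bytes
print(recover_list_id(blob, reversed(ids)))  # alt.small.archive
```

After a database loss, the IDs regenerated from `ancient.mbox` can be checked against each backup blob in this way, and a match yields the forgotten manual List-ID even if the mbox is ordered differently.
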

What are the problems with this strategy? Unlike the unreliable and unacceptable backup strategy described above, it does *not* rely on arbitrary users or search engines to back up our data for us. It does *not* lead to the problem of wondering whom to consult to restore that data. It follows established, standard industry practices for backing up our manual List-IDs, instead of the existing *ad hoc* and *idiosyncratic* method.

For that reason, I could never recommend the strategy where manual List-IDs are part of the permalinks. I could never recommend that people use it as their backup strategy, because this superior strategy is available instead and it ticks all the boxes.

It is, however, sometimes necessary to include the manual List-ID in the URL somewhere for UI purposes. Consider the email `bnbqz6hb4gplpvtz7zmlhymj` above. Let's say that as well as appearing in our `ancient.mbox` archive it was also sent to five other mailing lists, for a total of six, all of which are in the `lists.example.org` Ponymail instance. What should the UI say if a user visits the following address?

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
```

It doesn't know which List-ID to present to the user. In fact, neither Ponymail nor Foal even *retains the information* that this message was sent to six mailing lists if the DKIM-ID is the primary permalink. Making it the primary permalink is necessary in Ponymail if DKIM-ID is used at all; in Foal it is not necessary, but it is still possible.

This would not be a problem if the manual List-ID were part of the permalink, but that would solve one problem and cause another. DKIM-IDs were designed to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this means we would have to store *six* copies of the `bnbqz6hb4gplpvtz7zmlhymj` metadata, and *six* copies of its source too. But DKIM-IDs were explicitly designed to prevent this.

Therefore we should solve the problem of retaining manual List-IDs another way.
If we added them to the hash input of DKIM-IDs then we would lose our reliable backup strategy presented above. Thankfully there is a simple solution.

In Foal commit <code>[178b729](https://github.com/apache/incubator-ponymail-foal/commit/178b729b9084a83034c0a87f150f23fd2ca48291)</code>, the multiple ID generators feature was added with the field `permalinks`. To support multiple manual List-IDs for a DKIM-ID identified message, all that would be required is to have an analogous field called `lids` for an array of List-IDs, just like `permalinks` is an array of generated permalink IDs. Then, if the user browses the URL above:

```
https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
```

they can be presented with a list showing all List-IDs that this message belongs to, and the option to display the message *in its context in those lists*, specialising its UI. Or, they can still browse a version that contains the List-ID:

```
https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj
```

But, importantly, `alt.small.archive` is not part of the DKIM-ID here. This means that messages are still deduplicated even when they appear in multiple mailing lists.

Making manual List-IDs part of DKIM-IDs would forfeit *all* of the above. In particular:

* Metadata and sources would be duplicated across mailing lists
* Showing which lists a DKIM-ID appears in would require a prefix search incompatible with Elasticsearch keyword arguments
* Ponymail administrators would be induced to just rely on their users to back up the permalinks in case of catastrophic data loss instead of performing the reliable backup method described above
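
As a rough illustration of the proposed `lids` field, the sketch below keeps one document per DKIM-ID and records every list the message was sent to in an array. The in-memory dict, the function names, the document shape, and the second list name `alt.other.list` are assumptions made up for this example; they are not Foal's actual schema or API.

```python
from typing import Dict, List


def add_message(index: Dict[str, dict], dkim_id: str, lid: str, source: bytes) -> None:
    """Store a message once per DKIM-ID and remember each List-ID it came from."""
    doc = index.setdefault(
        dkim_id,
        {"permalinks": [dkim_id], "lids": [], "source": source},
    )
    # The source and metadata are kept once; only the List-ID array grows.
    if lid not in doc["lids"]:
        doc["lids"].append(lid)


def context_urls(index: Dict[str, dict], base: str, dkim_id: str) -> List[str]:
    """Build one per-list URL for each List-ID the message belongs to."""
    return [f"{base}/{lid}/t/{dkim_id}" for lid in index[dkim_id]["lids"]]


index: Dict[str, dict] = {}
add_message(index, "bnbqz6hb4gplpvtz7zmlhymj", "alt.small.archive", b"raw email")
add_message(index, "bnbqz6hb4gplpvtz7zmlhymj", "alt.other.list", b"raw email")

print(len(index))  # 1: still a single deduplicated document
for url in context_urls(index, "https://lists.example.org", "bnbqz6hb4gplpvtz7zmlhymj"):
    print(url)
```

Because the List-ID lives in the document rather than in the ID, looking up which lists a message belongs to is an exact lookup on the DKIM-ID, with no prefix search needed.
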
