sbp commented on pull request #517: URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679952122
@sebbASF It is not yet documented why the command line list ID would need to be present in the permalink. Am I right in thinking that the following is the only use case?

Consider an mbox archive whose messages contain no list IDs in common with the command line list ID imposed by the administrator. All of its messages in Ponymail are later lost, but the original mbox archive file is still available. Since the messages in Ponymail were lost, the command line list ID is also lost. But since the command line list ID was present in permalinks, if a user of the list still has any such permalink available to them then the command line list ID can be recovered.

I can think of no other use case. There are far better data recovery strategies available. One could, for example, maintain a mapping from each command line list ID to any individual DKIM ID contained only within that list. This is suitable in the case where an entire archive is expected to be recovered. Such a mapping file would be extremely small, on the order of KiB, and would therefore be easily replicated across many systems. If only individual messages are expected to be recovered, then a mapping of command line list IDs to all DKIM IDs would be necessary. This would only require storing sixteen bytes for every email in the system, so even an archive with a million emails would only require a mapping file of about 15 MiB.

Even in the original suboptimal strategy, it is not necessary to make the command line list ID a mandatory part of a permalink.
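As an aside on the mapping file estimate above, the arithmetic can be checked with a short sketch. The helper names `mapping_size_bytes` and `to_mib` are hypothetical, and the figure assumes that only the sixteen-byte DKIM IDs are counted, not the list ID strings or any file framing:

```python
# Rough size check for a mapping of list IDs to all DKIM IDs.
# Assumption: each DKIM ID is stored as 16 raw bytes, and the list ID
# strings and any file framing are small enough to ignore.
DKIM_ID_BYTES = 16


def mapping_size_bytes(num_emails: int, id_bytes: int = DKIM_ID_BYTES) -> int:
    """Return the raw size of the DKIM ID column for num_emails messages."""
    return num_emails * id_bytes


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


size = mapping_size_bytes(1_000_000)
print(f"{size} bytes = {to_mib(size):.2f} MiB")  # 16000000 bytes = 15.26 MiB
```

So a million emails come to roughly 15 MiB, and the cost grows linearly with the number of emails.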
The list ID could instead be made optional, like the labels used in Amazon URLs, some weblog software, and on some news sites, as the following examples demonstrate:

```
https://www.amazon.com/Apache-Definitive-Guide-Ben-Laurie/dp/0596002033
https://www.amazon.com/Anything-Can-Go-Here/dp/0596002033
https://www.amazon.com/dp/0596002033
https://lobste.rs/s/j7p2ow/what_are_you_doing_this_week
https://lobste.rs/s/j7p2ow/anything_can_go_here
https://lobste.rs/s/j7p2ow
https://www.reuters.com/article/apache-moves-on-traffic-server-machine-learning-projects-idUS57202199920100504
https://www.reuters.com/article/anything-can-go-here-idUS57202199920100504
https://www.reuters.com/article/idUS57202199920100504
```

Amazon and Reuters use an infix pattern, whereas Lobsters uses a suffix pattern. Users could strip the Ponymail list ID, whether command line or archive metadata derived, from the permalink:

```
https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/dev.project.apache.org
https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/anything.can.go.here
https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng
```

Or if the malleability of `anything.can.go.here` is undesirable, the UI software could verify that the message actually appears in the list named in the optional part of the URL.

But I think that, as @rbowen noted, the first thing any user wants to do with a URL that's too long to easily share is to shorten it, either by taking out optional components or by submitting it to a link shortener. *Links are themselves UI, and they ought to be designed in a user friendly way.* Links which are too long are not user friendly, and this is why sites use IDs like `0596002033`, `j7p2ow`, and `idUS57202199920100504`, to recapitulate the actual examples mentioned above. They don't use mandatory IDs like `MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org`.
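To illustrate how a suffix-style optional list ID might be handled, here is a minimal sketch. It assumes the `/thread/<id>[/<list-id>]` layout from the example URLs above; `Permalink`, `parse_permalink`, and `canonical` are hypothetical names, not part of Ponymail:

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class Permalink(NamedTuple):
    message_id: str
    list_id: Optional[str]  # None when the optional suffix is absent


def parse_permalink(url: str) -> Permalink:
    """Split a /thread/<id>[/<list-id>] URL into its parts (assumed layout)."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) not in (2, 3) or parts[0] != "thread":
        raise ValueError(f"not a thread permalink: {url!r}")
    list_id = parts[2] if len(parts) == 3 else None
    return Permalink(parts[1], list_id)


def canonical(url: str) -> str:
    """Drop the optional list ID to get the shortest equivalent link."""
    u = urlsplit(url)
    return f"{u.scheme}://{u.netloc}/thread/{parse_permalink(url).message_id}"


for link in (
    "https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/dev.project.apache.org",
    "https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/anything.can.go.here",
    "https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng",
):
    print(canonical(link))
```

All three links resolve to the same message ID. A stricter UI could additionally compare `list_id` against the lists the message actually belongs to before serving the page.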
Even `MTIzNDU2Nzg5MDEyMzQ1Ng` could be regarded as too long, but unlike Amazon, Lobsters, and Reuters we have the constraint that we would like to be able to generate the ID again from the content, which means using a hash, which means considering the hash security; and indeed I provided an informal analysis earlier in this thread.

I would very much like wider review and more discussion of this pull request. I notice, however, that the 40 or so messages, from four contributors, currently in this thread compare rather favourably to the number of messages in the thread of each previous PR on Ponymail:

**0, 0, 0, 5, 2, 3, 0, 2, 1, 0, 3, 3, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 0, 3, 2, 10, 4**

Combined, this is 54 messages across every single PR, merged or unmerged. I count that 17 out of 35 PRs were merged. I also counted the number of participants in the threads of *only the merged* PRs, giving the following figures:

**1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2**

I notice, therefore, that the current PR has already received far more review than any existing PR, that its thread is approaching the combined number of messages of all of them, and that the number of contributors to the thread already surpasses that of every existing merged PR.

Despite this, I repeat the call for wider review. Clearly this is a substantial contribution, and many of the prior PRs were trivial. I would especially like, for example, somebody to audit the behaviour of my algorithm vs the reference algorithm in the `dkimpy` package, and to provide a more formal analysis of the security parameters of the hash.

It is also clear that this PR needs to be modified before it can be accepted.
As I understand it, the following modifications could aid consensus:

* The hash encoding could be converted to base64
* The hash digest length could be 128 bits, encoded as 22 characters
* The pepper mechanism should be removed
* The command line List ID should not be added to the message before hashing
* The algorithm could potentially also be renamed

It would also be useful if objecting participants would *concisely* state all of their remaining objections.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
