So, I made some improvements to imaptest and I decided to really stress this a bit. I created a Gmail account and started uploading the Enron corpus to it. I didn't quite get it all (it looks like Gmail closed the connection after 4 days or so), but it was almost all. BTW, it turns out it takes on average 0.66 seconds to append a message to a Gmail mailbox, so you can imagine how long it took to get most of the corpus uploaded. To make the test more extreme, I put everything in one mailbox.
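As a rough sanity check on that upload time, here is the arithmetic as a small Python sketch. It assumes the 480832 messages that the SELECT response reports further down were all appended at the 0.66-second average over a single connection; the variable names are just for illustration.

```python
# Back-of-the-envelope upload time; both inputs are taken from this message.
SECONDS_PER_APPEND = 0.66   # average APPEND time quoted above
MESSAGES = 480832           # EXISTS count from the SELECT response below

total_seconds = SECONDS_PER_APPEND * MESSAGES
total_days = total_seconds / 86400

print(f"{total_seconds:.0f} s = {total_seconds / 3600:.1f} h = {total_days:.2f} days")
```

That comes out a bit under 3.7 days of pure APPEND time, which is consistent with the connection being closed after roughly 4 days.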
Some things that popped out:

- Connection & authentication is very quick (I am using XOAUTH2):

    Connect time: 0.014068 sec
    TLS negotation time: 0.034776 sec
    Authentication time: 0.111667 sec

- However, accessing the mailbox, not so much:

    (tls-encrypted) => A2 SELECT "Enron"
    (tls-decrypted) <= * FLAGS (\Answered \Flagged \Draft \Deleted \Seen $NotPhishing $Phishing fart)
    (tls-decrypted) <= * OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen $NotPhishing $Phishing fart \*)] Flags permitted.
    (tls-decrypted) <= * OK [UIDVALIDITY 3517] UIDs valid.
    (tls-decrypted) <= * 480832 EXISTS
    (tls-decrypted) <= * 0 RECENT
    (tls-decrypted) <= * OK [UIDNEXT 480833] Predicted next UID.
    (tls-decrypted) <= * OK [HIGHESTMODSEQ 11396890]
    (tls-decrypted) <= A2 OK [READ-WRITE] Enron selected. (Success)
    Command (SELECT) execution time: 4.457428 sec

I have no idea how long it would have taken nmh to do a readdir() on a directory with that many files.

Strangely, the speed of adding messages to that mailbox did not seem to depend on the number of messages already in it, but it did vary with the time of day.

Performing a scan equivalent on that many messages kind of bogs down also:

    % imaptest +Enron 'FETCH 1:5000 (FLAGS RFC822.SIZE BODY.PEEK[HEADER.FIELDS (FROM TO SUBJECT DATE)] BODY.PEEK[TEXT]<0.80>)' -timestamp
    Connect time: 0.013550 sec
    TLS negotation time: 0.025930 sec
    Command (CAPABILITY) execution time: 0.012074 sec
    Command (AUTHENTICATE) execution time: 0.034879 sec
    Authentication time: 0.104827 sec
    Command (SELECT) execution time: 4.475118 sec
    Command (FETCH) execution time: 44.801250 sec
    Total command execution time: 49.276434 sec
    Command (LOGOUT) execution time: 0.015691 sec
    Total elapsed time: 49.410705 sec

Compared to the performance of the Cyrus-SASL archives, that's kind of disappointing. But this mailbox is 40x bigger, so maybe that's the issue.
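For anyone who wants to repeat the SELECT and FETCH timings without imaptest, here is a minimal sketch using Python's standard imaplib. It is not how imaptest works internally. The helper names (xoauth2_string, timed, scan_equivalent) are made up for this example, it assumes you already have a valid OAuth2 access token, and its wall-clock numbers include client-side response parsing, so they will not match imaptest's figures exactly.

```python
import imaplib
import time

# Fetch items copied from the imaptest command line above.
SCAN_ITEMS = (
    "(FLAGS RFC822.SIZE "
    "BODY.PEEK[HEADER.FIELDS (FROM TO SUBJECT DATE)] "
    "BODY.PEEK[TEXT]<0.80>)"
)


def xoauth2_string(user, access_token):
    """Build the XOAUTH2 initial response; imaplib base64-encodes it for us."""
    return f"user={user}\x01auth=Bearer {access_token}\x01\x01".encode()


def timed(label, func, *args):
    """Call func(*args), print the elapsed wall-clock time, return the result."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"Command ({label}) execution time: {elapsed:.6f} sec")
    return result


def scan_equivalent(user, access_token, mailbox="Enron",
                    message_set="1:5000", host="imap.gmail.com"):
    """Time SELECT and a scan-style FETCH. Needs network access and a token."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.authenticate(
            "XOAUTH2", lambda _challenge: xoauth2_string(user, access_token)
        )
        timed("SELECT", conn.select, mailbox)
        status, data = timed("FETCH", conn.fetch, message_set, SCAN_ITEMS)
        return status, len(data)
    finally:
        conn.logout()
```

Calling scan_equivalent("you@example.com", token) with a real account and token should print one timing line each for SELECT and FETCH; nothing runs against the network until that function is called.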
Okay, so a smaller mailbox does a lot better:

    CREATE Enron2
    COPY 1:10426 Enron2 (approximately 43 seconds)

    (tls-encrypted) => A2 SELECT "Enron2"
    (tls-decrypted) <= * FLAGS (\Answered \Flagged \Draft \Deleted \Seen $NotPhishing $Phishing fart)
    (tls-decrypted) <= * OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted \Seen $NotPhishing $Phishing fart \*)] Flags permitted.
    (tls-decrypted) <= * OK [UIDVALIDITY 3518] UIDs valid.
    (tls-decrypted) <= * 10426 EXISTS
    (tls-decrypted) <= * 0 RECENT
    (tls-decrypted) <= * OK [UIDNEXT 10427] Predicted next UID.
    (tls-decrypted) <= * OK [HIGHESTMODSEQ 11407756]
    (tls-decrypted) <= A2 OK [READ-WRITE] Enron2 selected. (Success)
    Command (SELECT) execution time: 0.121310 sec

But, still not great:

    % imaptest +Enron2 'FETCH 1:* (FLAGS RFC822.SIZE BODY.PEEK[HEADER.FIELDS (FROM TO SUBJECT DATE)] BODY.PEEK[TEXT]<0.80>)' -timestamp
    Connect time: 0.012951 sec
    TLS negotation time: 0.023604 sec
    Command (CAPABILITY) execution time: 0.010854 sec
    Command (AUTHENTICATE) execution time: 0.028128 sec
    Authentication time: 0.091857 sec
    Command (SELECT) execution time: 0.108625 sec
    Command (FETCH) execution time: 88.653104 sec
    Total command execution time: 88.761790 sec
    Command (LOGOUT) execution time: 0.014169 sec
    Total elapsed time: 88.880982 sec

Gimap is really in the crapper here. Creating a new folder took me 3 seconds, so I wonder if there is some global index that needs traversing.

Some operations don't scale linearly. If we use user flags as sequences (Gimap supports arbitrary flags), we get:

    +Enron 'STORE 1:10000 +FLAGS.SILENT (fart)' -snoop -timestamp
    (tls-encrypted) => A3 STORE 1:10000 +FLAGS.SILENT (fart)
    (tls-decrypted) <= A3 OK Success
    Command (STORE) execution time: 2.818275 sec

You would think that 1:100000 would take 28-30 seconds, right? But no. It exceeds the timeout limit (60 seconds by default). If I don't use .SILENT, 1:100000 takes 574 seconds.
But, interestingly enough, if I run it again on the first 100000 messages, it takes 29 seconds; maybe that's due to not having to change the flags? More research needed.

Hm, I get this on the whole folder:

    (tls-encrypted) => A3 STORE 1:* -FLAGS.SILENT (fart)
    (tls-decrypted) <= A3 OK Success
    Command (STORE) execution time: 874.630423 sec

Linear scaling would suggest that it really should be closer to 130 seconds. Ahhh ... I think the key there is the database update. A second run is closer to where it should be:

    (tls-encrypted) => A3 STORE 1:* -FLAGS.SILENT (fart)
    (tls-decrypted) <= A3 OK Success
    Command (STORE) execution time: 147.341493 sec

So if we do the first command again, making sure that flag is cleared, we get:

    (tls-encrypted) => A3 STORE 1:10000 +FLAGS.SILENT (fart)
    (tls-decrypted) <= A3 OK Success
    Command (STORE) execution time: 77.693338 sec

- But ... where things are a win is here (on the original "Enron" folder):

    (tls-encrypted) => A3 SEARCH TEXT "corruption"
    (tls-decrypted) <= * SEARCH [... whole lot of entries ...]
    (tls-decrypted) <= A3 OK SEARCH completed (Success)
    Command (SEARCH) execution time: 0.136124 sec

I doubt we could ever achieve that kind of performance on that many messages, and I guess this makes it clear where Google is putting their energy.

I'm not sure where mailbox size becomes a problem. I was planning on uploading the corpus using the original folder structure and checking out how easy that is to manage. One thing I did notice is that, at least on Gmail, folder creation time slows down roughly in proportion to how many folders you have; creating 3500 folders takes a bit of time.

So, this particular torture test didn't have as amazing results as I had hoped. But how would it compare to the same corpus on a disk? And what are "typical" operations? Do people really want to scan(1) a folder with a half-million messages in it? Or do they really want to run "pick" on it and only look at a few?
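For reference, the linear-scaling estimates used for the STORE timings can be written out as a short Python sketch. It takes the first STORE 1:10000 run (2.818275 sec) as the baseline and assumes cost grows in direct proportion to the number of messages touched, which is exactly the assumption the observed timings call into question. The function and variable names are only illustrative.

```python
# Baseline: the first STORE 1:10000 +FLAGS.SILENT run on "Enron".
BASELINE_SECONDS = 2.818275
BASELINE_MESSAGES = 10000
MAILBOX_MESSAGES = 480832   # EXISTS count for "Enron"


def linear_estimate(messages):
    """Expected STORE time if cost were strictly proportional to message count."""
    return BASELINE_SECONDS * messages / BASELINE_MESSAGES


# (description, messages touched, observed seconds), all from the runs above.
observed = [
    ("STORE 1:100000 without .SILENT", 100000, 574.0),
    ("STORE on first 100000, repeat run", 100000, 29.0),
    ("STORE 1:* -FLAGS.SILENT, first run", MAILBOX_MESSAGES, 874.630423),
    ("STORE 1:* -FLAGS.SILENT, second run", MAILBOX_MESSAGES, 147.341493),
    ("STORE 1:10000 +FLAGS.SILENT, flag cleared first", BASELINE_MESSAGES, 77.693338),
]

for label, messages, seconds in observed:
    expected = linear_estimate(messages)
    print(f"{label}: observed {seconds:.1f} s, "
          f"linear estimate {expected:.1f} s, ratio {seconds / expected:.1f}x")
```

The whole-folder estimate comes out at about 135.5 seconds, and the second 1:* run (147 seconds) lands within roughly 10% of it, while the runs that actually had to change flag state are several times slower.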
--Ken
