>Is this a bad idea? I'm mainly concerned about speed: Other than like
>pick and scan on the whole folder, will anything else get very slow? I'm
>using Fast File System on OpenBSD.

We've been asked about this before; unfortunately, with the current (n)mh design, it's just going to suck.

A bit of an explanation: right now, pretty much every nmh program that wants to interpret a folder calls folder_read(). folder_read() does a readdir() on the ENTIRE folder. It does NOT do a stat() on every file, but if you have 600k files in a single directory, that's just going to take some time. I think that's going to be true in most Unix operating systems today; that's just a crazy number of files for a single directory, and pretty much all advice I've ever seen about this problem is, "Don't do that".

The bottleneck there for us seems to be readdir(); the buffer it uses is relatively small when it calls getdirentries() (or the equivalent), and you end up calling it a huge number of times for large folders. You CAN call getdirentries() directly with a larger buffer, but that is really non-portable and hard to do (it is not part of POSIX, and everybody does their own thing there).

So, is it possible, like Ralph suggested, to lazily do the readdir() calls? Short answer is: "maybe, but it would be hard". The current API exposes all of the folder structure information, which contains things like the lowest message number, the highest message number, and the message count; you can't really determine those without reading the whole folder. Okay, cleaning up the API is on the list, so that's not insurmountable.

But doing that in a way that actually yields some gains involves some design changes. Right now, whenever the sequence file is touched (and pretty much every nmh program does that by default), it compares that to the message list (something ELSE you can only get by reading the entire directory) and silently cleans up sequence entries that no longer exist. That is something we could relax a bit, but that would mean anything that wants to manipulate sequences would either require "fresh" information, or we'd have to tell people that information could be stale.
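
To make the cost concrete, here is a rough sketch of why both the folder summary and the sequence cleanup need a full directory pass. It is written in Python purely for illustration (nmh itself is C, and this is not nmh code). The names `scan_folder` and `prune_sequence` are made up, and the rule for what counts as a message (an all-digit file name with no leading zero) is a simplifying assumption.

```python
import os


def scan_folder(path):
    """Return (lowest, highest, count, numbers) for an MH-style folder.

    Assumption for this sketch: a message is any directory entry whose
    name is all ASCII digits with no leading zero. Everything else
    (sequence file, subfolders, stray files) is skipped. Only the entry
    name is used, so no per-file stat() is needed.
    """
    numbers = set()
    # os.scandir() iterates the directory via readdir(); every entry has
    # to be visited before lowest/highest/count are known.
    with os.scandir(path) as entries:
        for entry in entries:
            name = entry.name
            if name.isascii() and name.isdigit() and name[0] != "0":
                numbers.add(int(name))
    if not numbers:
        return 0, 0, 0, numbers
    return min(numbers), max(numbers), len(numbers), numbers


def prune_sequence(sequence, numbers):
    """Drop sequence members that no longer exist as messages.

    `numbers` is the full message set from scan_folder(), which is why
    this cleanup cannot be done without the directory pass.
    """
    return [n for n in sequence if n in numbers]
```

With 600k entries, the loop in `scan_folder` runs 600k times no matter which single message the caller cares about, and `prune_sequence` cannot run until that loop has finished.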
Also, the message list is used to tell if a message number is valid or not. So we'd have to also change things in terms of checking for message number validity. Again, this is not insurmountable, but it is a fair amount of work.

In terms of programs which would see a gain ... hmm. I think it would be limited to programs which operate on a single message and do not modify the folder. So, show(1), comp(1) (are you using Fcc to a large folder? Then no), repl(1) (let's assume your draft folder isn't huge), anno(1) ... I am trying to think of what other programs would see a gain. next(1)/prev(1) would NOT, because they would need to know the next (or previous) message number. Maybe you could get away with rmm(1) not reading the whole folder. If you're using mark(1) to list a sequence, can you live with it returning stale information? If you ADD messages to a sequence, do you want to make sure they are valid message numbers?

So, it gets complicated, and the gains would not be shared across all programs (or maybe, between different options of the same program). I think it is POSSIBLE to get some speedups here. But it's going to take some careful thought.

Interestingly enough, direct IMAP support in nmh would be a huge win here, because the server would take care of a lot of this for us.

I welcome any thoughts here, and any code contributions would be even MORE welcome!

--Ken

_______________________________________________
Nmh-workers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/nmh-workers
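
As a rough illustration of the single-message case discussed above, the sketch below contrasts checking one message number with finding the next one. It is Python for illustration only, not nmh code. `message_exists` and `next_message` are hypothetical names, and treating all-digit file names without a leading zero as messages is a simplifying assumption.

```python
import os


def message_exists(folder, number):
    """Check one message number with a single stat(); no directory read."""
    return os.path.isfile(os.path.join(folder, str(number)))


def next_message(folder, current):
    """Return the smallest message number above `current`, or None.

    Short of probing candidate numbers one by one, the next number is
    only known after listing the directory, so this visits every entry.
    """
    best = None
    with os.scandir(folder) as entries:
        for entry in entries:
            name = entry.name
            if name.isascii() and name.isdigit() and name[0] != "0":
                n = int(name)
                if n > current and (best is None or n < best):
                    best = n
    return best
```

A show(1)-style lookup maps onto `message_exists`, which stays cheap regardless of folder size. A next(1)-style lookup maps onto `next_message`, which costs the same as a full folder read.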
