Re: [Nmh-workers] Large MH directories

Ken Hornstein Sat, 08 Apr 2017 07:58:57 -0700

>Is this a bad idea? I'm mainly concerned about speed: Other than like
>pick and scan on the whole folder, will anything else get very slow? I'm
>using Fast File System on OpenBSD.


We've been asked about this before; unfortunately, with the current (n)mh
design, it's just going to suck.

A bit of an explanation: right now, pretty much every nmh program that
wants to interpret a folder calls folder_read(). folder_read() does
a readdir() on the ENTIRE folder.  It does NOT do a stat() on every
file, but if you have 600k files in a single directory, that's just
going to take some time.  I think that's going to be true in most Unix
operating systems today; that's just a crazy number of files for a
single directory, and pretty much all advice I've ever seen about
this problem is, "Don't do that".
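
To make the cost concrete, here is a minimal sketch of that kind of scan. It is NOT the actual folder_read() code; struct folder_summary, msgnum_from_name() and summarize_folder() are hypothetical names, and the only assumption is that messages are files whose names are positive decimal numbers. Note that it calls readdir() once per entry and never calls stat():

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical summary type for illustration; not nmh's real structure. */
struct folder_summary {
    long lowest;   /* lowest message number seen, 0 if none */
    long highest;  /* highest message number seen, 0 if none */
    long count;    /* number of message files seen */
};

/* Return the message number encoded in a file name, or 0 if the name
 * is not a positive decimal integer (e.g. ".mh_sequences", ",12"). */
static long msgnum_from_name(const char *name)
{
    char *end;
    long n;

    if (name[0] < '1' || name[0] > '9')
        return 0;
    errno = 0;
    n = strtol(name, &end, 10);
    if (errno != 0 || *end != '\0')
        return 0;
    return n;
}

/* Walk every entry in the directory; there is no way to learn the
 * lowest/highest/count without visiting all of them.
 * Returns 0 on success, -1 if the directory cannot be opened. */
int summarize_folder(const char *path, struct folder_summary *out)
{
    DIR *dir;
    struct dirent *entry;

    assert(path != NULL && out != NULL);
    out->lowest = out->highest = out->count = 0;

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        long n = msgnum_from_name(entry->d_name);
        if (n == 0)
            continue;               /* not a message file */
        if (out->count == 0 || n < out->lowest)
            out->lowest = n;
        if (n > out->highest)
            out->highest = n;
        out->count++;
    }
    closedir(dir);
    return 0;
}
```

Even in this stripped-down form the loop is linear in the number of directory entries, which is where the time goes with 600k files.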

The bottleneck there for us seems to be readdir(); the buffer there is
relatively small when it calls getdirentries() (or the equivalent), and
you end up calling it a huge number of times for large folders.  You CAN
call getdirentries() directly with a larger buffer, but that is really
non-portable and hard to do (it is not part of POSIX, and everybody does
their own thing there).

So, is it possible, like Ralph suggested, to do the readdir() calls lazily?
Short answer is: "maybe, but it would be hard".

The current API exposes all of the folder structure information which
contains things like the lowest message number, the highest message
number, and the message count; you can't really determine that without
reading the whole folder.  Okay, cleaning up the API is on the list,
so that's not insurmountable.  But doing that in a way that yields real
gains involves some design changes.  Right now whenever the
sequence file is touched (and pretty much every nmh program does that
by default), it compares that to the message list (something ELSE you
can only get by reading the entire directory) and silently cleans up
sequence entries that no longer exist.  That is something we could relax
a bit, but that would mean anything that wants to manipulate sequences
would either require "fresh" information, or we'd have to tell people
that information could be stale.
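
Roughly, that cleanup amounts to intersecting each sequence with the message list. A minimal sketch, assuming both are available as sorted arrays of message numbers without duplicates (prune_sequence() is a hypothetical name, and nmh's real sequence handling is represented differently):

```c
#include <assert.h>
#include <stddef.h>

/* Drop sequence members that are not in the message list.
 * Both arrays must be sorted ascending with no duplicates.
 * The sequence is compacted in place; the new length is returned. */
size_t prune_sequence(long *seq, size_t seq_len,
                      const long *msgs, size_t msgs_len)
{
    size_t kept = 0, i = 0, j = 0;

    assert(seq != NULL || seq_len == 0);
    assert(msgs != NULL || msgs_len == 0);

    while (i < seq_len && j < msgs_len) {
        if (seq[i] < msgs[j]) {
            i++;                    /* message no longer exists: drop it */
        } else if (seq[i] > msgs[j]) {
            j++;                    /* message is not in this sequence */
        } else {
            seq[kept++] = seq[i];   /* still present: keep it */
            i++;
            j++;
        }
    }
    return kept;
}
```

The point is the second argument: without the full message list there is nothing to intersect against, so skipping the directory read means skipping (or deferring) this cleanup.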

Also, the message list is used to tell if a message number is valid
or not.  So we'd have to also change things in terms of checking for
message number validity.  Again, this is not insurmountable, but it is
a fair amount of work.
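
One way a lazy validity check COULD look, purely as a sketch: probe the single file for that message number instead of consulting the full list. message_exists() is a hypothetical helper, not something in nmh today, and it assumes a message is a regular file named by its number inside the folder directory:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

enum { PATH_BUF_SIZE = 4096 };  /* arbitrary buffer size for this sketch */

/* Return 1 if folder/msgnum exists and is a regular file, else 0.
 * This costs one stat() instead of a scan of the whole directory. */
int message_exists(const char *folder, long msgnum)
{
    char path[PATH_BUF_SIZE];
    struct stat st;
    int len;

    assert(folder != NULL);
    if (msgnum <= 0)
        return 0;                   /* message numbers start at 1 */

    len = snprintf(path, sizeof path, "%s/%ld", folder, msgnum);
    if (len < 0 || (size_t)len >= sizeof path)
        return 0;                   /* path did not fit in the buffer */

    if (stat(path, &st) != 0)
        return 0;                   /* missing or inaccessible */
    return S_ISREG(st.st_mode) ? 1 : 0;
}
```

This answers "is message N valid?" cheaply, but it cannot answer "what is the next message after N?", which is why it only helps some programs.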

In terms of programs which would see a gain ... hmm.  I think it would
be limited to programs which operate on a single message and do
not modify the folder.  So, show(1), comp(1) (are you using Fcc to a
large folder?  Then no), repl(1) (let's assume your draft folder isn't
huge), anno(1) ... I am trying to think of what other programs would see
a gain.  next(1)/prev(1) would NOT, because they would need to know the
next message number.  Maybe you could get away with rmm(1) not reading
the whole folder.  If you're using mark(1) to list a sequence, can you
live with it returning stale information?  If you ADD messages to a
sequence do you want to make sure they are valid message numbers?

So, it gets complicated, and the gains would not be shared across all
programs (or maybe even across different options of the same program).  I
think it is POSSIBLE to get some speedups here.  But it's going to take
some careful thought.  Interestingly enough, direct IMAP support in nmh
would be a huge win here, because the server would take care of a lot of
this for us.

I welcome any thoughts here, and any code contributions would be even
MORE welcome!

--Ken

_______________________________________________
Nmh-workers mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/nmh-workers
