Apologies for the late reply - it took me a while to conduct a few more
tests, and I've been busy with other work as well, but here we go.
> On Wed, Jun 18, 2008 at 04:12:11PM +0200, Gergely Nagy wrote:
>> I put up two reports (load, iostat, queue, ps ax) at
>> http://195.70.33.28/~algernon/exim-stuff/
>
> Quite interesting, because I fail to see what should drive the load up
> this far, too.
>
> But: The disk load is not spread well. sda/sdb get quite some write
> operations and might be a limit soon. The rest has a good distribution
> of ops. It might be interesting why the traffic is so different, but
> it is not a problem at this time and likely won't be for quite a while.

Most probably we'll fill the disks sooner.

>> There are quite a few exims stuck in D, indeed, as expected. The
>> question remains, though - why? Why is it taking that long to process
>> a maildir, when my quickly hacked up perl script finished even the
>> largest directory within 10 seconds (and most others in one).
>
> Are they stuck in processing maildirs? Try stracing them to see what
> exactly takes so long. That's most important to me.

As far as I see, yes, they're stuck in processing & updating the
maildirs. From what I've seen, we have a few users with stupidly large
directories, and it takes long seconds to even list them (long, as in
>30 seconds), even when the load is low (all services disabled, nothing
running at all, around 0.01 load).

Now, these particular users happen to receive a lot of mail during
daytime, and as far as I understand, the maildirsize files are opened
O_EXCL. Since the mails arrive faster than exim is able to process the
directories, they get stuck waiting for the lock, bumping the load up
to the sky.

>> However, there's one idea I was thinking of: whether it is possible
>> to give a transport a timeout, so if it does not finish within N
>> seconds, it aborts, logs an error, and will get retried later on?
>
> Bad idea. The started transport will cause page faults and I/O ops for
> nothing if you abort it, thus increasing overall load.
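To illustrate what I believe is going on, here's a rough sketch of the
create-with-O_EXCL lock dance - this is not Exim's actual code (Exim is
C, and its real maildir locking is more involved), just my mental model
of it; the function names and timeouts are made up:

```python
import errno
import os
import time

def acquire_lock(lockfile, timeout=5.0, poll=0.1):
    """Take a maildir-style lock by creating `lockfile` with
    O_CREAT|O_EXCL: whoever successfully creates the file holds the
    lock.  If another delivery already holds it, we sleep and retry -
    exactly the kind of waiting that piles deliveries up in D state
    when the maildir itself is slow to scan."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Atomic create: fails with EEXIST if someone else got
            # there first
            fd = os.open(lockfile,
                         os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            if time.monotonic() >= deadline:
                # Gave up; a real MTA would defer the delivery and
                # retry later
                return False
            time.sleep(poll)

def release_lock(lockfile):
    os.unlink(lockfile)
```

When mail arrives faster than a single maildir scan completes, every
new delivery process spends its life in that sleep-and-retry loop,
which would explain the load going through the roof while nothing
useful happens.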
> One thing that comes to mind: Did you separate the spool from the
> maildirs? That helps a lot.

Yes, I did. The spool is on a separate disk.

> Try putting a number of spool directories on single filesystems, no
> need to use RAID0/dm there, as Exim distributes the load evenly. If
> you experience a bad ops distribution on the maildirs, try creating
> the filesystem with a different group size, best one that is prime to
> the chunk size of the underlying RAID groups, to distribute group
> beginnings among all devices. That helps particularly with rather
> empty filesystems, as they fill, the problem disappears naturally.

At the moment, the spool is on RAID as well, but I'll give it a try
without.

> And finally, not related with your system: Unless you have a good
> reason not to, use as large RAID stripes as you can, if you make use
> of the whole disk anyway. Split operations have a higher cost, and
> small stripes split them more often.

Thanks for the suggestion.

-- 
Gergely Nagy <[EMAIL PROTECTED]>

-- 
## List details at http://lists.exim.org/mailman/listinfo/exim-users
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
