I finally had some time to look into this. It appears the issue is that
when the server says it has X messages but some of them are errors
(couldn't be read, etc.), the client doesn't realize it actually has
fewer than X messages to process. This causes the parent/child
mass-check client to deadlock: the parent reads off the end of the list,
therefore doesn't send a message to the child, and then waits for the
child to return a result while the child continues to wait for a
message.
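(For clarity, here is a minimal sketch of that pattern in Python rather
than the actual Perl mass-check code; the function and pipe names are
made up for illustration. Calling parent_loop() with an advertised
count larger than the real list reproduces the hang.)

    # Sketch of the deadlock: the parent trusts the server's advertised
    # count, runs off the end of the real list, sends nothing to the
    # child for that slot, then blocks in recv() while the child blocks
    # in its own recv() waiting for work.
    import multiprocessing as mp

    def child_loop(conn):
        while True:
            msg = conn.recv()              # wait for work from the parent
            if msg is None:                # None is the shutdown signal
                break
            conn.send("checked: %s" % msg)

    def parent_loop(advertised_count, messages):
        parent_end, child_end = mp.Pipe()
        worker = mp.Process(target=child_loop, args=(child_end,))
        worker.start()
        for _ in range(advertised_count):  # driven by the server's claim
            msg = messages.pop(0) if messages else None
            if msg is not None:
                parent_end.send(msg)       # nothing sent once the list
                                           # is already exhausted...
            print(parent_end.recv())       # ...but we still wait: deadlock
        parent_end.send(None)
        worker.join()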
Looking through the code, there are several other ways this could
happen beyond just the msg-error entries from the server (lots of
"next" and "last" statements while processing the server's response).
I submitted r721907 to (hopefully) deal with the issue generically.
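(A sketch of the generic guard, under the assumption that the fix
amounts to driving the loop off what the list actually contains and
only blocking for a result when something was really sent; this is my
reading of the idea, not the literal r721907 diff. It reuses the
parent_end pipe from the sketch above.)

    def parent_loop_fixed(messages, parent_end):
        for msg in messages:           # iterate over real entries only
            if msg is None:            # hole (e.g. a msg-error entry):
                continue               # nothing sent, so never wait
            parent_end.send(msg)
            print(parent_end.recv())   # only block when work went out
        parent_end.send(None)          # tell the child to shut down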
Now I need to go through and find out why the server has so many errors
accessing a non-changing corpus. :(
On Fri, Oct 31, 2008 at 12:06:35PM -0400, Theo Van Dinter wrote:
> This week I noticed that my usual run took around 23h to complete, which
> is much (2x?) longer than usual. Poking around, it seems that my second
> machine starts running through its message queue and then stops at some
> point, leaving only the first machine to do the processing.
>
> I'm not going to be able to deal with debugging it for a while, so I
> decided to just turn off the cronjobs for now and take a look in a few
> weeks when I get some time.
--
Randomly Selected Tagline:
"Dad, are you okay? I see food on your plate instead of blurry motions."
- Lisa on the Simpsons, "Husbands and Knives"
