Troubleshooting multiprocess communication

dormando Fri, 21 Mar 2008 10:43:51 -0700

Hey all,

I'm completely sucking at finding a bug. Allow me to talk at you in hopes that someone knows the answer offhand.


MogileFS::Worker::Delete has the following code:

        # hit up the server and delete it
        # TODO: (optimization) use MogileFS->get_observed_state and don't try to
        #       delete things known to be down/etc
        my $sock = IO::Socket::INET->new(PeerAddr => $urlparts->[0],
                                         PeerPort => $urlparts->[1],
                                         Timeout => 2);
        unless ($sock) {
            # timeout or something, mark this device as down for now and move on
            $self->broadcast_host_unreachable($dev->hostid);
            $reschedule_fid->(60 * 60 * 2, "no_sock_to_hostid");
            next;
        }

(which I've now terribly pasted).

If a host times out, the deleter broadcasts to all workers that the host is unreachable. I think this is a little excessive, but it should be okay because:

MogileFS::Worker::Monitor should re-broadcast a 'reachable' state within the next ten seconds, if the host is actually up and the timeout was a fluke.

Except the delete job is never getting that message, and the procmanager code prevents the job monitor's subsequent broadcasts from being sent to the deleter, since the status hasn't changed.

The symptom of this is any deletes destined for those hosts get cycled through file_to_delete_later and back again every 600 seconds.
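
To make the suspected failure mode concrete, here is a minimal standalone sketch. It is a hypothetical model and not the actual MogileFS::ProcManager code: the relay() function, the %last_relayed and %view hashes, the state strings and host id 7 are all made up for illustration. Why the first 'alive' message never reaches the deleter is exactly the unknown, so the sketch simply simulates that loss with a parameter.

```perl
use strict;
use warnings;

# Hypothetical model only; names and structure are assumptions,
# not taken from the MogileFS source.
my %last_relayed;                           # parent: hostid => last state it relayed
my %view = (monitor => {}, delete => {});   # each worker's own idea of host state

# Relay a host state to every worker, but only when it differs from the
# last state relayed for that host. $lost_for simulates the unexplained
# loss: the named worker misses this one message.
sub relay {
    my ($hostid, $state, $lost_for) = @_;

    # Duplicate of the last relayed state: suppressed, nobody hears it.
    return 0 if exists $last_relayed{$hostid}
             && $last_relayed{$hostid} eq $state;

    $last_relayed{$hostid} = $state;
    for my $worker (keys %view) {
        next if defined $lost_for && $worker eq $lost_for;
        $view{$worker}{$hostid} = $state;
    }
    return 1;
}

relay(7, 'unreachable');           # deleter hits a connect timeout
relay(7, 'alive', 'delete');       # monitor's correction, lost on the way to the deleter
my $resent = relay(7, 'alive');    # monitor repeats itself on its next pass

printf "resent=%d monitor=%s delete=%s\n",
    $resent, $view{monitor}{7}, $view{delete}{7};
# prints: resent=0 monitor=alive delete=unreachable
```

If the real code behaves like this model, one missed message is enough to leave the deleter stuck, because every later 'alive' broadcast looks like a duplicate and is dropped before it goes out.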

Not 100% sure I'm looking in the right place. Given the timeout (600 seconds) and that chunk of code, this is probably right? I should be able to verify this by sending !to commands to the delete job to say those hosts are back up, but I haven't gotten that to work yet.


Ideas?
-Dormando
