Hey all,

I'm completely sucking at finding a bug. Allow me to talk at you in hopes that someone knows the answer offhand.

MogileFS::Worker::Delete has the following code:

        # hit up the server and delete it
        # TODO: (optimization) use MogileFS->get_observed_state and don't try to
        # delete things known to be down/etc
        my $sock = IO::Socket::INET->new(PeerAddr => $urlparts->[0],
                                         PeerPort => $urlparts->[1],
                                         Timeout => 2);
        unless ($sock) {
            # timeout or something, mark this device as down for now and move on
            $self->broadcast_host_unreachable($dev->hostid);
            $reschedule_fid->(60 * 60 * 2, "no_sock_to_hostid");
            next;
        }

(which I've now terribly pasted).

If a host times out, the deleter broadcasts to all workers that the host is unreachable. I think this is a little excessive, but it should be okay because:

MogileFS::Worker::Monitor should re-broadcast a 'reachable' state within the next ten seconds, if the host is actually up and the timeout was a fluke.

Except the delete job is never getting that message, and the procmanager code prevents the job monitor's subsequent broadcasts from being sent to the deleter, since the status hasn't changed.
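
To make the suspected sequence concrete, here's a toy model of it. This is not the actual procmanager code: the sub names, the two hashes, and host id 7 are all made up for illustration, and it assumes the "status hasn't changed" check compares only against what the monitor last reported, so a broadcast from the deleter never resets it.

    use strict;
    use warnings;

    # Hypothetical stand-ins, not MogileFS data structures.
    my %last_from_monitor;   # hostid => last state relayed on the monitor's behalf
    my %deleter_view;        # hostid => what the delete job currently believes

    # Relay a monitor broadcast, but only if the state differs from the
    # last one the monitor reported (the assumed "no change" suppression).
    sub relay_from_monitor {
        my ($hostid, $state) = @_;
        my $prev = $last_from_monitor{$hostid};
        return 0 if defined $prev && $prev eq $state;   # suppressed
        $last_from_monitor{$hostid} = $state;
        $deleter_view{$hostid}      = $state;
        return 1;
    }

    # Relay a broadcast that originated in the deleter. Note that
    # %last_from_monitor is left untouched here.
    sub relay_from_deleter {
        my ($hostid, $state) = @_;
        $deleter_view{$hostid} = $state;
        return 1;
    }

    relay_from_monitor(7, 'reachable');      # relayed: deleter sees host 7 as up
    relay_from_deleter(7, 'unreachable');    # fluke timeout in the deleter
    my $relayed = relay_from_monitor(7, 'reachable');   # same state as before

    printf "relayed=%d deleter_view=%s\n", $relayed, $deleter_view{7};
    # prints "relayed=0 deleter_view=unreachable"

If the real code behaves anything like this, the deleter stays stuck on 'unreachable' until the monitor actually sees a state change for that host.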

The symptom is that any deletes destined for those hosts get cycled through file_to_delete_later and back again every 600 seconds.

Not 100% sure I'm looking in the right place. Given the timeout (600 seconds) and that chunk of code, this is probably right? I should be able to verify this by sending !to commands to the delete job to say those hosts are back up, but I haven't gotten that to work yet.

Ideas?
-Dormando
