> 2012-04-05 23:31:13,884 - DEBUG [New I/O client worker
> #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding
> entry: 283736 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
> 2012-04-05 23:31:15,953 - DEBUG [New I/O client worker
> #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding
> entry: 283737 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
>
> Also, the SyncThread stops printing debug messages on the bookie after the
> connection to the hedwig-server is closed. Only the GarbageCollector thread
> and a thread that periodically gets the ledgers from ZK are printing debug
> messages. Don't know if this is the expected behavior.
Could you dump the stack traces for all threads on the bookie
(kill -QUIT <pid>) so we can see where the threads are? The SyncThread
shouldn't be printing anything if the client has disconnected; if
nothing has been written, nothing will sync.
> There is provisioning for throttling in the NIOServerFactory class used by
> bookkeeper, but it seems that it's not being implemented. The maximum
> outstanding requests (outstandingLimit) is set at 2000. outstandingRequests
> are being decremented on every call to NIOServerFactory.sendResponse but
> not incremented anywhere. Consequently, OP_READ is not being disabled when
> this threshold is reached. Which makes me think whether 3k outstanding
> requests is high for the bookie. Is there a reason why throttling was
> disabled? I could try implementing throttling and re-run the load test and
> see how it goes.
Yikes, this throttling seems to be a vestigial organ from long ago. It
doesn't do anything. This hasn't manifested as a problem, but it could
quite easily. The counter is only ever decremented, so once we serve
~2 billion requests on a bookie, the int will have looped around to a
large positive value and the bookie will block forever.
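
To make the wraparound concrete, here is a minimal sketch. It is not the
NIOServerFactory code: the class and method names are made up, and I'm
assuming the counter is a plain int starting at 0 and that the throttle
check is "outstandingRequests > outstandingLimit". Only the limit of 2000
comes from the description above.

```java
public class OutstandingWraparound {
    // Limit quoted in this thread; everything else here is hypothetical.
    static final int OUTSTANDING_LIMIT = 2000;

    // Assumed shape of the throttle check (not taken from the real code).
    static boolean wouldThrottle(int outstandingRequests) {
        return outstandingRequests > OUTSTANDING_LIMIT;
    }

    // Value of an int counter that starts at 0 and is decremented n times.
    // Java int arithmetic wraps modulo 2^32, which the narrowing cast mirrors.
    static int afterDecrements(long n) {
        return (int) (0L - n);
    }

    public static void main(String[] args) {
        long toMinValue = 1L << 31;          // 2^31 responses reach Integer.MIN_VALUE
        int atMin = afterDecrements(toMinValue);
        int wrapped = afterDecrements(toMinValue + 1);

        System.out.println(atMin == Integer.MIN_VALUE);
        System.out.println(wouldThrottle(atMin));
        System.out.println(wrapped == Integer.MAX_VALUE);
        System.out.println(wouldThrottle(wrapped));
    }
}
```

Under those assumptions the check starts firing one response after the
counter reaches Integer.MIN_VALUE, and with nothing ever incrementing the
counter in a meaningful way it stays far above the limit from then on.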
>
> Another thing that bothers me is why hedwig doesn't try to update its
> ensemble when all the bookies disconnect. In the case of the previous
> single failure, it printed a "Write did not succeed ..." warning in the log
> file and then updated the ensemble, but this is not happening when all the
> bookies die. It just stops everything and only prints messages regarding a
> ping response from zookeeper. ("Got ping response for sessionid:
> 0x43630f5821e0053 after 0ms")
What happens with the client threads? Have they completely hung? Is
zookeeper still accessible?
-Ivan