Hi Ivan,

Kindly take a look at https://issues.apache.org/jira/browse/BOOKKEEPER-215. It seems like a deadlock is occurring under high load.
Regards,
Aniruddha.

On Tue, Apr 10, 2012 at 7:42 AM, Ivan Kelly <[email protected]> wrote:

> > 2012-04-05 23:31:13,884 - DEBUG [New I/O client worker #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding entry: 283736 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
> > 2012-04-05 23:31:15,953 - DEBUG [New I/O client worker #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding entry: 283737 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
> >
> > Also, the SyncThread stops printing debug messages on the bookie after the
> > connection to the hedwig-server is closed. Only the GarbageCollector thread
> > and a thread that periodically gets the ledgers from ZK are printing debug
> > messages. Don't know if this is the expected behavior.
>
> Could you dump the stacktraces for all threads on the bookie
> (kill -QUIT <pid>) so we can see where the threads are? The SyncThread
> shouldn't be printing anything if the client has disconnected; if
> nothing has been written, nothing will sync.
>
> > There is provisioning for throttling in the NIOServerFactory class used by
> > bookkeeper, but it seems that it's not being implemented. The maximum
> > outstanding requests (outstandingLimit) is set at 2000. outstandingRequests
> > are being decremented on every call to NIOServerFactory.sendResponse but
> > not incremented anywhere. Consequently, OP_READ is not being disabled when
> > this threshold is reached. Which makes me wonder whether 3k outstanding
> > requests is high for the bookie. Is there a reason why throttling was
> > disabled? I could try implementing throttling and re-run the load test and
> > see how it goes.
>
> Yikes, this throttling seems to be a vestigial organ from long ago. It
> doesn't do anything. This hasn't manifested as a problem, but it could
> quite easily. Once we serve ~2 billion requests on a bookie, the bookie
> will block forever as the int will have looped around.
>
> > Another thing that bothers me is why hedwig doesn't try to update its
> > ensemble when all the bookies disconnect. In the case of the previous
> > single failure, it printed a "Write did not succeed ..." warning in the log
> > file and then updated the ensemble, but this is not happening when all the
> > bookies die. It just stops everything and only prints messages regarding a
> > ping response from zookeeper. ("Got ping response for sessionid:
> > 0x43630f5821e0053 after 0ms")
>
> What happens with the client threads? Have they completely hung? Is
> zookeeper still accessible?
>
> -Ivan
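
To make the counter problem discussed above concrete, here is a minimal sketch in Java. It is not the actual NIOServerFactory code: the class name ThrottleSketch, the hooks onRequestReceived/onResponseSent, and the helper decrementOnly are all hypothetical names made up for illustration. Only the limit of 2000 and the "decremented but never incremented" behaviour come from the thread. The first part shows what a balanced counter might look like; the helper shows why a decrement-only int eventually wraps to a large positive value.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only; not the real NIOServerFactory.
final class ThrottleSketch {
    // Limit quoted in the thread (outstandingLimit).
    static final int OUTSTANDING_LIMIT = 2000;

    private final AtomicInteger outstanding = new AtomicInteger(0);

    // Hypothetical hook: call when a request is read off a channel.
    // Returns true when the caller should stop reading (e.g. clear OP_READ).
    boolean onRequestReceived() {
        return outstanding.incrementAndGet() >= OUTSTANDING_LIMIT;
    }

    // Hypothetical hook: call when a response is sent.
    // Returns true when the caller may read again (e.g. set OP_READ).
    boolean onResponseSent() {
        return outstanding.decrementAndGet() < OUTSTANDING_LIMIT;
    }

    // Value a 32-bit int counter holds after n decrements starting from
    // 'start' with no matching increments. The cast keeps the low 32 bits,
    // which is the same result as repeated int subtraction with overflow.
    static int decrementOnly(int start, long n) {
        return (int) (start - n);
    }

    public static void main(String[] args) {
        ThrottleSketch t = new ThrottleSketch();
        boolean pause = false;
        for (int i = 0; i < OUTSTANDING_LIMIT; i++) {
            pause = t.onRequestReceived();
        }
        System.out.println("pause reads at limit: " + pause);
        System.out.println("resume after one response: " + t.onResponseSent());

        // 2^31 decrements reach Integer.MIN_VALUE; one more wraps around.
        long decrements = (1L << 31) + 1;
        int wrapped = decrementOnly(0, decrements);
        System.out.println("counter after wrap: " + wrapped);
        System.out.println("exceeds limit: " + (wrapped >= OUTSTANDING_LIMIT));
    }
}
```

With a decrement-only counter the value stays negative, so a check against the limit never fires, until the int wraps past Integer.MIN_VALUE to Integer.MAX_VALUE. From then on a limit check would always fire, and since nothing re-enables reads at that point the bookie would stop reading, which matches the "block forever" concern. Whether AtomicInteger is the right primitive depends on which threads touch the counter in the real code; that is an assumption here.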
