Hi Ivan,

Kindly take a look at https://issues.apache.org/jira/browse/BOOKKEEPER-215.
It seems like a deadlock is occurring under high load.

Regards,
Aniruddha.

On Tue, Apr 10, 2012 at 7:42 AM, Ivan Kelly <[email protected]> wrote:

> > 2012-04-05 23:31:13,884 - DEBUG [New I/O client worker #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding entry: 283736 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
> > 2012-04-05 23:31:15,953 - DEBUG [New I/O client worker #1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding entry: 283737 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
> >
> > Also, the SyncThread stops printing debug messages on the bookie after the
> > connection to the hedwig-server is closed. Only the GarbageCollector thread
> > and a thread that periodically gets the ledgers from ZK are printing debug
> > messages. Don't know if this is the expected behavior.
>
> Could you dump the stack traces for all threads on the bookie
> (kill -QUIT <pid>) so we can see where the threads are? The SyncThread
> shouldn't be printing anything if the client has disconnected; if
> nothing has been written, nothing will sync.
>
> > There is provisioning for throttling in the NIOServerFactory class used by
> > bookkeeper, but it seems that it's not actually implemented. The maximum
> > outstanding requests (outstandingLimit) is set at 2000. outstandingRequests
> > is decremented on every call to NIOServerFactory.sendResponse but
> > not incremented anywhere. Consequently, OP_READ is not being disabled when
> > this threshold is reached. This makes me wonder whether 3k outstanding
> > requests is high for the bookie. Is there a reason why throttling was
> > disabled? I could try implementing throttling and re-run the load test and
> > see how it goes.
> Yikes, this throttling seems to be a vestigial organ from long ago. It
> doesn't do anything. This hasn't manifested as a problem, but it could
> quite easily. Once we serve ~2 billion requests on a bookie, the bookie
> will block forever as the int will have looped around.
>
> >
> > Another thing that bothers me is why hedwig doesn't try to update its
> > ensemble when all the bookies disconnect. In the case of the previous
> > single failure, it printed a "Write did not succeed ..." warning in the log
> > file and then updated the ensemble, but this is not happening when all the
> > bookies die. It just stops everything and only prints messages regarding a
> > ping response from zookeeper. ("Got ping response for sessionid:
> > 0x43630f5821e0053 after 0ms")
> What happens with the client threads? Have they completely hung? Is
> zookeeper still accessible?
>
> -Ivan
>
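
For reference, the throttling discussed in the quoted thread (a counter that is decremented in sendResponse but never incremented) could be balanced roughly as in the sketch below. This is only an illustration, not the NIOServerFactory code: the class ThrottleSketch, its method names, and the readEnabled flag are hypothetical. In a real NIO server the flag would presumably correspond to adding or removing SelectionKey.OP_READ from a key's interest set, and the exact comparison used against outstandingLimit is an assumption.

```java
// Hypothetical sketch of balanced request throttling; names are illustrative
// and are not taken from the BookKeeper source.
final class ThrottleSketch {
    private final int outstandingLimit;
    private int outstandingRequests = 0;
    // Stands in for "OP_READ is enabled on the channel" (an assumption).
    private boolean readEnabled = true;

    ThrottleSketch(int outstandingLimit) {
        this.outstandingLimit = outstandingLimit;
    }

    // Call when a request has been read from a client: this is the
    // increment that the thread reports as missing.
    synchronized void requestReceived() {
        outstandingRequests++;
        if (outstandingRequests >= outstandingLimit) {
            readEnabled = false; // stop reading new requests
        }
    }

    // Call when a response has been sent back to the client.
    synchronized void responseSent() {
        outstandingRequests--;
        if (outstandingRequests < outstandingLimit) {
            readEnabled = true; // resume reading
        }
    }

    synchronized boolean isReadEnabled() {
        return readEnabled;
    }

    synchronized int outstanding() {
        return outstandingRequests;
    }

    public static void main(String[] args) {
        ThrottleSketch t = new ThrottleSketch(2); // tiny limit for the demo
        t.requestReceived();
        t.requestReceived();
        System.out.println(t.isReadEnabled()); // false: limit reached
        t.responseSent();
        System.out.println(t.isReadEnabled()); // true: below the limit again
    }
}
```

With every increment matched by a decrement, the counter stays between zero and the limit, so it reflects the number of requests actually in flight.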

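Ivan's remark about the bookie blocking after roughly 2 billion requests follows from Java int arithmetic: a counter that is only ever decremented eventually wraps from Integer.MIN_VALUE to Integer.MAX_VALUE, which is far above the limit of 2000. The sketch below shows just that arithmetic. The class CounterWrapSketch and its methods are hypothetical, and the "greater than limit" check is an assumption about how the threshold would be tested.

```java
// Hypothetical illustration of the int wrap-around; not BookKeeper code.
final class CounterWrapSketch {
    static final int OUTSTANDING_LIMIT = 2000; // value quoted in the thread

    // Assumed form of the throttle check.
    static boolean overLimit(int outstandingRequests) {
        return outstandingRequests > OUTSTANDING_LIMIT;
    }

    // One unmatched decrement, as happens on each response sent.
    static int decrement(int outstandingRequests) {
        return outstandingRequests - 1; // int arithmetic wraps silently
    }

    public static void main(String[] args) {
        // Starting from 0, 2^31 unmatched decrements reach Integer.MIN_VALUE.
        int counter = Integer.MIN_VALUE;
        System.out.println(overLimit(counter)); // false: hugely negative

        // The next decrement wraps around to the largest positive int.
        counter = decrement(counter);
        System.out.println(counter == Integer.MAX_VALUE); // true
        System.out.println(overLimit(counter)); // true: throttle would engage
    }
}
```

Since nothing increments the counter, it would then take roughly another 2 billion responses to fall back under the limit, and a server that has stopped reading requests sends no responses, which is consistent with the "block forever" outcome described above.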