We are using multiple disks - 1 disk for the journal and 3 for the ledgers
on each machine.
I went through the logs again and here are my observations.
- The test fails with all bookies disconnecting from the hedwig hub and
hedwig not updating the ensemble.
- Some time (20 seconds) before all bookies get disconnected, there was an
instance where one of the bookies failed and hedwig updated the ensemble of
bookies successfully.
- I analyzed the hedwig-logs for this one bookie and found that around 170k
add entries were sent to the bookie over 7 minutes. i.e. there were around
170000 entries for which the hedwig-server was able to successfully write
an ADDENTRY request to this bookie. From these, all but the last 3000
entries got a response from the bookie. i.e. the log-file has a "Got
response for add request ..." message for that entry for the bookie in
question.
So, it seems that at some point in time, the bookie has accumulated around
3000 requests to add and has not acked any of them.
Entry 283736 is the last entry which the bookie acked. There is a
difference of 2 seconds between the time that the hedwig-server wrote an
ADDENTRY request successfully for this entry and for the entry after that
(which is not acked).
2012-04-05 23:31:13,884 - DEBUG [New I/O client worker
#1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding
entry: 283736 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
2012-04-05 23:31:15,953 - DEBUG [New I/O client worker
#1-3:PerChannelBookieClient$2@252] - Successfully wrote request for adding
entry: 283737 ledger-id: 25 bookie: /10.34.235.129:3181 entry length: 111
Also, the SyncThead stops printing debug messages on the bookie after the
connection to the hedwig-server is closed. Only the GarbageCollector thread
and a thread that periodically gets the ledgers from ZK are printing debug
messages. Don't know if this is the expected behavior.
There is provisioning for throttling in the NIOServerFactory class used by
bookkeeper, but it seems that it's not being implemented. The maximum
outstanding requests (outstandingLimit) is set at 2000. outstandingRequests
are being decremented on every call to NIOServerFactory.sendResponse but
not incremented anywhere. Consequently, OP_READ is not being disabled when
this threshold is reached. Which makes me think whether 3k outstanding
requests is high for the bookie. Is there a reason why throttling was
disabled? I could try implementing throttling and re-run the load test and
see how it goes.
This behavior of around 3k outstanding requests is seen in the other
bookies when they finally fail as well.
Another thing that bothers me is why hedwig doesn't try to update it's
ensemble when all the bookies disconnect. In the case of the previous
single failure, it printed a "Write did not succeed ..." warning in the log
file and then updated the ensemble, but this is not happening when all the
bookies die. It just stops everything and only prints messages regarding a
ping response from zookeeper. ("Got ping response for sessionid:
0x43630f5821e0053 after 0ms")
Regards,
Aniruddha.
On Wed, Apr 4, 2012 at 9:15 AM, Flavio Junqueira <[email protected]> wrote:
> The exceptions seem to indicate that they couldn't find files to compact.
> Without more information it is hard to tell, but it doesn't look like that
> is what's causing your problems. Also, the load you're imposing shouldn't
> be saturating, unless you're underprovisioning. Are you using two disks for
> each bookie?
>
> -Flavio
>
> On Apr 4, 2012, at 1:11 AM, Aniruddha Laud wrote:
>
> > There are no bookkeeper errors at around the time the hedwig hub
> disconnects from the bookies. However, a few minutes before the
> disconnection, I see some exceptions thrown in the bookkeeper log
> >
> > I've attached the exceptions in a file.
> >
> > We are trying to load test hedwig. Around 1000 QPS for one topic were
> sustained for about 30 minutes. We then cranked up the load to around 2000
> QPS for the same topic and we got this error. The setup is 15 hedwig hubs
> and 15 bookies with ensemble size of 5 and replication factor of 3.
> >
> > Regards,
> > Aniruddha.
> >
> > On Tue, Apr 3, 2012 at 2:09 AM, Ivan Kelly <[email protected]> wrote:
> > This type of disconnection occurs when there's a read timeout from one
> of the bookies. The cause could be something crashing on the bookie side,
> or simply a very slow network. What type of network are you running this in?
> > Do you have any logs on the bookie side?
> >
> > -Ivan
> >
> > On 3 Apr 2012, at 03:22, Aniruddha Laud wrote:
> >
> > > While sending requests to a hedwig hub, the hub seems to disconnect
> from
> > > the bookies and never connects back. The logfile contains
> > >
> > > 2012-04-02 22:33:09,207 - INFO [Hashed wheel timer
> > > #3:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.35.84.103:3181
> > > 2012-04-02 22:33:09,211 - INFO [Hashed wheel timer
> > > #4:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.34.133.114:3181
> > > 2012-04-02 22:33:09,214 - INFO [Hashed wheel timer
> > > #5:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.35.89.103:3181
> > > 2012-04-02 22:33:09,217 - INFO [Hashed wheel timer
> > > #8:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.35.91.102:3181
> > > 2012-04-02 22:33:09,247 - INFO [Hashed wheel timer
> > > #10:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.34.234.125:3181
> > > 2012-04-02 22:33:09,256 - INFO [Hashed wheel timer
> > > #7:PerChannelBookieClient@409] - Disconnected from bookie: /
> > > 10.34.235.129:3181
> > >
> > > Some time before getting this message, the "Got response for ..."
> messages
> > > stop and there are only "Successfully wrote request ..." messages in
> the
> > > hedwig log file. The bookkeeper log-file shows no indication of the
> > > connection being lost. All the bookies and hedwig hubs are up and
> running
> > > and I am able to connect to them with the hedwig console and able to
> create
> > > new topics and publish/subscribe to them. But I'm not able to publish
> or
> > > subscribe to the topic that caused the errors. About 200,000 entries
> were
> > > created in the topic that caused this error.
> > >
> > > I'm unable to attach the log files or even portions of it, because the
> > > relevant portions are around 3MB.
> > >
> > > Regards,
> > > Aniruddha.
> >
> >
> > <bookieexception.txt>
>
> flavio
> junqueira
> senior research scientist
>
> [email protected]
> direct +34 93-183-8828
>
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300 fax (408) 349 3301
>
>