Hi Aniruddha,

On Sep 8, 2012, at 4:55 AM, Aniruddha Laud wrote:

> One of our hedwig hubs was stuck on reading a particular entry from
> bookkeeper. The entry it was trying to read did not exist in any of the
> bookies from the ensemble responsible for that entry (ensemble information
> obtained from the zookeeper entry). However, some other bookies that were
> not in that ensemble did have that entry.

I suppose that you ran into crash scenarios. Otherwise I'm not sure how
bookies outside the ensemble could have the entry. Is this correct?

> We have a quorum size of 3 and ensemble size of 4, so the expected behavior
> would be for every 4th entry to be absent for a ledger on any bookie from
> that ensemble.

Say you have b1, b2, b3, b4 as bookies of your ledger ensemble. Every 4th
entry should be stored on b1, b2, b4 if your write quorum has size 3. Do
you agree?

> However, this was not the case. Some bookies had gaps
> greater than 1 for that ledger, while in some places, the gap was 0. The
> ensemble was changed for the same ledger-id, start-entry-id pair many times
> (around 25) over a period of 14 minutes.

It is not clear whether the changes to the ensemble were induced or whether
the ensemble was changing without any apparent problem. The behavior seems
odd, though, and it would be great to see some logs.

> After the last "Unsetting success for ledger ... " message from
> PendingAddOp for that particular (ledger, entry) pair, the ensemble changes
> at least 2 times with the same startEntryId, but we don't see any
> "Unsetting success messages".

Is this the same entry you couldn't read?

> None of the fields from LedgerHandle or PendingAddOp are thread safe, yet,
> it seems that they could be accessed from different threads. For example,
> it seems like PendingAddOp#writeComplete is called from a different thread
> than PendingAddOp#unsetSuccessAndSendWriteRequest.

You might be right here. These two methods seem to be called from different
threads, and they both update numResponsesPending, which is just an int.
We need to look into this further.

> I took a look at BOOKKEEPER-337 but I'm not sure if that fixes this. Does
> it?

It is still a bit unclear what the problem is, so I can't really tell.

> Any insight would be helpful. Also, is there any way to recover from this
> :) ?

If the entry still exists in a bookie, it is possible to read directly from
that bookie and recover the data. I believe Sijie developed some tools for
this, but I can't find the jira number now.

-Flavio
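
P.S. To make the striping argument above concrete, here is a minimal sketch
of round-robin placement for an ensemble of 4 and a write quorum of 3. It
assumes entry e is written to the bookies at positions e, e+1, e+2 (mod 4)
of the ensemble; the class StripingSketch and the helper writeSet are
hypothetical names for illustration, not the client's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class StripingSketch {
    // Hypothetical helper: 0-based ensemble positions of the bookies that
    // should hold entryId, assuming plain round-robin striping.
    public static List<Integer> writeSet(long entryId, int ensembleSize, int quorumSize) {
        List<Integer> set = new ArrayList<>();
        for (int i = 0; i < quorumSize; i++) {
            set.add((int) ((entryId + i) % ensembleSize));
        }
        return set;
    }

    public static void main(String[] args) {
        String[] ensemble = {"b1", "b2", "b3", "b4"};
        // Print the expected placement of the first eight entries.
        for (long entryId = 0; entryId < 8; entryId++) {
            StringBuilder sb = new StringBuilder("entry " + entryId + " ->");
            for (int idx : writeSet(entryId, ensemble.length, 3)) {
                sb.append(' ').append(ensemble[idx]);
            }
            System.out.println(sb);
        }
    }
}
```

Under that assumption, entry 3 lands on b4, b1, b2 and is absent from b3,
so each bookie should miss exactly one entry out of every four. Gaps larger
than 1, or no gaps at all, would not match this placement.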
