RaulGracia commented on issue #3466:
URL: https://github.com/apache/bookkeeper/issues/3466#issuecomment-1341079895

   Hi all, I think we could have another reproduction of this problem, this 
time using Bookkeeper 4.14.2 client/server (therefore, the issue reported here 
may be present in other versions of Bookkeeper, not only in Bookkeeper 4.15).
   
   This is the information I have been able to recollect so far (the missing 
ledger this time is `3538`):
   
   1. Ledger is created correctly: 
   ```
   2022-12-05 15:21:11,541 8491908 [core-23] INFO  
i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Created Ledger 3538.
   ``` 
   
   2. Pravega can write and process properly up to entry `52`:
   ```
   2022-12-05 15:26:53,197 8833564 [BookKeeperClientWorker-OrderedExecutor-2-0] 
DEBUG o.a.bookkeeper.client.PendingAddOp - Submit callback (lid:3538, eid: 52). 
rc:0
   2022-12-05 15:26:53,197 8833564 [core-17] DEBUG 
i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId 
= 51, LastFullySerializedSN = 505834, LastStartedSN = 505834, Address = 
Sequence = 17179869235, LedgerId = 3538, EntryId = 51, Length = 35244).
   2022-12-05 15:26:53,197 8833564 [core-28] DEBUG 
i.p.s.s.logs.OperationProcessor - OperationProcessor[29]: CommitSuccess (TxnId 
= 52, LastFullySerializedSN = 506120, LastStartedSN = 506120, Address = 
Sequence = 17179869236, LedgerId = 3538, EntryId = 52, Length = 783197).
   ```
   
   3. Post that, we see that, probably due to the network fault-injection test 
induced, there is a ensamble change that impacts entry `53` onwards for that 
ledger:
   ```
   2022-12-05 15:27:07,801 8848168 [BookKeeperClientWorker-OrderedExecutor-2-0] 
DEBUG o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, 
change-id:0000000001)][attempt:1] changing ensemble from: 
[bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445] to: 
[bookkeeper-bookie-2-13057, bookkeeper-bookie-0-27781] starting at entry: 53
   ```
   
   4. We keep getting some errors and acknowledgements writing entries from 
`53` onwards, until we get this error:
   ```
   2022-12-05 15:27:15,431 8855798 [ZKC-connect-executor-0-EventThread] ERROR 
o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=1ec0500f) 
Error writing metadata to store
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,432 8855799 [BookKeeperClientWorker-OrderedExecutor-2-0] 
WARN  o.a.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:3538, 
change-id:0000000001)][attempt:1] Exception changing ensemble
     Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:15,433 8855800 [BookKeeperClientWorker-OrderedExecutor-2-0] 
ERROR o.a.bookkeeper.client.LedgerHandle - Closing ledger 3538 due to 
ZKException: Error while using ZooKeeper
   ```
   
   5. After that, the zNode for that very same ledger `3538` seems to be 
deleted:
   ```
   2022-12-05 15:27:19,429 8859796 
[ZKC-connect-executor-0-SendThread(zookeeper-client:2181)] DEBUG 
org.apache.zookeeper.ClientCnxn - Got WatchedEvent state:SyncConnected 
type:NodeDeleted path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 for 
session id 0x3004cfb266c0003
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG 
o.a.b.meta.AbstractZkLedgerManager - Received watched event WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/pravega/pravega/bookkeeper/ledgers/00/0000/L3538 from zookeeper based 
ledger manager.
   2022-12-05 15:27:19,429 8859796 [ZKC-connect-executor-0-EventThread] DEBUG 
o.a.b.meta.AbstractZkLedgerManager - No ledger metadata listeners to remove 
from ledger 3538 after it's deleted.
   ```
   
   6. And it seems deleted because, after that, we just see `NoNodeException` 
when we try to access the Zookeeper metadata of this ledger:
   ```
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] DEBUG 
o.a.b.meta.AbstractZkLedgerManager - No such ledger: 3538
   
segmentstore_2022-12-05_15.640.log.gz:org.apache.zookeeper.KeeperException$NoNodeException:
 KeeperErrorCode = NoNode for /pravega/pravega/bookkeeper/ledgers/00/0000/L3538
   2022-12-05 15:27:23,218 8863585 [ZKC-connect-executor-0-EventThread] ERROR 
o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=3538,loopId=4e9e5246) 
Failed to read metadata from store
   2022-12-05 15:27:23,219 8863586 [core-19] ERROR 
i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Unable to close LedgerHandle for 
Ledger 3538.
      io.pravega.segmentstore.storage.DurableDataLogException: Unable to close 
ledger 3538.
   ```
   
   Once the issue was reported, I tried to dig a little more with the Pravega 
and Bookkeeper CLIs:
   
   - From Pravega side, if we inspect the `BookkeeperLog[29]`, we clearly see 
that ledger `3538` is not accessible:
   ```
   > bk details 29
   {
     "key": "log_summary",
     "value": {
       "logId": 29,
       "epoch": 9,
       "version": 511,
       "enabled": true,
       "ledgers": 4,
       "truncation": "Sequence = 12884902323, LedgerId = 3491, EntryId = 435"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3491,
       "seq": 3,
       "status": "NotEmpty",
       "lac": 971,
       "length": 0,
       "numBookies": 3,
       "numFragments": 2,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: 
[bookkeeper-bookie-0-27781,bookkeeper-bookie-1-21445],972: 
[bookkeeper-bookie-3-9007,bookkeeper-bookie-1-21445]"
     }
   }
   Exception executing BK details command: Unable to open-read ledger 3538.
        Ledger 3538: Seq = 4, Status = NotEmpty. BK: Unable to open-read ledger 
3538.
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3720,
       "seq": 8,
       "status": "Empty",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-2-13057,bookkeeper-bookie-3-9007]"
     }
   }
   {
     "key": "ledger_details",
     "value": {
       "ledger": 3738,
       "seq": 9,
       "status": "Unknown",
       "lac": -1,
       "length": 0,
       "numBookies": 2,
       "numFragments": 1,
       "ensembleSize": 2,
       "writeQuorumSize": 2,
       "ackQuorumSize": 2,
       "ensembles": "0: [bookkeeper-bookie-0-27781,bookkeeper-bookie-3-9007]"
     }
   }
   ```
   
   - Using the Bookkeeper CLI, if we list all the ledgers in the system, we can 
see ledger `3538`:
   ```
   bin/bookkeeper shell listledgers
   ...
   17:39:21,474 INFO  ledgerID: 3537
   17:39:21,474 INFO  ledgerID: 3538     <<<<<< HERE IT IS
   17:39:21,474 INFO  ledgerID: 3539
   ...
   ```
   - However, it is not possible to get its contents or metadata:
   
   ```
   ===== LEDGER: 3538 =====
   master key  : 850bf1071c5e3d8c24235676f8816ae0cbe2f14f
   size         : 0
   entries      : 0
   isFenced     : true
   ===== LEDGER: 3538 =====
   ```
   
   ```
   bin/bookkeeper shell readledger -ledgerid 3538
   
   17:37:18,747 INFO  Successfully connected to bookie: 
bookkeeper-bookie-1-21445 [id: 0x45c77aa8, L:/10.128.0.116:33870 - 
R:10.129.0.171/10.129.0.171:3181]
   17:37:18,748 INFO  connection [id: 0x45c77aa8, L:/10.128.0.116:33870 - 
R:10.129.0.171/10.129.0.171:3181] authenticated as 
BookKeeperPrincipal{ANONYMOUS}
   17:37:18,755 ERROR Read of ledger entry failed: L3538 E0-E0, Sent to 
[bookkeeper-bookie-2-13057, bookkeeper-bookie-1-21445], Heard from [] : bitset 
= {}, Error = 'No such ledger exists on Bookies'. First unread entry is (-1, rc 
= null)
   17:37:18,761 ERROR Error reading entry 0 from ledger 3538
   org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsException: No 
such ledger exists on Bookies
        at 
org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
 
~[org.apache.bookkeeper-bookkeeper-server-4.14.3-build-223.jar:4.14.3-build-223]
   ```
   
   ```
   bin/bookkeeper shell ledgermetadata -ledgerid 3538
   
    17:42:00,323 WARN  An exception was thrown while closing send thread for 
session 0x2004d6096890377.
   org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read 
additional data from server sessionid 0x2004d6096890377, likely server has 
closed socket
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) 
~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
        at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
 ~[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1275) 
[org.apache.zookeeper-zookeeper-3.6.2.jar:3.6.2]
   Exception in thread "main" 
com.google.common.util.concurrent.UncheckedExecutionException: 
java.util.concurrent.ExecutionException: 
org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsOnMetadataServerException:
 No such ledger exists on Metadata Server
   ```
   
   How inducing delays at the network level in the pods leads very rarely to 
this situation is something that I still don't know. I'm also attaching the 
relevant log lines from Pravega Segment Store for ledger `3538` 
([ledger-3538.log](https://github.com/apache/bookkeeper/files/10177104/ledger-3538.log)).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to