dlg99 commented on issue #3734:
URL: https://github.com/apache/bookkeeper/issues/3734#issuecomment-1409599047

@hangc0276 Thank you for looking at this problem!

> I suggest reverting the PR https://github.com/apache/bookkeeper/pull/3653 on branch-4.14 and branch-4.15. For the master branch, we keep the PR and try to upgrade the RocksDB version to 7.8+ to see if the segfault issue is resolved.

This means that confirmation of the fix is pushed into the remote future; Pulsar 2.10/2.11 use BK 4.15, IIRC.
I think we should still try to upgrade RocksDB. I'd be OK with the upgraded DB being backported to 4.14/4.15 if we can guarantee a safe downgrade.

We've currently downgraded BK on prod, so this problem is no longer happening. Unfortunately, that means I don't have any logs/dumps, and it really only happened once.

I've spent some time experimenting with the code and injecting errors.
With this:

```java
diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
index 3f6d1ae55b..03acfecc87 100644
--- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
+++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java
@@ -26,6 +26,8 @@ import java.io.IOException;
 import java.util.Map.Entry;
 import java.util.Set;
 import java.util.concurrent.TimeUnit;
+
+import lombok.SneakyThrows;
 import org.apache.bookkeeper.bookie.Bookie;
 import org.apache.bookkeeper.bookie.EntryLocation;
 import org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorage.Batch;
@@ -189,6 +191,7 @@ public class EntryLocationIndex implements Closeable {
         deletedLedgers.add(ledgerId);
     }

+    @SneakyThrows
     public void removeOffsetFromDeletedLedgers() throws IOException {
         LongPairWrapper firstKeyWrapper = LongPairWrapper.get(-1, -1);
         LongPairWrapper lastKeyWrapper = LongPairWrapper.get(-1, -1);
@@ -202,6 +205,7 @@ public class EntryLocationIndex implements Closeable {
         log.info("Deleting indexes for ledgers: {}", ledgersToDelete);
         long startTime = System.nanoTime();

+        locationsDb.close();
         try (Batch batch = locationsDb.newBatch()) {
             for (long ledgerId : ledgersToDelete) {
                 if (log.isDebugEnabled()) {
@@ -213,7 +217,6 @@ public class EntryLocationIndex implements Closeable {

                 batch.deleteRange(firstKeyWrapper.array, lastKeyWrapper.array);
             }

-            batch.flush();
             for (long ledgerId : ledgersToDelete) {
                 deletedLedgers.remove(ledgerId);
```

I got a RocksDB segfault:

```
---------------  T H R E A D  ---------------

Current thread (0x00007f9dc800d000):  JavaThread "main" [_thread_in_native, id=6147, stack(0x0000700003b4f000,0x0000700003c4f000)]

Stack: [0x0000700003b4f000,0x0000700003c4f000],  sp=0x0000700003c4d2c0,  free space=1016k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [librocksdbjni13563433824350328902.jnilib+0x22e1c]  Java_org_rocksdb_RocksDB_write0+0x1c
j  org.rocksdb.RocksDB.write0(JJJ)V+0
#
```

with [this dump](https://gist.github.com/dlg99/0459323e8a6fa0d47ac2215349e866b4).

This does not look exactly like the original case and is more similar to https://github.com/apache/bookkeeper/pull/3043, but the question is: is it possible that some other RocksDB calls should not run concurrently, such as an index update on a deleted range?

I've tried injecting a few other errors/cases, but so far without additional success.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
