[ https://issues.apache.org/jira/browse/KAFKA-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516900#comment-17516900 ]
Peter Cipov commented on KAFKA-13684:
-------------------------------------

Hello [~guozhang],

We have investigated this further and were able to track down the cause: our workloads did not have enough native memory, which is consumed by RocksDB. The issue manifested itself differently each time (different threads, different stacks), but the OS always sent the same signal.

So I guess this is not a bug in Kafka Streams or RocksDB as such; nevertheless, the error message was pretty cryptic.

The issue can be closed.

> KStream rebalance can lead to JVM process crash when network issues occure
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-13684
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13684
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.1
>            Reporter: Peter Cipov
>            Priority: Critical
>         Attachments: crash-dump.log, crash-logs.csv
>
>
> Hello,
> Sporadically, a KStream rebalance leads to a segmentation fault:
> {code:java}
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
> {code}
> I have spotted it occurring when:
> 1) there are some intermittent connection issues.
>    I have found org.apache.kafka.common.errors.DisconnectException in the logs during rebalance
> 2) a lot of partitions are shifted due to a Kafka Streams cluster rebalance
>
> Crash stack:
> {code:java}
> Current thread (0x00007f5bf407a000): JavaThread "app-blue-v6-StreamThread-2" [_thread_in_native, id=231, stack(0x00007f5bdc2ed000,0x00007f5bdc3ee000)]
>
> Stack: [0x00007f5bdc2ed000,0x00007f5bdc3ee000], sp=0x00007f5bdc3ebe30, free space=1019k
> Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
> C [libc.so.6+0x37ab7] abort+0x297
>
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 8080 org.rocksdb.WriteBatch.put(J[BI[BIJ)V (0 bytes) @ 0x00007f5c857ca520 [0x00007f5c857ca4a0+0x0000000000000080]
> J 8835 c2 org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.prepareBatchForRestore(Ljava/util/Collection;Lorg/rocksdb/WriteBatch;)V (52 bytes) @ 0x00007f5c858dccb4 [0x00007f5c858dcb60+0x0000000000000154]
> J 9779 c1 org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(Ljava/util/Collection;)V (147 bytes) @ 0x00007f5c7ef7b7e4 [0x00007f5c7ef7b360+0x0000000000000484]
> J 8857 c2 org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter.lambda$adapt$0(Lorg/apache/kafka/streams/processor/StateRestoreCallback;Ljava/util/Collection;)V (73 bytes) @ 0x00007f5c858f86dc [0x00007f5c858f8500+0x00000000000001dc]
> J 9686 c1 org.apache.kafka.streams.processor.internals.StateRestoreCallbackAdapter$$Lambda$937.restoreBatch(Ljava/util/Collection;)V (9 bytes) @ 0x00007f5c7dff7bb4 [0x00007f5c7dff7b40+0x0000000000000074]
> J 9683 c1 org.apache.kafka.streams.processor.internals.ProcessorStateManager.restore(Lorg/apache/kafka/streams/processor/internals/ProcessorStateManager$StateStoreMetadata;Ljava/util/List;)V (176 bytes) @ 0x00007f5c7e71af4c [0x00007f5c7e719740+0x000000000000180c]
> J 8882 c2 org.apache.kafka.streams.processor.internals.StoreChangelogReader.restoreChangelog(Lorg/apache/kafka/streams/processor/internals/StoreChangelogReader$ChangelogMetadata;)Z (334 bytes) @ 0x00007f5c859052ec [0x00007f5c85905140+0x00000000000001ac]
> J 12689 c2 org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(Ljava/util/Map;)V (412 bytes) @ 0x00007f5c85ce98d4 [0x00007f5c85ce8420+0x00000000000014b4]
> J 12688 c2 org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase()V (214 bytes) @ 0x00007f5c85ce580c [0x00007f5c85ce5540+0x00000000000002cc]
> J 17654 c2 org.apache.kafka.streams.processor.internals.StreamThread.runOnce()V (725 bytes) @ 0x00007f5c859960e8 [0x00007f5c85995fa0+0x0000000000000148]
> j org.apache.kafka.streams.processor.internals.StreamThread.runLoop()Z+61
> j org.apache.kafka.streams.processor.internals.StreamThread.run()V+36
> v ~StubRoutines::call_stub
>
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
> {code}
> I attached the whole Java crash dump and a digest from our logs.
> It is executed on Azul JDK 11
> Kafka Streams 2.8.1
>

--
This message was sent by Atlassian Jira (v8.20.1#820001)
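
The comment above observes that the crashes showed different threads and stacks but always the same OS signal. When comparing several crash dumps, it may help to extract just the `siginfo` line from each one. The following is a minimal sketch of that idea, not part of Kafka or the JDK: the class name `SigInfoParser`, its nested `SigInfo` type, and the regular expression are hypothetical, and the pattern assumes the exact line layout quoted in this report.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Kafka or JDK API): extracts the signal fields
// from a "siginfo:" line as it appears in the crash dump quoted above.
public final class SigInfoParser {

    // Plain value holder for the parsed fields.
    public static final class SigInfo {
        public final int signo;          // e.g. 11
        public final String signalName;  // e.g. SIGSEGV
        public final int code;           // e.g. 128
        public final String codeName;    // e.g. SI_KERNEL
        public final String addr;        // e.g. 0x0000000000000000

        SigInfo(int signo, String signalName, int code, String codeName, String addr) {
            this.signo = signo;
            this.signalName = signalName;
            this.code = code;
            this.codeName = codeName;
            this.addr = addr;
        }
    }

    // Assumed layout, taken from the report:
    // siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
    private static final Pattern SIGINFO = Pattern.compile(
            "siginfo:\\s*si_signo:\\s*(\\d+)\\s*\\((\\w+)\\),"
            + "\\s*si_code:\\s*(\\d+)\\s*\\((\\w+)\\),"
            + "\\s*si_addr:\\s*(0x[0-9a-fA-F]+)");

    // Returns the parsed fields, or an empty Optional if the line does not match.
    public static Optional<SigInfo> parse(String line) {
        Matcher m = SIGINFO.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new SigInfo(
                Integer.parseInt(m.group(1)), m.group(2),
                Integer.parseInt(m.group(3)), m.group(4),
                m.group(5)));
    }

    public static void main(String[] args) {
        String line = "siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), "
                + "si_addr: 0x0000000000000000";
        parse(line).ifPresent(s ->
                System.out.println(s.signalName + " / " + s.codeName + " at " + s.addr));
    }
}
```

Running `parse` over the `siginfo` line of each crash dump and comparing the results is one way to confirm that differently shaped crashes share a signal, which here pointed at the environment (native memory) rather than at one specific code path.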