Kevin Pullin created FLINK-8309:
-----------------------------------

Summary: JVM sigsegv crash when enabling async checkpoints
Key: FLINK-8309
URL: https://issues.apache.org/jira/browse/FLINK-8309
Project: Flink
Issue Type: Bug
Components: State Backends, Checkpointing
Affects Versions: 1.3.2, 1.4.0
Environment: macOS 10.13.2 & Ubuntu 16.04.03 using JVM 1.8.0_151.
Reporter: Kevin Pullin
Attachments: StreamingJob.scala

h4. Summary

I have a streaming job with async checkpointing enabled. The job crashes the JVM with a SIGSEGV error that coincides with checkpoint completion. Workarounds are noted below. I thought this was worth documenting in case someone runs into similar issues or a fix is possible.

h4. Job Overview & Observations

The job stores a large quantity of `case class` objects in `valueState`s contained within a `RichFilterFunction`. This data is used for deduplicating events.

The crash stops after any of the following changes:
- moving the case class outside of the anonymous RichFilterFunction class.
- reducing the number of objects stored in the valueState.
- reducing the size of the objects stored in the valueState.
- disabling async snapshots.

I can provide additional crash data as needed (core dumps, error logs, etc).

The StateBackend implementation doesn't matter; the job fails using the Memory, Fs, and RocksDb backends.

From what I understand, anonymous classes should be avoided with checkpointing because the name isn't stable, so moving the case class out of the anonymous class seems like the best route for me.

h4. Reproduction case

The attached `StreamingJob.scala` file contains a minimal repro case, which closely aligns with my actual job configuration. Running it consistently crashes the JVM upon completion of the first checkpoint.

My test runs set only two JVM options: -Xms4g -Xmx4g

h4. Crash output

Here's a crash captured from Ubuntu:

{noformat}
[info] #
[info] # A fatal error has been detected by the Java Runtime Environment:
[info] #
[info] # SIGSEGV (0xb) at pc=0x00007fd192b92c1c, pid=7191, tid=0x00007fd0873f3700
[info] #
[info] # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
[info] # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
[info] # Problematic frame:
[info] # C [libzip.so+0x5c1c]
[info] #
[info] # Core dump written. Default location: /home/ubuntu/flink-project/core or core.7191
[info] #
[info] # An error report file with more information is saved as:
[info] # /home/XXX/flink-project/hs_err_pid7191.log
[info] Compiled method (nm) 71547 81 n 0 java.util.zip.ZipFile::getEntry (native)
[info] total in heap [0x00007fd17d12e290,0x00007fd17d12e600] = 880
[info] relocation [0x00007fd17d12e3b8,0x00007fd17d12e400] = 72
[info] main code [0x00007fd17d12e400,0x00007fd17d12e600] = 512
[info] #
[info] # If you would like to submit a bug report, please visit:
[info] # http://bugreport.java.com/bugreport/crash.jsp
[info] #
{noformat}

And one from macOS:

{noformat}
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x0000000105264c48, pid=30848, tid=0x0000000000003403
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V [libjvm.dylib+0x464c48]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/XXX/src/etc_flink_mwx/hs_err_pid30848.log
[thread 30211 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
{noformat}
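
The remark under "Job Overview & Observations" about anonymous class names not being stable can be illustrated with plain JDK code. The following is a minimal sketch, not taken from the attached `StreamingJob.scala`: it uses no Flink APIs, it is written in Java rather than Scala, and the names `AnonNameDemo`, `Filter`, `Event`, and `InnerEvent` are hypothetical. It only shows how the compiler derives the binary name of a class declared inside an anonymous class; it does not reproduce the crash or explain its cause.

```java
public class AnonNameDemo {

    // Hypothetical stand-in for a filter-like interface (not a Flink type).
    public interface Filter {
        String stateClassName();
    }

    // Named nested class: its binary name follows directly from its declaration.
    public static final class Event {
        final String id;

        Event(String id) {
            this.id = id;
        }
    }

    public static Filter anonymousFilter() {
        // Anonymous class: the compiler assigns it a numbered, synthetic name.
        return new Filter() {
            // Class declared inside the anonymous class, loosely mirroring a
            // case class declared inside an anonymous RichFilterFunction.
            class InnerEvent {
                final String id;

                InnerEvent(String id) {
                    this.id = id;
                }
            }

            @Override
            public String stateClassName() {
                return InnerEvent.class.getName();
            }
        };
    }

    public static void main(String[] args) {
        Filter filter = anonymousFilter();
        // Compiled as-is with javac in the default package, this prints:
        //   AnonNameDemo$1
        //   AnonNameDemo$1$InnerEvent
        //   AnonNameDemo$Event
        System.out.println(filter.getClass().getName());
        System.out.println(filter.stateClassName());
        System.out.println(Event.class.getName());
    }
}
```

The `$1` component is a number the compiler assigns to the anonymous class, so the name of `InnerEvent` depends on which other anonymous classes the enclosing class contains, while `Event` keeps the same name however the surrounding code changes. scalac likewise generates synthetic names for anonymous classes.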