[
https://issues.apache.org/jira/browse/FLINK-8309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin Pullin closed FLINK-8309.
-------------------------------
Resolution: Resolved
> JVM sigsegv crash when enabling async checkpoints
> -------------------------------------------------
>
> Key: FLINK-8309
> URL: https://issues.apache.org/jira/browse/FLINK-8309
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.4.0, 1.3.2
> Environment: macOS 10.13.2 & Ubuntu 16.04.03 using JVM 1.8.0_151.
> Reporter: Kevin Pullin
> Attachments: StreamingJob.scala
>
>
> h4. Summary
> I have a streaming job with async checkpointing enabled. The job is crashing
> the JVM with a SIGSEGV error coinciding with checkpoint completion.
> Workarounds are noted below. I thought this was worth documenting in case
> someone runs into similar issues or if a fix is possible.
> h4. Job Overview & Observations
> The job itself stores a large quantity of `case class` objects in
> `valueState`s contained within a `RichFilterFunction`. This data is used for
> deduplicating events.
> Any of the following changes stops the crash:
> - moving the case class outside of the anonymous RichFilterFunction class.
> - reducing the number of objects stored in the valueState.
> - reducing the size of the objects stored in the valueState.
> - disabling async snapshots.
> I can provide additional crash data as needed (core dumps, error logs, etc.).
> The StateBackend implementation doesn't matter; the job fails using the
> Memory, Fs, and RocksDB backends.
> From what I understand, anonymous classes should be avoided with checkpointing
> because their generated names aren't stable, so moving the case class out of
> the anonymous class seems like the best route for me.
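
For readers who hit the same crash, here is a minimal sketch of two of the workarounds listed above: a top-level case class used from a named filter class, and a state backend constructed with asynchronous snapshots turned off. It is written against the Flink 1.4-era Scala API. The names `Event`, `DedupFilter`, and `DedupJob`, the checkpoint path, and the checkpoint interval are all hypothetical; none of this is taken from the attached `StreamingJob.scala`.

```scala
import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

// Hypothetical event type, declared at the top level rather than inside
// an anonymous RichFilterFunction.
case class Event(id: String, payload: String)

// Named (non-anonymous) filter: keeps the first event per key, drops repeats.
class DedupFilter extends RichFilterFunction[Event] {
  @transient private var seen: ValueState[Event] = _

  override def open(parameters: Configuration): Unit = {
    val descriptor = new ValueStateDescriptor[Event]("seen-event", classOf[Event])
    seen = getRuntimeContext.getState(descriptor)
  }

  override def filter(event: Event): Boolean = {
    // value() returns null until the state has been set for the current key.
    val isFirst = seen.value() == null
    if (isFirst) seen.update(event)
    isFirst
  }
}

object DedupJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000L) // interval in ms; arbitrary value

    // The second constructor argument (`false`) disables asynchronous
    // snapshots. The checkpoint URI is a placeholder.
    env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints", false))

    env
      .fromElements(Event("a", "1"), Event("a", "2"), Event("b", "3"))
      .keyBy(_.id)            // keyed state requires a keyed stream
      .filter(new DedupFilter)
      .print()

    env.execute("dedup-sketch")
  }
}
```

Either change alone is reported above to stop the crash; the sketch combines them only for brevity. Later Flink releases changed the state backend classes, so the constructor shown here may not apply there.
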
> h4. Reproduction case
> The attached `StreamingJob.scala` file contains a minimal repro case,
> which closely aligns with my actual job configuration. Running it
> consistently crashes the JVM upon completion of the first checkpoint.
> My test runs set only two JVM options: -Xms4g -Xmx4g
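
The `[info]` prefix in the Ubuntu crash output suggests the job was launched through sbt, although the report does not say so. Assuming sbt, a `build.sbt` fragment along these lines would pass the two heap options to a forked JVM. This is a guess at the setup, not the reporter's actual build file, and the key syntax varies between sbt versions.

```scala
// build.sbt fragment (assumption: the job is started with `sbt run`).
// javaOptions only take effect when the run task forks a separate JVM.
fork in run := true
javaOptions in run ++= Seq("-Xms4g", "-Xmx4g")
```
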
> h4. Crash output
> Here's a crash captured from Ubuntu:
> {noformat}
> [info] #
> [info] # A fatal error has been detected by the Java Runtime Environment:
> [info] #
> [info] # SIGSEGV (0xb) at pc=0x00007fd192b92c1c, pid=7191, tid=0x00007fd0873f3700
> [info] #
> [info] # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
> [info] # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
> [info] # Problematic frame:
> [info] # C [libzip.so+0x5c1c]
> [info] #
> [info] # Core dump written. Default location: /home/ubuntu/flink-project/core or core.7191
> [info] #
> [info] # An error report file with more information is saved as:
> [info] # /home/XXX/flink-project/hs_err_pid7191.log
> [info] Compiled method (nm) 71547 81 n 0 java.util.zip.ZipFile::getEntry (native)
> [info] total in heap [0x00007fd17d12e290,0x00007fd17d12e600] = 880
> [info] relocation [0x00007fd17d12e3b8,0x00007fd17d12e400] = 72
> [info] main code [0x00007fd17d12e400,0x00007fd17d12e600] = 512
> [info] #
> [info] # If you would like to submit a bug report, please visit:
> [info] # http://bugreport.java.com/bugreport/crash.jsp
> [info] #
> {noformat}
> And one from macOS:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x0000000105264c48, pid=30848, tid=0x0000000000003403
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode bsd-amd64 compressed oops)
> # Problematic frame:
> # V [libjvm.dylib+0x464c48]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /Users/XXX/src/etc_flink_mwx/hs_err_pid30848.log
> [thread 30211 also had an error]
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.java.com/bugreport/crash.jsp
> # The crash happened outside the Java Virtual Machine in native code.
> # See problematic frame for where to report the bug.
> #
> Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
> {noformat}