Kevin Pullin created FLINK-8309:
-----------------------------------

             Summary: JVM sigsegv crash when enabling async checkpoints
                 Key: FLINK-8309
                 URL: https://issues.apache.org/jira/browse/FLINK-8309
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
    Affects Versions: 1.3.2, 1.4.0
         Environment: macOS 10.13.2 & Ubuntu 16.04.03 using JVM 1.8.0_151.
            Reporter: Kevin Pullin
         Attachments: StreamingJob.scala

h4. Summary

I have a streaming job with async checkpointing enabled. The job is crashing 
the JVM with a SIGSEGV error coinciding with checkpoint completion.

Workarounds are noted below. I thought this was worth documenting in case 
someone runs into similar issues or if a fix is possible.

h4. Job Overview & Observations

The job itself stores a large quantity of `case class` objects in `valueState`s 
contained within a `RichFilterFunction`. This data is used for deduplicating 
events.

The crash stops after any one of these changes:
 - moving the case class outside of the anonymous RichFilterFunction class.
 - reducing the number of objects stored in the valueState.
 - reducing the size of the objects stored in the valueState.
 - disabling async snapshots.
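
For anyone hitting the same crash, here is a rough sketch of two of the workarounds above combined: the case class and the filter are declared as named top-level classes, and the state backend is constructed with asynchronous snapshots turned off. This is not the attached `StreamingJob.scala`; the names `Event`, `DedupFilter`, `DedupJobSketch` and the checkpoint path are made up for illustration, and I'm assuming the Flink 1.4 Scala streaming API.

```scala
import org.apache.flink.api.common.functions.RichFilterFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

// Hypothetical event type, declared at the top level instead of inside an
// anonymous RichFilterFunction.
case class Event(id: String, payload: String)

// Named (non-anonymous) filter that lets through only the first event per key.
class DedupFilter extends RichFilterFunction[Event] {
  @transient private var seen: ValueState[Event] = _

  override def open(parameters: Configuration): Unit = {
    val descriptor =
      new ValueStateDescriptor[Event]("seen", createTypeInformation[Event])
    seen = getRuntimeContext.getState(descriptor)
  }

  override def filter(event: Event): Boolean = {
    // value() is null until something has been stored for the current key.
    val isNew = seen.value() == null
    if (isNew) seen.update(event)
    isNew
  }
}

object DedupJobSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000L) // checkpoint every 10 seconds

    // The second constructor argument disables asynchronous snapshots.
    // The checkpoint directory is a placeholder.
    env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints", false))

    env
      .fromElements(Event("a", "1"), Event("a", "2"), Event("b", "3"))
      .keyBy(_.id) // ValueState requires a keyed stream
      .filter(new DedupFilter)
      .print()

    env.execute("dedup-sketch")
  }
}
```

The bounded `fromElements` source only keeps the sketch short; it will likely finish before any checkpoint runs, so this illustrates the shape of the workaround and is not a reproduction of the crash.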

I can provide additional crash data as needed (core dumps, error logs, etc).  
The StateBackend implementation doesn't matter; the job fails using the Memory, 
Fs, and RocksDB backends.

From what I understand, anonymous classes should be avoided with checkpointing 
as the name isn't stable, so that seems like the best route for me.

h4. Reproduction case

The attached `StreamingJob.scala` file contains a minimal repro case, 
which closely aligns with my actual job configuration.  Running it consistently 
crashes the JVM upon completion of the first checkpoint.

My test runs set only two JVM options: -Xms4g -Xmx4g

h4. Crash output

Here's a crash captured from Ubuntu:

{noformat}

[info] #
[info] # A fatal error has been detected by the Java Runtime Environment:
[info] #
[info] #  SIGSEGV (0xb) at pc=0x00007fd192b92c1c, pid=7191, tid=0x00007fd0873f3700
[info] #
[info] # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
[info] # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
[info] # Problematic frame:
[info] # C  [libzip.so+0x5c1c]
[info] #
[info] # Core dump written. Default location: /home/ubuntu/flink-project/core or core.7191
[info] #
[info] # An error report file with more information is saved as:
[info] # /home/XXX/flink-project/hs_err_pid7191.log
[info] Compiled method (nm)   71547   81     n 0       java.util.zip.ZipFile::getEntry (native)
[info]  total in heap  [0x00007fd17d12e290,0x00007fd17d12e600] = 880
[info]  relocation     [0x00007fd17d12e3b8,0x00007fd17d12e400] = 72
[info]  main code      [0x00007fd17d12e400,0x00007fd17d12e600] = 512
[info] #
[info] # If you would like to submit a bug report, please visit:
[info] #   http://bugreport.java.com/bugreport/crash.jsp
[info] #
{noformat}

And one from macOS:

{noformat}

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000105264c48, pid=30848, tid=0x0000000000003403
#
# JRE version: Java(TM) SE Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.151-b12 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.dylib+0x464c48]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/XXX/src/etc_flink_mwx/hs_err_pid30848.log
[thread 30211 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
