On 10/31/2017 12:22 AM, Michael Vorburger wrote:
> On Tue, Oct 31, 2017 at 12:44 AM, Jamo Luhrsen <jluhr...@gmail.com> wrote:
> > On 10/30/2017 01:29 PM, Tom Pantelis wrote:
> > > On Mon, Oct 30, 2017 at 4:25 PM, Sam Hague <sha...@redhat.com> wrote:
> > > > On Mon, Oct 30, 2017 at 3:02 PM, Tom Pantelis <tompante...@gmail.com> wrote:
> > > > > On Mon, Oct 30, 2017 at 2:49 PM, Michael Vorburger <vorbur...@redhat.com> wrote:
> > > > > > Hi Sam,
> > > > > >
> > > > > > On Mon, Oct 30, 2017 at 7:45 PM, Sam Hague <sha...@redhat.com> wrote:
> > > > > > > Stephen, Michael, Tom,
> > > > > > >
> > > > > > > do you have any ways to collect debugs when ODL crashes in CSIT?
> > > > > >
> > > > > > JVMs (almost) never "just crash" without a word... either some code does
> > > > > > java.lang.System.exit(), which you may remember we do in the CDS/Akka
> > > > > > code somewhere, or there's a bug in the JVM implementation - in which
> > > > > > case there should be one of those JVM crash log type things, a file
> > > > > > named something like hs_err_pid22607.log in the "current working"
> > > > > > directory. Where would that be on these CSIT runs, and are the CSIT JJB
> > > > > > jobs set up to preserve such JVM crash log files and copy them over to
> > > > > > logs.opendaylight.org?
> > > > >
> > > > > Akka will do System.exit() if it encounters an error serious enough for
> > > > > that, but it doesn't do it silently. However, I believe we disabled the
> > > > > automatic exiting in akka.
> > > >
> > > > Should there be any logs in ODL for this? There is nothing in the karaf
> > > > log when this happens; it literally just stops.
> > > >
> > > > The karaf.console log does say the karaf process was killed:
> > > >
> > > > /tmp/karaf-0.7.1-SNAPSHOT/bin/karaf: line 422: 11528 Killed ${KARAF_EXEC}
> > > > "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG"
> > > > -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}"
> > > > -Dkaraf.instances="${KARAF_HOME}/instances" -Dkaraf.home="${KARAF_HOME}"
> > > > -Dkaraf.base="${KARAF_BASE}" -Dkaraf.data="${KARAF_DATA}"
> > > > -Dkaraf.etc="${KARAF_ETC}" -Dkaraf.restart.jvm.supported=true
> > > > -Djava.io.tmpdir="${KARAF_DATA}/tmp"
> > > > -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties"
> > > > ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}
> > > >
> > > > In the CSIT robot files we can see the connection errors below, so ODL is
> > > > not responding to new requests. This plus the above leads us to think ODL
> > > > just died.
> > > >
> > > > [ WARN ] Retrying (Retry(total=2, connect=None, read=None, redirect=None,
> > > > status=None)) after connection broken by
> > > > 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection
> > > > object at 0x5ca2d50>: Failed to establish a new connection: [Errno 111]
> > > > Connection refused',)'
> > >
> > > That would seem to indicate something did a kill -9. As Michael said, if
> > > the JVM had crashed there would be an hs_err_pid file, and it would log a
> > > message about it.
> >
> > yeah, this is where my money is at as well. The OS must be dumping it
> > because it's misbehaving. I'll try to hack the job to start collecting
> > OS-level log info (e.g. journalctl, etc).
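(Concretely, the collection step I have in mind is roughly the sketch below --
untested, and it assumes the CSIT minions run systemd and the job user has
passwordless sudo; the $WORKSPACE/archives/ destination is just my guess at
where the job gathers its logs:

    # grab any kernel OOM-killer evidence at job teardown
    sudo journalctl -k --no-pager | grep -i -B5 -A15 'out of memory' \
        > "$WORKSPACE/archives/oom_kernel.log"
    # dmesg as a fallback, in case the journal isn't persistent
    dmesg -T | grep -i -C10 'killed process' \
        >> "$WORKSPACE/archives/oom_kernel.log"
)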
> JamO, do make sure you collect not just OS-level logs but also the JVM's
> hs_err_*.log file (if any); my bet is a JVM crash more than an OS-level
> one...

where are these hs_err_*.log files going to end up? This is such a
dragged-out process to debug. These jobs take 3+ hours and our problem only
shows up sporadically. ...sigh...
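(Wherever they land, I figure I can just sweep for them at teardown -- a
rough sketch, with the search paths being my guesses; by default the JVM
drops hs_err_pid<pid>.log into its current working directory:

    # copy any JVM crash logs into the job's archive dir (paths assumed)
    find /tmp "$WORKSPACE" -name 'hs_err_pid*.log' \
        -exec cp -v {} "$WORKSPACE/archives/" \; 2>/dev/null

Or we could pin the location up front by adding something like
-XX:ErrorFile=/tmp/hs_err_pid%p.log to the JAVA_OPTS that the karaf script
passes to the JVM.)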
But, good news: I think we've confirmed it's an OOM -- an OOM from the OS's
perspective, if I'm not mistaken. Here's what I saw in a sandbox job [a]
that just hit this:

  Out of memory: Kill process 11546 (java) score 933 or sacrifice child

(more debug output is there in the console log)

These ODL systems start with 4G, and we are setting the max memory for the
ODL java process to 2G. I don't think we see this with Carbon, which makes
me believe it's *not* some problem from outside of ODL (e.g. not a kernel
bug from when we updated the java builder image back on 10/20).

I'll keep digging at this. Ideas are welcome for things to look at.

[a] https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-pike-jamo-upstream-stateful-snat-conntrack-oxygen/7/consoleFull

> BTW: The most common fix ;) for JVM crashes often is simply upgrading to
> the latest available patch version of OpenJDK... but I'm guessing/hoping
> we run from RPM and already have the latest - or is this possibly running
> on an older JVM version package that was somehow "held back" via special
> dnf instructions, or manually installed from a ZIP, kind of thing?

These systems are built and updated periodically; the JDK is installed with
"yum install". The specific version in [a] is:

  10:57:33 Set Java version
  10:57:34 JDK default version ...
  10:57:34 openjdk version "1.8.0_144"
  10:57:34 OpenJDK Runtime Environment (build 1.8.0_144-b01)
  10:57:34 OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

Thanks,
JamO
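PS: one thing I want to pin down is the memory math: the box has 4G and the
heap is capped at 2G, but the java process's total footprint is heap plus
metaspace, thread stacks, and direct buffers, so together with everything
else running on the box it could plausibly blow past physical memory. Next
run I'll try leaving a watcher going that snapshots the process's resident
size and OOM score -- a rough sketch (the pgrep pattern and log path are my
guesses; karaf's main class should be org.apache.karaf.main.Main):

    # log the java process's RSS/VSZ and oom_score once a minute
    while PID=$(pgrep -f 'org.apache.karaf.main.Main' | head -n1); [ -n "$PID" ]; do
        { date; ps -o pid,rss,vsz -p "$PID"; cat "/proc/$PID/oom_score"; } \
            >> /tmp/mem_watch.log
        sleep 60
    done &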