On Mon, Jul 2, 2018 at 2:44 PM Tom Pantelis <tompante...@gmail.com> wrote:

> On Mon, Jul 2, 2018 at 2:15 PM, Victor Pickard <vpick...@redhat.com>
> wrote:
>
>> Hi all,
>>
>> I'm looking at clustering stability. One of the jobs I've been looking at is
>> controller clustering. This is a good CSIT, in that it stops and starts ODL
>> several times during the run.
>>
>> In one of the failed test runs (sandbox, logs wiped from last week, but I do
>> have this particular karaf log archived locally), ODL is started, and REST
>> calls fail during the test. Looking at the logs, I can see why. Karaf failed
>> to start, or better yet, took a really long time to start. From the snippet
>> below, you can see about 7 mins between when Karaf launched and when it did
>> something, maybe restarted again. But the main thing is that karaf failed
>> to start in a timely manner, taking over 7 minutes to begin to start up
>> blueprints, etc.
>>
>> I ran a job that had karaf debug logging enabled with this setting:
>>
>> log4j.rootLogger=DEBUG
>>
>> This did not go very well. This generates way too much debug info, and was
>> causing timeouts and other various errors in the CSIT run.
>>
>> So, my questions are:
>>
>> 1. Has anyone seen this issue where karaf seems to hang on startup (after a
>> kill -9 on karaf pid)? If so, is this a known issue?
>>
>> 2. What debug would be needed to figure out why karaf was hanging? Note the
>> above generated a log file of ~768 MB in a very short timespan.
>
> Vic - does this happen if you gracefully shut it down?

Hi Tom,

I haven't tried that. I'm just running the controller CSIT, which does a
kill -9 on the karaf pid.

> In years past with karaf I recall corruption could occur in the bundle
> cache under data if the karaf process was killed. I don't know if that
> potential issue is still present with karaf 4. Does it clean the data dir
> before restarting? If not, it would be good to do so to be safe.

Here are the steps from the controller CSIT job for restarting ODL
(Restart Odl With Tell Based False). Looking at this, yes, the data dir is
deleted.

1. kill -9 on karaf pid
   ( 'ps axf | grep org.apache.karaf | grep -v grep | awk '{print "kill -9 " $1}' | sh' )
2. Verify karaf is not running
3. Set Tell Based to False in config file
4. Copy karaf logs to /tmp
5. Clean the following directories
   1. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/tmp/
   2. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/data/
   3. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/cache/
   4. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/snapshots/
   5. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/journal/
   6. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/opendaylight/current/
   7. rm -rf /tmp/karaf-0.8.3-SNAPSHOT/etc/host.key
6. Copy logs back to new snapshot dir, as below:
   1. mkdir -p '/tmp/karaf-0.8.3-SNAPSHOT/data' && rm -vrf '/tmp/karaf-0.8.3-SNAPSHOT/log' && mv -vf '/tmp/log' '/tmp/karaf-0.8.3-SNAPSHOT/data/'

> Other than that, we probably need to get a thread dump.
>
>> Thanks,
>>
>> Vic
>>
>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>> INFO: Installing and starting initial bundles
>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main launch
>> INFO: All initial bundles installed and set to start
>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
>> INFO: Lock acquired
>> Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>> INFO: Lock acquired. Setting startlevel to 100
>> Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch
>> INFO: Installing and starting initial bundles
>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main launch
>> INFO: All initial bundles installed and set to start
>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>> INFO: Trying to lock /tmp/karaf-0.8.3-SNAPSHOT/lock
>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.lock.SimpleFileLock lock
>> INFO: Lock acquired
>> Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
>> INFO: Lock acquired. Setting startlevel to 100

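
For spotting this kind of stall without turning on DEBUG logging, here is a
minimal sketch that scans a karaf log for the longest pause between
consecutive timestamps. It assumes the timestamp format shown in the excerpt
above; the function name largest_gap and the SAMPLE text are only
illustrative and are not part of the CSIT job.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g. "Jun 29, 2018 3:43:47 PM".
# Assumption: the whole log uses this format and an English locale.
TS_PATTERN = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M")
TS_FORMAT = "%b %d, %Y %I:%M:%S %p"


def largest_gap(log_text):
    """Return (seconds, before, after) for the longest pause between
    consecutive timestamps, or None if fewer than two are found."""
    stamps = [datetime.strptime(m, TS_FORMAT) for m in TS_PATTERN.findall(log_text)]
    if len(stamps) < 2:
        return None
    gaps = [((b - a).total_seconds(), a, b) for a, b in zip(stamps, stamps[1:])]
    return max(gaps, key=lambda g: g[0])


# A few lines taken from the excerpt above, used as sample input.
SAMPLE = """\
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jun 29, 2018 3:43:47 PM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jun 29, 2018 3:50:48 PM org.apache.karaf.main.Main launch
INFO: Installing and starting initial bundles
Jun 29, 2018 3:50:49 PM org.apache.karaf.main.Main launch
INFO: All initial bundles installed and set to start
"""

if __name__ == "__main__":
    seconds, before, after = largest_gap(SAMPLE)
    print(f"{seconds:.0f}s pause between {before} and {after}")
```

On the sample lines this reports the pause between 3:43:47 PM and 3:50:48 PM,
which is the roughly 7 minute gap described above. It only locates the stall;
finding the cause would still need something like a thread dump taken during
the pause.
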
_______________________________________________ controller-dev mailing list controller-dev@lists.opendaylight.org https://lists.opendaylight.org/mailman/listinfo/controller-dev