[
https://issues.apache.org/jira/browse/HBASE-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843245#comment-16843245
]
stack commented on HBASE-22289:
-------------------------------
Fails here:
{code}
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException:
ExecutionException The forked VM terminated without properly saying goodbye. VM
crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /testptch/hbase/hbase-server &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -enableassertions
-Dhbase.build.id=2019-05-18T18:07:33Z -Xmx2800m
-Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true
-Djava.awt.headless=true -Djdk.net.URLClassPath.disableClassPathURLCheck=true
-jar
/testptch/hbase/hbase-server/target/surefire/surefirebooter969393008550780328.jar
/testptch/hbase/hbase-server/target/surefire 2019-05-18T18-08-18_047-jvmRun2
surefire8098665633275435126tmp surefire_10648533127163597806951tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] org.apache.hadoop.hbase.regionserver.TestHRegionWithInMemoryFlush
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:494)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:441)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:293)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:245)
[ERROR] at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1194)
[ERROR] at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1022)
[ERROR] at
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:868)
[ERROR] at
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR] at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR] at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR] at
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR] at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR] at
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR] at
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR] at
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR] at
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
[ERROR] at
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:954)
[ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
[ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
[ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR] at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR] at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR] at
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[ERROR] at
org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[ERROR] at
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[ERROR] at
org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
[ERROR] Caused by:
org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM
terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /testptch/hbase/hbase-server &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -enableassertions
-Dhbase.build.id=2019-05-18T18:07:33Z -Xmx2800m
-Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true
-Djava.awt.headless=true -Djdk.net.URLClassPath.disableClassPathURLCheck=true
-jar
/testptch/hbase/hbase-server/target/surefire/surefirebooter969393008550780328.jar
/testptch/hbase/hbase-server/target/surefire 2019-05-18T18-08-18_047-jvmRun2
surefire8098665633275435126tmp surefire_10648533127163597806951tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] org.apache.hadoop.hbase.regionserver.TestHRegionWithInMemoryFlush
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:671)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:533)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter.access$600(ForkStarter.java:115)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:429)
[ERROR] at
org.apache.maven.plugin.surefire.booterclient.ForkStarter$2.call(ForkStarter.java:406)
[ERROR] at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[ERROR] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[ERROR] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[ERROR] at java.lang.Thread.run(Thread.java:748)
[ERROR] -> [Help 1]
{code}
Before this, it was in [ERROR] Tests run: 21, Failures: 0, Errors: 1, Skipped:
0, Time elapsed: 449.006 s <<< FAILURE! - in
org.apache.hadoop.hbase.util.TestFromClientSide3WoUnsafe
[ERROR]
testMultiRowMutations(org.apache.hadoop.hbase.util.TestFromClientSide3WoUnsafe)
Time elapsed: 27.04 s <<< ERROR!
org.apache.hadoop.hbase.client.RetriesExhaustedException:
... and [ERROR] Command was /bin/sh -c cd /testptch/hbase/hbase-server &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -enableassertions
-Dhbase.build.id=2019-05-18T03:38:39Z -Xmx2800m
-Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true
-Djava.awt.headless=true -Djdk.net.URLClassPath.disableClassPathURLCheck=true
-jar
/testptch/hbase/hbase-server/target/surefire/surefirebooter477464835149665861.jar
/testptch/hbase/hbase-server/target/surefire 2019-05-18T03-39-27_504-jvmRun5
surefire1809558336861907377tmp surefire_9163845162714239793346tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR]
org.apache.hadoop.hbase.replication.multiwal.TestReplicationKillMasterRSCompressedWithMultipleWAL
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException:
ExecutionException The forked VM terminated without properly saying goodbye. VM
crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /testptch/hbase/hbase-server &&
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -enableassertions
-Dhbase.build.id=2019-05-18T03:38:39Z -Xmx2800m
-Djava.security.egd=file:/dev/./urandom -Djava.net.preferIPv4Stack=true
-Djava.awt.headless=true -Djdk.net.URLClassPath.disableClassPathURLCheck=true
-jar
/testptch/hbase/hbase-server/target/surefire/surefirebooter477464835149665861.jar
/testptch/hbase/hbase-server/target/surefire 2019-05-18T03-39-27_504-jvmRun5
surefire1809558336861907377tmp surefire_9163845162714239793346tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR]
org.apache.hadoop.hbase.replication.multiwal.TestReplicationKillMasterRSCompressedWithMultipleWAL
... Let me try again.
> WAL-based log splitting resubmit threshold may result in a task being stuck
> forever
> -----------------------------------------------------------------------------------
>
> Key: HBASE-22289
> URL: https://issues.apache.org/jira/browse/HBASE-22289
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.1.0, 1.5.0
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Fix For: 2.1.5
>
> Attachments: HBASE-22289.01-branch-2.1.patch,
> HBASE-22289.02-branch-2.1.patch, HBASE-22289.03-branch-2.1.patch,
> HBASE-22289.branch-2.1.001.patch, HBASE-22289.branch-2.1.001.patch,
> HBASE-22289.branch-2.1.001.patch
>
>
> Not sure if this is handled better in procedure based WAL splitting; in any
> case it affects versions before that.
> The problem is not in ZK as such but in internal state tracking in master, it
> seems.
> Master:
> {noformat}
> 2019-04-21 01:49:49,584 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Resubmitting task
> <path>.1555831286638
> {noformat}
> worker-rs, split fails
> {noformat}
> ....
> 2019-04-21 02:05:31,774 INFO
> [RS_LOG_REPLAY_OPS-regionserver/<worker-rs>:17020-1] wal.WALSplitter:
> Processed 24 edits across 2 regions; edits skipped=457; log
> file=<path>.1555831286638, length=2156363702, corrupted=false, progress
> failed=true
> {noformat}
> Master (not sure about the delay of the acquired-message; at any rate it
> seems to detect the failure fine from this server)
> {noformat}
> 2019-04-21 02:11:14,928 INFO [main-EventThread]
> coordination.SplitLogManagerCoordination: Task <path>.1555831286638 acquired
> by <worker-rs>,17020,1555539815097
> 2019-04-21 02:19:41,264 INFO
> [master/<master>:17000.splitLogManager..Chore.1]
> coordination.SplitLogManagerCoordination: Skipping resubmissions of task
> <path>.1555831286638 because threshold 3 reached
> {noformat}
> After that this task is stuck in the limbo forever with the old worker, and
> never resubmitted.
> RS never logs anything else for this task.
> Killing the RS on the worker unblocked the task and some other server did the
> split very quickly, so seems like master doesn't clear the worker name in its
> internal state when hitting the threshold... master never restarted so
> restarting the master might have also cleared it.
> This is extracted from splitlogmanager log messages, note the times.
> {noformat}
> 2019-04-21 02:2 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20,
> ....
> 2019-04-22 11:1 1555831286638=last_update = 1555837874928 last_version = 11
> cur_worker_name = <worker-rs>,17020,1555539815097 status = in_progress
> incarnation = 3 resubmits = 3 batch = installed = 24 done = 3 error = 20}
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)