[
https://issues.apache.org/jira/browse/HBASE-15537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228192#comment-15228192
]
Duo Zhang commented on HBASE-15537:
-----------------------------------
On master, I ran {{TestNamespaceCommands}} locally. It ran really slowly, and I
haven't found the reason yet.
On branch-1, {{TestFailedAppendAndSync}} failed because of a timeout. Judging
from the log, there must be some corner cases that have not been handled.
This is the output of the failed test:
https://builds.apache.org/job/PreCommit-HBASE-Build/1305/testReport/org.apache.hadoop.hbase.regionserver/TestFailedAppendAndSync/testLockupAroundBadAssignSync/
{noformat}
2016-04-06 11:04:38,070 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
    at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,071 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested
2016-04-06 11:04:38,071 DEBUG [Time-limited test] regionserver.HRegion(3842): rollbackMemstore rolled back 1
2016-04-06 11:04:38,148 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
    at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,151 INFO [Thread-4] wal.FSHLog(870): Rolled WAL /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940677946 with entries=1, filesize=255 B; new WAL /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940678071
2016-04-06 11:09:35,215 INFO [main] regionserver.TestFailedAppendAndSync(93): Cleaning test directory: /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071
{noformat}
You can see that the WAL roll succeeded (we expected an abort here, caused by
a failed WAL roll). For comparison, this is the typical log:
{noformat}
2016-04-06 20:20:21,352 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
    at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,353 DEBUG [Time-limited test] regionserver.HRegion(3842): rollbackMemstore rolled back 1
2016-04-06 20:20:21,354 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested
2016-04-06 20:20:21,378 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
    at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,378 ERROR [Thread-4] wal.FSHLog(881): Failed close of WAL writer /home/zhangduo/hbase/code/hbase-server/target/test-data/dba1afc0-933c-4ac0-ad0c-1688e8e152b5/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459945205555, unflushedEntries=7
org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1615)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:833)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:699)
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog.rollWriter(TestFailedAppendAndSync.java:122)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: FAKE! Failed to replace a bad datanode...
    at org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
    ... 1 more
2016-04-06 20:21:07,405 INFO [Thread-4] regionserver.LogRoller(176): LogRoller exiting.
{noformat}
You can see that the second sync error causes a
{{FailedSyncBeforeLogCloseException}} and triggers an abort.
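To make the expected path concrete, here is a minimal toy model of it. It is
only a sketch and not HBase code: {{WalRollSketch}}, {{ToyWal}}, {{tryRoll}}
and {{FailedSyncBeforeCloseException}} are hypothetical names, and the model
assumes the roller simply refuses to close the writer once a sync failure has
been recorded. It does not reproduce the race seen in the Jenkins run, where
the roll succeeded despite the second sync error.

```java
import java.io.IOException;

public class WalRollSketch {

    /** Hypothetical stand-in for FailedSyncBeforeLogCloseException. */
    static class FailedSyncBeforeCloseException extends IOException {
        FailedSyncBeforeCloseException(Throwable cause) {
            super(cause);
        }
    }

    /** Toy WAL that remembers the most recent sync failure. */
    static class ToyWal {
        private IOException pendingSyncFailure;

        void recordSyncFailure(IOException e) {
            pendingSyncFailure = e;
        }

        /** Refuses to close the current writer if a sync failed beforehand. */
        void rollWriter() throws IOException {
            if (pendingSyncFailure != null) {
                throw new FailedSyncBeforeCloseException(pendingSyncFailure);
            }
        }
    }

    /** Toy roller: true means the roll succeeded, false means it would abort. */
    static boolean tryRoll(ToyWal wal) {
        try {
            wal.rollWriter();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyWal wal = new ToyWal();
        System.out.println("roll ok without failure: " + tryRoll(wal));
        wal.recordSyncFailure(new IOException("FAKE! Failed to replace a bad datanode..."));
        System.out.println("roll ok after sync failure: " + tryRoll(wal));
    }
}
```

In this model the first roll reports true and the second reports false, which
corresponds to the abort the test expects.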
I cannot reproduce the failure locally right now. Should we open an issue for
it? This may be a data loss issue... [~stack]
Thanks.
> Make multi WAL work with WALs other than FSHLog
> -----------------------------------------------
>
> Key: HBASE-15537
> URL: https://issues.apache.org/jira/browse/HBASE-15537
> Project: HBase
> Issue Type: Sub-task
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Fix For: 2.0.0, 1.3.0, 1.4.0
>
> Attachments: HBASE-15537-branch-1.patch, HBASE-15537-v3.patch,
> HBASE-15537-v4.patch, HBASE-15537-v5.patch, HBASE-15537-v6.patch,
> HBASE-15537.patch, HBASE-15537_v2.patch
>
>
> The multi WAL should not be bound with {{FSHLog}}.