[ 
https://issues.apache.org/jira/browse/HBASE-15537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228192#comment-15228192
 ] 

Duo Zhang commented on HBASE-15537:
-----------------------------------

For master, I ran {{TestNamespaceCommands}} locally. It ran really slowly, and 
I haven't found the reason yet.

For branch-1, TestFailedAppendAndSync failed because of a timeout. Looking at 
the log, there must be some corner cases that are not handled.

This is the output of the failed test:
https://builds.apache.org/job/PreCommit-HBASE-Build/1305/testReport/org.apache.hadoop.hbase.regionserver/TestFailedAppendAndSync/testLockupAroundBadAssignSync/
{noformat}
2016-04-06 11:04:38,070 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error 
syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
        at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,071 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll 
requested
2016-04-06 11:04:38,071 DEBUG [Time-limited test] regionserver.HRegion(3842): 
rollbackMemstore rolled back 1
2016-04-06 11:04:38,148 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error 
syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
        at java.lang.Thread.run(Thread.java:745)
2016-04-06 11:04:38,151 INFO  [Thread-4] wal.FSHLog(870): Rolled WAL 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940677946
 with entries=1, filesize=255 B; new WAL 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459940678071
2016-04-06 11:09:35,215 INFO  [main] regionserver.TestFailedAppendAndSync(93): 
Cleaning test directory: 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/hbase-server/target/test-data/3b0ad6d4-bf70-4159-8463-9c5accf75071
{noformat}

You can see that the WAL roll succeeded (we expected an abort here, caused by 
a failed WAL roll). For comparison, this is the typical log:
{noformat}
2016-04-06 20:20:21,352 ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error 
syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
        at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,353 DEBUG [Time-limited test] regionserver.HRegion(3842): 
rollbackMemstore rolled back 1
2016-04-06 20:20:21,354 DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll 
requested
2016-04-06 20:20:21,378 ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error 
syncing, request close of WAL
java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
        at java.lang.Thread.run(Thread.java:745)
2016-04-06 20:20:21,378 ERROR [Thread-4] wal.FSHLog(881): Failed close of WAL 
writer 
/home/zhangduo/hbase/code/hbase-server/target/test-data/dba1afc0-933c-4ac0-ad0c-1688e8e152b5/TestHRegiontestLockupAroundBadAssignSync/testLockupAroundBadAssignSync/wal.1459945205555,
 unflushedEntries=7
org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: 
java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:1615)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.replaceWriter(FSHLog.java:833)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:699)
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog.rollWriter(TestFailedAppendAndSync.java:122)
        at 
org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:148)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: FAKE! Failed to replace a bad datanode...
        at 
org.apache.hadoop.hbase.regionserver.TestFailedAppendAndSync$1DodgyFSLog$1.sync(TestFailedAppendAndSync.java:139)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:1235)
        ... 1 more
2016-04-06 20:21:07,405 INFO  [Thread-4] regionserver.LogRoller(176): LogRoller 
exiting.
{noformat}

You can see that the second sync error causes a 
FailedSyncBeforeLogCloseException and triggers an abort.
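
To tell the two outcomes apart in other build logs, something like the 
following could be used. It is only a rough sketch: the class 
{{WalRollOutcome}}, its enum and the {{classify}} method are made up for 
illustration and are not part of HBase. It just matches the message text seen 
in the two logs above, and it assumes each log entry is on a single line (not 
wrapped as in this mail). The sample lines in {{main}} are abbreviated from 
those logs.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of HBase: classifies a log by what happened after a sync error. */
public class WalRollOutcome {

    enum Outcome { NO_SYNC_ERROR, ROLL_FAILED, ROLL_SUCCEEDED, NO_ROLL_SEEN }

    /** Returns the first roll result that appears after an "Error syncing" line. */
    static Outcome classify(List<String> logLines) {
        boolean sawSyncError = false;
        for (String line : logLines) {
            if (line.contains("Error syncing, request close of WAL")) {
                sawSyncError = true;
            } else if (sawSyncError && line.contains("Failed close of WAL writer")) {
                // The roll failed, which is what the test expects (abort path).
                return Outcome.ROLL_FAILED;
            } else if (sawSyncError && line.contains("Rolled WAL")) {
                // The roll went through even though a sync had failed.
                return Outcome.ROLL_SUCCEEDED;
            }
        }
        return sawSyncError ? Outcome.NO_ROLL_SEEN : Outcome.NO_SYNC_ERROR;
    }

    public static void main(String[] args) {
        // Abbreviated from the Jenkins log (timestamps and paths dropped).
        List<String> jenkins = Arrays.asList(
            "ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL",
            "DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested",
            "ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL",
            "INFO  [Thread-4] wal.FSHLog(870): Rolled WAL");
        // Abbreviated from the typical log.
        List<String> typical = Arrays.asList(
            "ERROR [sync.2] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL",
            "DEBUG [Thread-4] regionserver.LogRoller(139): WAL roll requested",
            "ERROR [sync.3] wal.FSHLog$SyncRunner(1239): Error syncing, request close of WAL",
            "ERROR [Thread-4] wal.FSHLog(881): Failed close of WAL writer");

        System.out.println(classify(jenkins)); // prints ROLL_SUCCEEDED
        System.out.println(classify(typical)); // prints ROLL_FAILED
    }
}
```

This only labels the symptom. It does not explain why the roll succeeded in 
the Jenkins run.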

I cannot reproduce it locally right now. Should we open an issue for it? This 
may be a data loss issue... [~stack]

Thanks.

> Make multi WAL work with WALs other than FSHLog
> -----------------------------------------------
>
>                 Key: HBASE-15537
>                 URL: https://issues.apache.org/jira/browse/HBASE-15537
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>             Fix For: 2.0.0, 1.3.0, 1.4.0
>
>         Attachments: HBASE-15537-branch-1.patch, HBASE-15537-v3.patch, 
> HBASE-15537-v4.patch, HBASE-15537-v5.patch, HBASE-15537-v6.patch, 
> HBASE-15537.patch, HBASE-15537_v2.patch
>
>
> The multi WAL should not be bound with {{FSHLog}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
