[
https://issues.apache.org/jira/browse/HBASE-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576543#comment-15576543
]
Enis Soztutar commented on HBASE-16824:
---------------------------------------
I've been looking into this issue, which causes frequent exceptions in the log
such as:
{code}
2016-10-14 14:20:55,253 ERROR [sync.2] wal.FSHLog$SyncRunner(636): Error syncing, request close of WAL
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1521)
    at org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1942)
    at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1887)
    at org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.sync(ProtobufLogWriter.java:79)
    at org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster$HighLatencySyncWriter.sync(TestLogRollingNoCluster.java:67)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SyncRunner.run(FSHLog.java:632)
    at java.lang.Thread.run(Thread.java:745)
2016-10-14 14:20:55,253 INFO [95] wal.AbstractFSWAL(627): Rolled WAL /user/enis/test-data/30328473-7c72-4729-b1a5-5bc085516dd5/WALs/org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster/org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster.1476480055179 with entries=6862, filesize=0 B; new WAL /user/enis/test-data/30328473-7c72-4729-b1a5-5bc085516dd5/WALs/org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster/org.apache.hadoop.hbase.regionserver.wal.TestLogRollingNoCluster.1476480055220
2016-10-14 14:20:55,254 INFO [42] wal.TestLogRollingNoCluster$Appender(177): Caught exception from Appender:42
java.nio.channels.ClosedChannelException
    at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1521)
{code}
which is also reported in
https://mail-archives.apache.org/mod_mbox/hbase-user/201603.mbox/%3c74ecffa8dc3b6847888649793c770fe0a2d67...@blreml510-mbs.china.huawei.com%3E.
It turns out the problem is in how we replace the WAL writer: we wait to reach
a safe point between the LogRoller and the RingBufferEventHandler, but there is
no coordination between the log roller and the SyncRunner threads, which can
still call writer.sync() on the writer being replaced. This results in the
above exception from HDFS, and causes some already sync'ed requests to raise
exceptions back to the client (maybe a minor correctness issue for
non-idempotent operations).
I've modified TestLogRollingNoCluster by introducing an artificial delay, and I
can reproduce this every time.
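
To illustrate the kind of coordination that is missing, here is a minimal,
self-contained sketch. It is not HBase code and not a proposed patch: the
class names (GuardedWriterSwap, ToyWriter) and the read/write-lock approach are
assumptions made purely for illustration. Sync threads hold a read lock for the
duration of sync(), and the roller takes the write lock before swapping the
writer, so the old writer is only closed once no sync can be in flight on it.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for illustration; none of these names exist in HBase.
public class GuardedWriterSwap {
  /** Toy writer that fails like a closed output stream once close() is called. */
  static final class ToyWriter {
    private volatile boolean closed;

    void sync() throws IOException {
      if (closed) {
        throw new ClosedChannelException();
      }
    }

    void close() {
      closed = true;
    }
  }

  // Fair mode so the roller is not starved by a steady stream of syncs.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
  private ToyWriter writer = new ToyWriter();

  /** Called by sync threads: the read lock is held for the whole sync call. */
  void sync() throws IOException {
    lock.readLock().lock();
    try {
      writer.sync();
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Called by the roller: waits for in-flight syncs, swaps, then closes the old writer. */
  void roll() {
    ToyWriter old;
    lock.writeLock().lock();
    try {
      old = writer;
      writer = new ToyWriter();
    } finally {
      lock.writeLock().unlock();
    }
    // No sync thread can still reference the old writer at this point.
    old.close();
  }

  /** Runs syncs concurrently with rolls and returns the number of failed syncs. */
  static int runDemo(int syncThreads, int rolls) throws InterruptedException {
    GuardedWriterSwap wal = new GuardedWriterSwap();
    AtomicInteger failures = new AtomicInteger();
    AtomicBoolean stop = new AtomicBoolean();
    Thread[] threads = new Thread[syncThreads];
    for (int i = 0; i < syncThreads; i++) {
      threads[i] = new Thread(() -> {
        while (!stop.get()) {
          try {
            wal.sync();
          } catch (IOException e) {
            failures.incrementAndGet();
          }
        }
      });
      threads[i].start();
    }
    for (int i = 0; i < rolls; i++) {
      wal.roll();
    }
    stop.set(true);
    for (Thread t : threads) {
      t.join();
    }
    return failures.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("failed syncs: " + runDemo(4, 200));
  }
}
```

A lock is only one option and would put the roller on the sync path; the point
of the sketch is just that the writer must not be closed while a sync on it may
still be running.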
> Make replacement of path the first operation during WAL rotation
> ----------------------------------------------------------------
>
> Key: HBASE-16824
> URL: https://issues.apache.org/jira/browse/HBASE-16824
> Project: HBase
> Issue Type: Bug
> Reporter: Atri Sharma
>
> In https://issues.apache.org/jira/browse/HBASE-12074, we hit an error if an
> async thread calls flush on a WAL record already closed as the WAL is being
> rotated. This JIRA investigates if setting the new WAL record path as the
> first operation during WAL rotation will fix the issue.