Dmitriy Govorukhin created IGNITE-12127:
-------------------------------------------
Summary: WAL writer may close file IO with unflushed changes when
MMAP is disabled
Key: IGNITE-12127
URL: https://issues.apache.org/jira/browse/IGNITE-12127
Project: Ignite
Issue Type: Bug
Reporter: Dmitriy Govorukhin
Assignee: Dmitriy Govorukhin
Most likely the issue manifests itself as the following critical error:
{code}
2019-08-27 14:52:31.286 ERROR 26835 --- [wal-write-worker%null-#447] ROOT :
Critical system error detected. Will be handled accordingly to configured
handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler,
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class
o.a.i.i.processors.cache.persistence.StorageException: Failed to write buffer.]]
org.apache.ignite.internal.processors.cache.persistence.StorageException:
Failed to write buffer.
at
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3444)
[ignite-core-2.5.7.jar!/:2.5.7]
at
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.body(FileWriteAheadLogManager.java:3249)
[ignite-core-2.5.7.jar!/:2.5.7]
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
[ignite-core-2.5.7.jar!/:2.5.7]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_201]
Caused by: java.nio.channels.ClosedChannelException: null
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
~[na:1.8.0_201]
at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:253)
~[na:1.8.0_201]
at
org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.position(RandomAccessFileIO.java:48)
~[ignite-core-2.5.7.jar!/:2.5.7]
at
org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator.position(FileIODecorator.java:41)
~[ignite-core-2.5.7.jar!/:2.5.7]
at
org.apache.ignite.internal.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:111)
~[ignite-core-2.5.7.jar!/:2.5.7]
at
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$WALWriter.writeBuffer(FileWriteAheadLogManager.java:3437)
[ignite-core-2.5.7.jar!/:2.5.7]
... 3 common frames omitted
{code}
It appears that there following sequence is possible:
* Thread A attempts to log a large record which does not fit segment,
{{addRecord}} fails and the thread A starts segment rollover. I successfully
runs {{flushOrWait(null)}} and gets de-scheduled before adding switch segment
record
* Thread B attempts to log another record, which fits exactly till the end of
the current segment. The record is added to the buffer
* Thread A resumes and fails to add the switch segment record. No flush is
performed and the thread immediately proceeds for wal-writer close
* WAL writer thread wakes up, sees that there is a CLOSE request, closes the
file IO and immediately proceeds to write unflushed changes causing the
exception.
Unconditional flush after switch segment record write should fix the issue.
Besides the bug itself, I suggest the following changes to the
{{FileWriteHandleImpl}} ({{FileWriteAheadLogManager}} in earlier versions):
* There is an {{fsync(filePtr)}} call inside {{close()}}; however, {{fsync()}}
checks the {{stop}} flag (which is set inside {{close}}) and returns
immediately after {{flushOrWait()}} if the flag is set - this is very
confusing. After all, the {{close()}} itself explicitly calls {{force}} after
flush
* There is an ignored IO exception in mmap mode - this should be propagated to
the failure handler
* In WAL writer, we check for file CLOSE and then attemp to write to
(possibly) the same write handle - write should be always before close
* In WAL writer, there are racy reads of current handle - it would be better
if we read the current handle once and then operate on it during the whole loop
iteration
--
This message was sent by Atlassian Jira
(v8.3.2#803003)