Igniters,
I've completed development of
https://issues.apache.org/jira/browse/IGNITE-7754. TeamCity state is ok.
Please, review my changes.
Please note that it will be possible to track the time of WAL fsync on
checkpoint begin via the *walCpRecordFsyncDuration* metric in the "Checkpoint
started" message.
Also, I've created https://issues.apache.org/jira/browse/IGNITE-8057
with a description of a possible further improvement of WAL fsync on
checkpoint begin.
Best Regards,
Ivan Rakov
On 26.03.2018 23:45, Valentin Kulichenko wrote:
Ivan,
It's all good then :) Thanks!
-Val
On Mon, Mar 26, 2018 at 1:50 AM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
Val,
There's no sense in using WalMode.NONE in a production environment; it's
kept for testing and debugging purposes (including possible user activities
like capacity planning).
We already print a warning at node start in case WalMode.NONE is set:
U.quietAndWarn(log, "Started write-ahead log manager in NONE mode, persisted data may be lost in " +
    "a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.");
Best Regards,
Ivan Rakov
On 24.03.2018 1:40, Valentin Kulichenko wrote:
Dmitry,
Thanks for the clarification. So it sounds like if we fix all other modes as
we discuss here, NONE would be the only one allowing corruption. I also don't
see much sense in this and I think we should clearly state this in the doc,
as well as print out a warning if NONE mode is used. Eventually, if it's
confirmed that there are no reasonable use cases for it, we can deprecate it.
-Val
On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
Hi Val,
NONE means that the WAL is disabled and not written at all. Use of this
mode is at your own risk. It is possible that restoring state after a crash
in the middle of a checkpoint will not succeed. I do not see much sense in
it, especially in production.
BACKGROUND is a fully functional WAL mode, but allows some delay before
flush to disk.
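The guarantees discussed in this thread can be summarized in a small Java sketch. This is only an illustration, assuming the fixes proposed here are applied; the `WalGuaranteesSketch` class, its enums and the `needsWarning` helper are hypothetical names and not part of Ignite's API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical summary of the guarantees discussed in this thread; not Ignite code. */
class WalGuaranteesSketch {
    /** The two failure classes distinguished in the discussion. */
    enum Failure { PROCESS_CRASH, POWER_LOSS }

    enum Mode {
        // Full protection: storage stays consistent and the last updates survive.
        FSYNC(EnumSet.allOf(Failure.class), EnumSet.allOf(Failure.class)),
        // Assumed post-fix behavior: no corruption, but the last updates may be lost on power loss.
        LOG_ONLY(EnumSet.allOf(Failure.class), EnumSet.of(Failure.PROCESS_CRASH)),
        // Flush to disk may be delayed, so recent updates can be lost in either case (assumption).
        BACKGROUND(EnumSet.allOf(Failure.class), EnumSet.noneOf(Failure.class)),
        // WAL is not written at all, so nothing is guaranteed.
        NONE(EnumSet.noneOf(Failure.class), EnumSet.noneOf(Failure.class));

        /** Failures after which the storage can still be restored. */
        final Set<Failure> corruptionSafe;

        /** Failures after which the most recent updates are still present. */
        final Set<Failure> keepsLastUpdates;

        Mode(Set<Failure> corruptionSafe, Set<Failure> keepsLastUpdates) {
            this.corruptionSafe = corruptionSafe;
            this.keepsLastUpdates = keepsLastUpdates;
        }

        /** A mode that protects against no failure at all deserves a warning at node start. */
        boolean needsWarning() {
            return corruptionSafe.isEmpty();
        }
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values())
            System.out.println(m + ": corruptionSafe=" + m.corruptionSafe
                + ", keepsLastUpdates=" + m.keepsLastUpdates
                + ", needsWarning=" + m.needsWarning());
    }
}
```

In this reading, NONE is the only mode for which `needsWarning()` returns true.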
Sincerely,
Dmitriy Pavlov
Sat, Mar 24, 2018 at 1:07, Valentin Kulichenko <valentin.kuliche...@gmail.com>:
I agree. In my view, any possibility to get a corrupted storage is a bug
which needs to be fixed.
BTW, can someone explain the semantics of NONE mode? What is the difference
from BACKGROUND from the user's perspective? Is there any particular use
case where it can be used?
-Val
On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
Hi Ivan,
IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree?
Sincerely,
Dmitriy Pavlov
Fri, Mar 23, 2018 at 12:23, Ivan Rakov <ivan.glu...@gmail.com>:
Igniters, there's another important question about this matter.
Do we want to add extra FSYNCS for BACKGROUND WAL mode? I think that we
have to do it: it will cause a similar performance drop, but if we
consider LOG_ONLY broken without these fixes, BACKGROUND is broken as well.
Best Regards,
Ivan Rakov
On 23.03.2018 10:27, Ivan Rakov wrote:
Fixes are quite simple.
I expect them to be merged into master within a week in the worst case.
Best Regards,
Ivan Rakov
On 22.03.2018 17:49, Denis Magda wrote:
Ivan,
How quickly are you going to merge the fix into master? Many
persistence-related optimizations have already stacked up. Probably, we can
release them sooner if the community agrees.
--
Denis
On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
Thanks all!
We seem to have reached a consensus on this issue. I'll just add the
necessary fsyncs under IGNITE-7754.
Best Regards,
Ivan Rakov
On 22.03.2018 15:13, Ilya Lantukh wrote:
+1 for fixing LOG_ONLY. If the current implementation doesn't protect from
data corruption, it doesn't make sense.
On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda <dma...@apache.org> wrote:
+1 for the fix of LOG_ONLY
On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk <alexey.goncha...@gmail.com> wrote:
+1 for fixing LOG_ONLY to enforce corruption safety given the provided
performance results.
2018-03-21 18:20 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
+1 for accepting the drop in LOG_ONLY. 7% is not that much, and not a drop
at all given that we are fixing a bug. I.e. had we implemented it correctly
in the first place, we would never have noticed any "drop".
I do not understand why someone would like to use the current broken mode.
On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov....@gmail.com> wrote:
Hi, I think option 1 is better. As Val said, any mode that allows
corruption does not make much sense.
Even with the drop Ivan mentioned here, it is still a significant
performance boost in relation to the old DEFAULT mode (FSYNC now).
Sincerely,
Dmitriy Pavlov
Wed, Mar 21, 2018 at 17:56, Ivan Rakov <ivan.glu...@gmail.com>:
I've attached benchmark results to the JIRA ticket.
We observe a ~7% drop in "fair" LOG_ONLY_SAFE mode, independent of the WAL
compaction enabled flag. It's a pretty significant drop: WAL compaction
itself gives only a ~3% drop.
I see two options here:
1) Change LOG_ONLY behavior. That implies that we'll be ready to release
AI 2.5 with a 7% drop.
2) Introduce LOG_ONLY_SAFE, make it the default, and add a release note to
AI 2.5 that we added power loss durability in the default mode, but the user
may fall back to the previous LOG_ONLY in order to retain performance.
Thoughts?
Best Regards,
Ivan Rakov
On 20.03.2018 16:00, Ivan Rakov wrote:
Val,
If a storage is in corrupted state, does it mean that it needs to be
completely removed and cluster needs to be restarted without data?
Yes, there's a chance that in LOG_ONLY all local data will be lost, but
only in the *power loss / OS crash* case.
kill -9, JVM crash, death of a critical system thread and all other cases
that usually take place are variations of *process crash*. All WAL modes
(except NONE, of course) ensure corruption safety in case of process crash.
If so, I'm not sure any mode
that allows corruption makes much sense to me.
It depends on the performance impact of enforcing power-loss corruption
safety. The price of full protection from power loss is high: FSYNC is
way slower (2-10 times) than other WAL modes. The question is whether
ensuring weaker guarantees (corruption can't happen, but loss of the last
updates can) will affect performance as badly as strong guarantees.
I'll share benchmark results soon.
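The difference between process crash and power loss comes down to whether written bytes have only reached the OS page cache or have been forced to the device. A minimal Java sketch of that distinction follows, using only the standard `FileChannel` API; the `DurabilitySketch` class and the record contents are hypothetical and are not taken from Ignite's WAL implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Hypothetical illustration of write vs. fsync; not Ignite code. */
class DurabilitySketch {
    /**
     * Appends a record to the file. With sync=false the bytes are handed to the OS
     * (enough to survive a crash of this process); with sync=true they are also
     * forced to the storage device, which is what power-loss safety requires.
     */
    static void append(Path file, String record, boolean sync) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining())
                ch.write(buf);
            if (sync)
                ch.force(true); // the expensive part: flushes data and metadata to the device
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one unsynced and one synced record to a temp file and returns its lines. */
    static List<String> demo() {
        try {
            Path wal = Files.createTempFile("wal-sketch", ".log");
            try {
                append(wal, "update-1\n", false);        // fast, process-crash safe only
                append(wal, "checkpoint-begin\n", true); // slower, also power-loss safe
                return Files.readAllLines(wal, StandardCharsets.UTF_8);
            }
            finally {
                Files.delete(wal);
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Forcing only at selected points (for example around a checkpoint marker, as assumed above) rather than on every record is what makes the weaker guarantee cheaper than a full FSYNC mode.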
Best Regards,
Ivan Rakov
On 20.03.2018 5:09, Valentin Kulichenko wrote:
Guys,
What do we mean by "data corruption" here? If a storage is in
corrupted state, does it mean that it needs to be completely removed and
cluster needs to be restarted without data? If so, I'm not sure any mode
that allows corruption makes much sense to me. How am I supposed to use a
database, if virtually any failure can end with complete loss of data?
In any case, this definitely should not be the default behavior. If a user
ever switches to a corruption-unsafe mode, there should be a clear warning
about this.
-Val
On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
Ticket to track changes:
https://issues.apache.org/jira/browse/IGNITE-7754
Best Regards,
Ivan Rakov
On 16.03.2018 10:58, Dmitriy Setrakyan wrote:
On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov <ivan.glu...@gmail.com> wrote:
Vladimir,
Unlike BACKGROUND, LOG_ONLY provides strict write guarantees unless power
loss has happened.
Seems like we need to measure the performance difference to decide whether
we need a separate WAL mode. If it is invisible, we'll just fix these
bugs without introducing a new mode; if it is perceptible, we'll
continue the discussion about introducing LOG_ONLY_SAFE.
Makes sense?
Yes, this sounds like the right approach.