Re: Reconsider default WAL mode: we need something between LOG_ONLY and FSYNC

Valentin Kulichenko Mon, 26 Mar 2018 13:46:12 -0700

Ivan,

It's all good then :) Thanks!


-Val

On Mon, Mar 26, 2018 at 1:50 AM, Ivan Rakov <ivan.glu...@gmail.com> wrote:

> Val,
>
> There's no sense in using WalMode.NONE in a production environment; it's
> kept for testing and debugging purposes (including possible user activities
> like capacity planning).
> We already print a warning at node start in case WalMode.NONE is set:
>
> U.quietAndWarn(log, "Started write-ahead log manager in NONE mode, persisted data may be lost in " +
>     "a case of unexpected node failure. Make sure to deactivate the cluster before shutdown.");
>
> Best Regards,
> Ivan Rakov
>
>
> On 24.03.2018 1:40, Valentin Kulichenko wrote:
>
>> Dmitry,
>>
>> Thanks for the clarification. So it sounds like if we fix all other modes
>> as we discuss here, NONE would be the only one allowing corruption. I also
>> don't see much sense in this, and I think we should clearly state this in
>> the doc, as well as print out a warning if NONE mode is used. Eventually,
>> if it's confirmed that there are no reasonable use cases for it, we can
>> deprecate it.
>>
>> -Val
>>
>> On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov <dpavlov....@gmail.com>
>> wrote:
>>
>> Hi Val,
>>>
>>> NONE means that the WAL is disabled and not written at all. Use this mode
>>> at your own risk: restoring state after a crash in the middle of a
>>> checkpoint may not succeed. I do not see much sense in it, especially in
>>> production.
>>>
>>> BACKGROUND is a fully functional WAL mode, but allows some delay before
>>> flush to disk.
>>>
>>> Sincerely,
>>> Dmitriy Pavlov
>>>
>>> Sat, Mar 24, 2018 at 1:07, Valentin Kulichenko <
>>> valentin.kuliche...@gmail.com>:
>>>
>>>> I agree. In my view, any possibility to get a corrupted storage is a bug
>>>> which needs to be fixed.
>>>>
>>>> BTW, can someone explain the semantics of NONE mode? What is the difference
>>>> from BACKGROUND from the user's perspective? Is there any particular use
>>>> case where it can be used?
>>>>
>>>> -Val
>>>>
>>>> On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov <dpavlov....@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Ivan,
>>>>>
>>>>> IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree?
>>>>>
>>>>> Sincerely,
>>>>> Dmitriy Pavlov
>>>>>
>>>>> Fri, Mar 23, 2018 at 12:23, Ivan Rakov <ivan.glu...@gmail.com>:
>>>>>
>>>>>> Igniters, there's another important question about this matter.
>>>>>> Do we want to add extra FSYNCS for BACKGROUND WAL mode? I think that we
>>>>>> have to do it: it will cause a similar performance drop, but if we
>>>>>> consider LOG_ONLY broken without these fixes, BACKGROUND is broken as
>>>>>> well.
>>>>>
>>>>>> Best Regards,
>>>>>> Ivan Rakov
>>>>>>
>>>>>> On 23.03.2018 10:27, Ivan Rakov wrote:
>>>>>>
>>>>>>> Fixes are quite simple.
>>>>>>> I expect them to be merged in master in a week in worst case.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Ivan Rakov
>>>>>>>
>>>>>>> On 22.03.2018 17:49, Denis Magda wrote:
>>>>>>>
>>>>>>>> Ivan,
>>>>>>>>
>>>>>>>> How quickly are you going to merge the fix into master? Many
>>>>>>>> persistence-related optimizations have already stacked up. Probably, we
>>>>>>>> can release them sooner if the community agrees.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Denis
>>>>>>>>
>>>>>>>> On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov <ivan.glu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks all!
>>>>>>>>> We seem to have reached a consensus on this issue. I'll just add
>>>>>>>>> necessary
>>>>>>>>> fsyncs under IGNITE-7754.
>>>>>>>>>
>>>>>>>>> Best Regards,
>>>>>>>>> Ivan Rakov
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22.03.2018 15:13, Ilya Lantukh wrote:
>>>>>>>>>
>>>>>>>>>> +1 for fixing LOG_ONLY. If the current implementation doesn't protect
>>>>>>>>>> from data corruption, it doesn't make sense.
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda <dma...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1 for the fix of LOG_ONLY
>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk <
>>>>>>>>>>> alexey.goncha...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 for fixing LOG_ONLY to enforce corruption safety given the
>>>>>>>>>>>> provided performance results.
>>>>>>>>>>>>
>>>>>>>>>>>> 2018-03-21 18:20 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1 for accepting the drop in LOG_ONLY. 7% is not that much, and not a
>>>>>>>>>>>>> drop at all, given that we are fixing a bug. I.e. had we implemented it
>>>>>>>>>>>>> correctly in the first place, we would never have noticed any "drop".
>>>>>>>>>>>>>
>>>>>>>>>>>>> I do not understand why someone would like to use the current broken
>>>>>>>>>>>>> mode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov
>>>>>>>>>>>>> <dpavlov....@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, I think option 1 is better. As Val said, any mode that allows
>>>>>>>>>>>>>> corruption does not make much sense.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What Ivan mentioned here as a drop is still a significant performance
>>>>>>>>>>>>>> boost relative to the old DEFAULT mode (FSYNC now).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>>>> Dmitriy Pavlov
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wed, Mar 21, 2018 at 17:56, Ivan Rakov <ivan.glu...@gmail.com>:
>>>>>>
>>>>>>>>>>>>>>> I've attached benchmark results to the JIRA ticket.
>>>>>>>>>>>>>>> We observe ~7% drop in "fair" LOG_ONLY_SAFE mode, independent of WAL
>>>>>>>>>>>>>>> compaction enabled flag. It's pretty significant drop: WAL compaction
>>>>>>>>>>>>>>> itself gives only ~3% drop.
>>>>>>>>>>>>>>> I see two options here:
>>>>>>>>>>>>>>> 1) Change LOG_ONLY behavior. That implies that we'll be ready to
>>>>>>>>>>>>>>> release AI 2.5 with 7% drop.
>>>>>>>>>>>>>>> 2) Introduce LOG_ONLY_SAFE, make it default, add release note to AI
>>>>>>>>>>>>>>> 2.5 that we added power loss durability in default mode, but user may
>>>>>>>>>>>>>>> fallback to previous LOG_ONLY in order to retain performance.
>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Ivan Rakov
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 20.03.2018 16:00, Ivan Rakov wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Val,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If a storage is in corrupted state, does it mean that it needs to be
>>>>>>>>>>>>>>>>> completely removed and cluster needs to be restarted without data?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, there's a chance that in LOG_ONLY all local data will be lost,
>>>>>>>>>>>>>>>> but only in *power loss / OS crash* case.
>>>>>>>>>>>>>>>> kill -9, JVM crash, death of critical system thread and all other
>>>>>>>>>>>>>>>> cases that usually take place are variations of *process crash*. All
>>>>>>>>>>>>>>>> WAL modes (except NONE, of course) ensure corruption-safety in case
>>>>>>>>>>>>>>>> of process crash.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If so, I'm not sure any mode that allows corruption makes much sense
>>>>>>>>>>>>>>>>> to me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It depends on performance impact of enforcing power-loss corruption
>>>>>>>>>>>>>>>> safety. Price of full protection from power loss is high - FSYNC is
>>>>>>>>>>>>>>>> way slower (2-10 times) than other WAL modes. The question is whether
>>>>>>>>>>>>>>>> ensuring weaker guarantees (corruption can't happen, but loss of last
>>>>>>>>>>>>>>>> updates can) will affect performance as badly as strong guarantees.
>>>>>>>>>>>>>>>> I'll share benchmark results soon.
>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Ivan Rakov
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 20.03.2018 5:09, Valentin Kulichenko wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What do we understand under "data corruption" here? If a storage is
>>>>>>>>>>>>>>>>> in corrupted state, does it mean that it needs to be completely
>>>>>>>>>>>>>>>>> removed and cluster needs to be restarted without data? If so, I'm
>>>>>>>>>>>>>>>>> not sure any mode that allows corruption makes much sense to me. How
>>>>>>>>>>>>>>>>> am I supposed to use a database, if virtually any failure can end
>>>>>>>>>>>>>>>>> with complete loss of data?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In any case, this definitely should not be a default behavior. If
>>>>>>>>>>>>>>>>> user ever switches to corruption-unsafe mode, there should be a
>>>>>>>>>>>>>>>>> clear warning about this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Val
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov <ivan.glu...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ticket to track changes:
>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-7754
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> Ivan Rakov
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 16.03.2018 10:58, Dmitriy Setrakyan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov <ivan.glu...@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Vladimir,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Unlike BACKGROUND, LOG_ONLY provides strict write guarantees unless
>>>>>>>>>>>>>>>>>>>> power loss has happened.
>>>>>>>>>>>>>>>>>>>> Seems like we need to measure the performance difference to decide
>>>>>>>>>>>>>>>>>>>> whether we need a separate WAL mode. If it is invisible, we'll
>>>>>>>>>>>>>>>>>>>> just fix these bugs without introducing a new mode; if it is
>>>>>>>>>>>>>>>>>>>> perceptible, we'll continue the discussion about introducing
>>>>>>>>>>>>>>>>>>>> LOG_ONLY_SAFE.
>>>>>>>>>>>>>>>>>>>> Makes sense?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, this sounds like the right approach.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>
>
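
The distinction the thread keeps returning to (process crash versus power loss / OS crash) comes down to whether log bytes have only been handed to the OS or have also been forced to the storage device. The following is a minimal sketch of that difference, not Ignite code: it uses only the JDK's FileChannel, and the class name WalSyncSketch and the SyncPolicy enum are hypothetical names made up for illustration. It assumes the usual behavior that a plain write lands in the OS page cache (which survives kill -9 or a JVM crash, but not a power loss), while force(true) asks the OS to flush the data to the device, which is the expensive step discussed above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration of write-only vs. write-and-force; not Ignite's WAL. */
class WalSyncSketch {
    /** Hypothetical policy names; they do not map one-to-one onto Ignite's WAL modes. */
    enum SyncPolicy { WRITE_ONLY, WRITE_AND_FORCE }

    /** Appends one record to the log file and returns the file size afterwards. */
    static long append(Path log, byte[] record, SyncPolicy policy) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);

            // write() may be partial, so loop until the whole record is handed to the OS.
            while (buf.hasRemaining())
                ch.write(buf);

            // Without force(), the bytes may still sit only in the OS page cache.
            if (policy == SyncPolicy.WRITE_AND_FORCE)
                ch.force(true); // flush file content and metadata to the device

            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("wal-sketch", ".log");
        try {
            byte[] rec = "record\n".getBytes(StandardCharsets.UTF_8);

            long afterWrite = append(log, rec, SyncPolicy.WRITE_ONLY);
            long afterForce = append(log, rec, SyncPolicy.WRITE_AND_FORCE);

            System.out.println("size after write-only append: " + afterWrite);
            System.out.println("size after forced append:     " + afterForce);
        }
        finally {
            Files.deleteIfExists(log);
        }
    }
}
```

Both calls leave the file looking the same to a running process; the difference only shows after a power loss or OS crash, which is why the cost of the extra force() calls had to be benchmarked rather than observed functionally.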
