Ivan, sure :) Thank you for this contribution, merged to master.
вт, 27 мар. 2018 г. в 20:08, Ivan Rakov <ivan.glu...@gmail.com>: > Dmitry, > > Firstly PR contained dirty fix for performance measurement, but now it > contains good fix. :) Sorry for inconvenience. > I've renamed the PR. > > Best Regards, > Ivan Rakov > > On 27.03.2018 19:40, Dmitry Pavlov wrote: > > Hi Eduard, thank you for review. > > > > Hi Ivan, > > > > I'm confused on PR naming > > https://github.com/apache/ignite/pull/3656 > > > > Could you rename? > > > > Sincerely, > > Dmitriy Pavlov > > > > вт, 27 мар. 2018 г. в 19:38, Eduard Shangareev < > eduard.shangar...@gmail.com > >> : > >> Ivan, I have reviewed your changes, looks good. > >> > >> On Tue, Mar 27, 2018 at 2:56 PM, Ivan Rakov <ivan.glu...@gmail.com> > wrote: > >> > >>> Igniters, > >>> > >>> I've completed development of https://issues.apache.org/jira > >>> /browse/IGNITE-7754. TeamCity state is ok. Please, review my changes. > >>> Please note that it will be possible to track time of WAL fsync on > >>> checkpoint begin by *walCpRecordFsyncDuration *metric in "Checkpoint > >>> started" message. > >>> > >>> Also, I've created https://issues.apache.org/jira/browse/IGNITE-8057 > >> with > >>> description of possible further improvement of WAL fsync on checkpoint > >>> begin. > >>> > >>> Best Regards, > >>> Ivan Rakov > >>> > >>> > >>> On 26.03.2018 23:45, Valentin Kulichenko wrote: > >>> > >>>> Ivan, > >>>> > >>>> It's all good then :) Thanks! > >>>> > >>>> -Val > >>>> > >>>> On Mon, Mar 26, 2018 at 1:50 AM, Ivan Rakov <ivan.glu...@gmail.com> > >>>> wrote: > >>>> > >>>> Val, > >>>>> There's no any sense to use WalMode.NONE in production environment, > >> it's > >>>>> kept for testing and debugging purposes (including possible user > >>>>> activities > >>>>> like capacity planning). > >>>>> We already print a warning at node start in case WalMode.NONE is set: > >>>>> > >>>>> U.quietAndWarn(log,"Started write-ahead log manager in NONE mode, > >>>>> > >>>>>> persisted data may be lost in " + > >>>>>> "a case of unexpected node failure. Make sure to deactivate > the > >>>>>> cluster before shutdown."); > >>>>>> > >>>>>> Best Regards, > >>>>> Ivan Rakov > >>>>> > >>>>> > >>>>> On 24.03.2018 1:40, Valentin Kulichenko wrote: > >>>>> > >>>>> Dmitry, > >>>>>> Thanks for clarification. So it sounds like if we fix all other > modes > >> as > >>>>>> we > >>>>>> discuss here, NONE would be the only one allowing corruption. I also > >>>>>> don't > >>>>>> see much sense in this and I think we should clearly state this in > the > >>>>>> doc, > >>>>>> as well print out a warning if NONE mode is used. Eventually, if > it's > >>>>>> confirmed that there are no reasonable use cases for it, we can > >>>>>> deprecate > >>>>>> it. > >>>>>> > >>>>>> -Val > >>>>>> > >>>>>> On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov < > dpavlov....@gmail.com > >>>>>> wrote: > >>>>>> > >>>>>> Hi Val, > >>>>>> > >>>>>>> NONE means that the WAL log is disabled and not written at all. Use > >> of > >>>>>>> the > >>>>>>> mode is at your own risk. It is possible that restore state after > the > >>>>>>> crash > >>>>>>> at the middle of checkpoint will not succeed. I do not see much > sence > >>>>>>> in > >>>>>>> it, especially in production. > >>>>>>> > >>>>>>> BACKGROUND is full functional WAL mode, but allows some delay > before > >>>>>>> flush > >>>>>>> to disk. > >>>>>>> > >>>>>>> Sincerely, > >>>>>>> Dmitriy Pavlov > >>>>>>> > >>>>>>> сб, 24 мар. 2018 г. в 1:07, Valentin Kulichenko < > >>>>>>> valentin.kuliche...@gmail.com>: > >>>>>>> > >>>>>>> I agree. In my view, any possibility to get a corrupted storage is > a > >>>>>>> bug > >>>>>>> > >>>>>>>> which needs to be fixed. > >>>>>>>> > >>>>>>>> BTW, can someone explain semantics of NONE mode? What is the > >>>>>>>> difference > >>>>>>>> from BACKGROUND from user's perspective? Is there any particular > use > >>>>>>>> case > >>>>>>>> where it can be used? > >>>>>>>> > >>>>>>>> -Val > >>>>>>>> > >>>>>>>> On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov < > >> dpavlov....@gmail.com > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>> Hi Ivan, > >>>>>>>> > >>>>>>>>> IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree? > >>>>>>>>> > >>>>>>>>> Sincerely, > >>>>>>>>> Dmitriy Pavlov > >>>>>>>>> > >>>>>>>>> пт, 23 мар. 2018 г. в 12:23, Ivan Rakov <ivan.glu...@gmail.com>: > >>>>>>>>> > >>>>>>>>> Igniters, there's another important question about this matter. > >>>>>>>>> > >>>>>>>>>> Do we want to add extra FSYNCS for BACKGROUND WAL mode? I think > >> that > >>>>>>>>>> we > >>>>>>>> have to do it: it will cause similar performance drop, but if we > >>>>>>>> > >>>>>>>>> consider LOG_ONLY broken without these fixes, BACKGROUND is > broken > >> as > >>>>>>>>>> well. > >>>>>>>>> Best Regards, > >>>>>>>>>> Ivan Rakov > >>>>>>>>>> > >>>>>>>>>> On 23.03.2018 10:27, Ivan Rakov wrote: > >>>>>>>>>> > >>>>>>>>>> Fixes are quite simple. > >>>>>>>>>>> I expect them to be merged in master in a week in worst case. > >>>>>>>>>>> > >>>>>>>>>>> Best Regards, > >>>>>>>>>>> Ivan Rakov > >>>>>>>>>>> > >>>>>>>>>>> On 22.03.2018 17:49, Denis Magda wrote: > >>>>>>>>>>> > >>>>>>>>>>> Ivan, > >>>>>>>>>>>> How quick are you going to merge the fix into the master? Many > >>>>>>>>>>>> persistence > >>>>>>>>>>>> related optimizations have already stacked up. Probably, we > can > >>>>>>>>>>>> > >>>>>>>>>>>> release > >>>>>>>>>> them sooner if the community agrees. > >>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>>> Denis > >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov < > >>>>>>>>>>>> > >>>>>>>>>>>> ivan.glu...@gmail.com> > >>>>>>>>>> wrote: > >>>>>>>>> Thanks all! > >>>>>>>>>>>>> We seem to have reached a consensus on this issue. I'll just > >> add > >>>>>>>>>>>>> necessary > >>>>>>>>>>>>> fsyncs under IGNITE-7754. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Best Regards, > >>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 22.03.2018 15:13, Ilya Lantukh wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> +1 for fixing LOG_ONLY. If current implementation doesn't > >>>>>>>>>>>>> protect > >>>>>>>>>>>>> > >>>>>>>>>>>> from > >>>>>>>>> data > >>>>>>>>>>> corruption, it doesn't make sence. > >>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda < > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> dma...@apache.org> > >>>>>>>>>>>> wrote: > >>>>>>>>> +1 for the fix of LOG_ONLY > >>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk < > >>>>>>>>>>>>>>> alexey.goncha...@gmail.com> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> +1 for fixing LOG_ONLY to enforce corruption safety given > the > >>>>>>>>>>>>>>> provided > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> performance results. > >>>>>>>>>>>>>>>> 2018-03-21 18:20 GMT+03:00 Vladimir Ozerov < > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> voze...@gridgain.com > >>>>>>>>>>>>>> : > >>>>>>>>>> +1 for accepting drop in LOG_ONLY. 7% is not that much and > >>>>>>>>>>>> not a > >>>>>>>>>>>>>> drop > >>>>>>>>> at > >>>>>>>>>>>>>>>> all, provided that we fixing a bug. I.e. should we > implement > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> it > >>>>>>>>>>>>>> correctly > >>>>>>>>> in the first place we would never notice any "drop". > >>>>>>>>>>>>>>>> I do not understand why someone would like to use current > >>>>>>>>>>>>>>>>> broken > >>>>>>>>>>>>>>> mode. > >>>>>>>>>> On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov > >>>>>>>>>>>>>>>>> <dpavlov....@gmail.com> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Hi, I think option 1 is better. As Val said any mode that > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> allows > >>>>>>>>>>>>>>> corruption > >>>>>>>>>> does not make much sense. > >>>>>>>>>>>>>>>>>> What Ivan mentioned here as drop, in relation to old > mode > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DEFAULT > >>>>>>>>>>>>>>>> (FSYNC > >>>>>>>>>>> now), is still significant perfromance boost. > >>>>>>>>>>>>>>>>> Sincerely, > >>>>>>>>>>>>>>>>>> Dmitriy Pavlov > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ср, 21 мар. 2018 г. в 17:56, Ivan Rakov < > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com > >>>>>>>>>>>>>>>> : > >>>>>>>>>> I've attached benchmark results to the JIRA ticket. > >>>>>>>>>>>> We observe ~7% drop in "fair" LOG_ONLY_SAFE mode, > >>>>>>>>>>>>>>>>>>> independent > >>>>>>>>>>>>>>>>> of > >>>>>>>>> WAL > >>>>>>>>>>> compaction enabled flag. It's pretty significant drop: WAL > >>>>>>>>>>>>>>>>> compaction > >>>>>>>>>>>>>>>>> itself gives only ~3% drop. > >>>>>>>>>>>>>>>>> I see two options here: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> 1) Change LOG_ONLY behavior. That implies that we'll be > >>>>>>>>>>>>>>>>>>> ready > >>>>>>>>>>>>>>>>> to > >>>>>>>>> release > >>>>>>>>>>> AI 2.5 with 7% drop. > >>>>>>>>>>>>>>>>>> 2) Introduce LOG_ONLY_SAFE, make it default, add release > >>>>>>>>>>>>>>>>>>> note > >>>>>>>>>>>>>>>>> to AI > >>>>>>>>> 2.5 > >>>>>>>>>>>>>>>>>> that we added power loss durability in default mode, but > >>>>>>>>>>>>>>>>> user > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> may > >>>>>>>>>>>>>>> fallback to previous LOG_ONLY in order to retain > >>>>>>>>>> performance. > >>>>>>>>>>>>>>>>> Thoughts? > >>>>>>>>> Best Regards, > >>>>>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 20.03.2018 16:00, Ivan Rakov wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Val, > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> If a storage is in > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be > >>>>>>>>>>>>>>>>>>>>> completely > >>>>>>>>>>>>>>>>>>> removed > >>>>>>>>>> and > >>>>>>>>>>>>>>>>>> cluster needs to be restarted without data? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Yes, there's a chance that in LOG_ONLY all local data > >> will > >>>>>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>>> lost, > >>>>>>>>>> but only in *power loss**/ OS crash* case. > >>>>>>>>>>>>>>>>> kill -9, JVM crash, death of critical system thread and > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>>>>> other > >>>>>>>>> cases that usually take place are variations of *process > >>>>>>>>>>>>>>>>>>>> crash*. > >>>>>>>>>>>>>>>>>> All > >>>>>>>>>>> WAL modes (except NONE, of course) ensure corruption-safety > >>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>> case > >>>>>>>>> of > >>>>>>>>>>>>>>>>> process crash. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> If so, I'm not sure any mode > >>>>>>>>>>>>>>>>>>>> that allows corruption makes much sense to me. > >>>>>>>>>>>>>>>>>>>>> It depends on performance impact of enforcing > >> power-loss > >>>>>>>>>>>>>>>>>>>> corruption > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> safety. Price of full protection from power loss is > high > >> - > >>>>>>>>>>>>>>>>> FSYNC > >>>>>>>>>>>>>> is > >>>>>>>>> way slower (2-10 times) than other WAL modes. The question is > >>>>>>>>>>>>>>>>> whether > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ensuring weaker guarantees (corruption can't happen, but > >>>>>>>>>>>>>>>>>> loss > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>> last > >>>>>>>>>> updates can) will affect performance as badly as strong > >>>>>>>>>>>>>>>>>> guarantees. > >>>>>>>>>>>>>>>>>> I'll share benchmark results soon. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Best Regards, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>>>>> On 20.03.2018 5:09, Valentin Kulichenko wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Guys, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> What do we understand under "data corruption" here? > If > >> a > >>>>>>>>>>>>>>>>>>>>> storage > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be > >> completely > >>>>>>>>>>>>>>>>>> removed > >>>>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> cluster needs to be restarted without data? If so, I'm > not > >>>>>>>>>>>>>>>>>> sure > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> any > >>>>>>>>>> mode > >>>>>>>>>>>>>>>>>> that allows corruption makes much sense to me. How am I > >>>>>>>>>>>>>>>>>> supposed > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>> use a > >>>>>>>>>>>>>>>>>> database, if virtually any failure can end with complete > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> loss of > >>>>>>>>>>>>>>>>>>>>> data? > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> In any case, this definitely should not be a default > >>>>>>>>>>>>>>>>>> behavior. > >>>>>>>>>>>>>>>> If > >>>>>>>>> user ever > >>>>>>>>>>>>>>>>>> switches to corruption-unsafe mode, there should be a > >>>>>>>>>>>>>>>>>> clear > >>>>>>>>>>>>>>>>>>> warning > >>>>>>>>> about > >>>>>>>>>>>>>>>>>> this. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> -Val > >>>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov < > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> Ticket to track changes: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-7754 > >>>>>>>>>>>>>>>>>>>>>> Best Regards, > >>>>>>>>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On 16.03.2018 10:58, Dmitriy Setrakyan wrote: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov < > >>>>>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> Vladimir, > >>>>>>>>>>>>>>>>>>>> Unlike BACKGROUND, LOG_ONLY provides strict write > >>>>>>>>>>>>>>>>>>>>>>> guarantees > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> unless power > >>>>>>>>>>> loss has happened. > >>>>>>>>>>>>>>>>>>>>>>>> Seems like we need to measure performance > difference > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>>>>> decide > >>>>>>>>> whether do > >>>>>>>>>>>>>>>>>>>>> we need separate WAL mode. If it will be invisible, > >>>>>>>>>>>>>>>>>> we'll > >>>>>>>>>>>>>>>>>>>>>> just > >>>>>>>>>> fix > >>>>>>>>>>>>>>>>>>>>> these > >>>>>>>>>>>>>>>>>> bugs without introducing new mode; if it will be > >>>>>>>>>>>>>>>>>>>> perceptible, > >>>>>>>>>>>>>>>>>>>>>>>> we'll > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> continue the discussion about introducing > >>>>>>>>>>>>>>>>>>>>>> LOG_ONLY_SAFE. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Makes sense? > >>>>>>>>>>>>>>>>>>>> Yes, this sounds like the right approach. > >>>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> > >