Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Created the second task from this discussion: IGNITE-14952. 17.06.2021, 14:26, "ткаленко кирилл" : > Created the first task from this discussion: IGNITE-14923. > > 13.05.2021, 18:37, "Stanislav Lukyanov" : >> What I mean by degradation when archive size < min is that, for example, >> historical rebalance is available for a smaller timespan than expected by >> the system design. >> It may not be an issue of course, especially for a new cluster. If >> "degradation" is the wrong word we can call it "non-steady state" :) >> In any case, I think we're on the same page. >> >>> On 11 May 2021, at 13:18, Andrey Gura wrote: >>> >>> Stan >>> >>>> If archive size is less than min or more than max then the system >>>> functionality can degrade (e.g. historical rebalance may not work as >>>> expected). >>> >>> Why does the condition "archive size is less than min" lead to system >>> degradation? Actually, the described case is a normal situation for >>> brand new clusters. >>> >>> I'm okay with the proposed minWalArchiveSize property. Looks like a >>> relatively understandable property. >>> >>> On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov >>> wrote: >>>> Discussed this with Kirill verbally. >>>> >>>> Kirill showed me that having the min threshold doesn't quite work. >>>> It doesn't work because we no longer know how much WAL we should remove >>>> if we reach getMaxWalArchiveSize. >>>> >>>> For example, say we have minWalArchiveTimespan=2 hours and >>>> maxWalArchiveSize=2GB. >>>> Say, under normal load on stable topology 2 hours of WAL use 1 GB of >>>> space. >>>> Now, say we're doing historical rebalance and reserve the WAL archive. >>>> The WAL archive starts growing and soon it occupies 2 GB. >>>> Now what? >>>> We're supposed to give up WAL reservations and start aggressively >>>> removing WAL archive. 
>>>> But it is not clear when we can stop removing WAL archive - since the last 2 >>>> hours of WAL are larger than our maxWalArchiveSize >>>> there is no meaningful point the system can use as a "minimum" WAL size. >>>> >>>> I understand the description above is a bit messy but I believe that >>>> whoever is interested in this will understand it >>>> after drawing this on paper. >>>> >>>> I'm giving up on my latest suggestion about a time-based minimum. Let's >>>> keep it simple. >>>> >>>> I suggest the minWalArchiveSize and maxWalArchiveSize properties as the >>>> solution, >>>> with the behavior as initially described by Kirill. >>>> >>>> Stan >>>> >>>>> On 7 May 2021, at 15:09, ткаленко кирилл wrote: >>>>> >>>>> Stas hello! >>>>> >>>>> I didn't quite get your last idea. >>>>> What will we do if we reach getMaxWalArchiveSize? Shall we not delete >>>>> the segment until minWalArchiveTimespan? >>>>> >>>>> 06.05.2021, 20:00, "Stanislav Lukyanov" : >>>>>> An interesting suggestion I heard today. >>>>>> >>>>>> The minWalArchiveSize property might actually be minWalArchiveTimespan >>>>>> - i.e. be a number of seconds instead of a number of bytes! >>>>>> >>>>>> I think this makes perfect sense from the user point of view. >>>>>> "I want to have WAL archive for at least N hours but I have a limit of >>>>>> M gigabytes to store it". >>>>>> >>>>>> Do we have checkpoint timestamp stored anywhere? (cp start markers?) >>>>>> Perhaps we can actually implement this? >>>>>> >>>>>> Thanks, >>>>>> Stan >>>>>> >>>>>>> On 6 May 2021, at 14:13, Stanislav Lukyanov >>>>>>> wrote: >>>>>>> >>>>>>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize >>>>>>> +1 to add a public property to replace >>>>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >>>>>>> >>>>>>> I don't like the name getWalArchiveS
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Created the first task from this discussion: IGNITE-14923. 13.05.2021, 18:37, "Stanislav Lukyanov" : > What I mean by degradation when archive size < min is that, for example, > historical rebalance is available for a smaller timespan than expected by the > system design. > It may not be an issue of course, especially for a new cluster. If > "degradation" is the wrong word we can call it "non-steady state" :) > In any case, I think we're on the same page. > >> On 11 May 2021, at 13:18, Andrey Gura wrote: >> >> Stan >> >>> If archive size is less than min or more than max then the system >>> functionality can degrade (e.g. historical rebalance may not work as >>> expected). >> >> Why does the condition "archive size is less than min" lead to system >> degradation? Actually, the described case is a normal situation for >> brand new clusters. >> >> I'm okay with the proposed minWalArchiveSize property. Looks like a >> relatively understandable property. >> >> On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov >> wrote: >>> Discussed this with Kirill verbally. >>> >>> Kirill showed me that having the min threshold doesn't quite work. >>> It doesn't work because we no longer know how much WAL we should remove if >>> we reach getMaxWalArchiveSize. >>> >>> For example, say we have minWalArchiveTimespan=2 hours and >>> maxWalArchiveSize=2GB. >>> Say, under normal load on stable topology 2 hours of WAL use 1 GB of space. >>> Now, say we're doing historical rebalance and reserve the WAL archive. >>> The WAL archive starts growing and soon it occupies 2 GB. >>> Now what? >>> We're supposed to give up WAL reservations and start aggressively removing >>> WAL archive. >>> But it is not clear when we can stop removing WAL archive - since the last 2 >>> hours of WAL are larger than our maxWalArchiveSize >>> there is no meaningful point the system can use as a "minimum" WAL size. 
>>> >>> I understand the description above is a bit messy but I believe that >>> whoever is interested in this will understand it >>> after drawing this on paper. >>> >>> I'm giving up on my latest suggestion about a time-based minimum. Let's keep >>> it simple. >>> >>> I suggest the minWalArchiveSize and maxWalArchiveSize properties as the >>> solution, >>> with the behavior as initially described by Kirill. >>> >>> Stan >>> >>>> On 7 May 2021, at 15:09, ткаленко кирилл wrote: >>>> >>>> Stas hello! >>>> >>>> I didn't quite get your last idea. >>>> What will we do if we reach getMaxWalArchiveSize? Shall we not delete the >>>> segment until minWalArchiveTimespan? >>>> >>>> 06.05.2021, 20:00, "Stanislav Lukyanov" : >>>>> An interesting suggestion I heard today. >>>>> >>>>> The minWalArchiveSize property might actually be minWalArchiveTimespan - >>>>> i.e. be a number of seconds instead of a number of bytes! >>>>> >>>>> I think this makes perfect sense from the user point of view. >>>>> "I want to have WAL archive for at least N hours but I have a limit of M >>>>> gigabytes to store it". >>>>> >>>>> Do we have checkpoint timestamp stored anywhere? (cp start markers?) >>>>> Perhaps we can actually implement this? >>>>> >>>>> Thanks, >>>>> Stan >>>>> >>>>>> On 6 May 2021, at 14:13, Stanislav Lukyanov >>>>>> wrote: >>>>>> >>>>>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize >>>>>> +1 to add a public property to replace >>>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >>>>>> >>>>>> I don't like the name getWalArchiveSize - I think it's a bit confusing >>>>>> (is it the current size? the minimal size? the target size?) >>>>>> I suggest naming the property getMinWalArchiveSize. I think that this >>>>>> is exactly what it is - the minimal size of the archive that we want to >>>>>> have. >>>>>> The archive size at all times should be between min and max. >>>>>> If archive size is less than min or more than max then the system >
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
What I mean by degradation when archive size < min is that, for example, historical rebalance is available for a smaller timespan than expected by the system design. It may not be an issue of course, especially for a new cluster. If "degradation" is the wrong word we can call it "non-steady state" :) In any case, I think we're on the same page. > On 11 May 2021, at 13:18, Andrey Gura wrote: > > Stan > >> If archive size is less than min or more than max then the system >> functionality can degrade (e.g. historical rebalance may not work as >> expected). > > Why does the condition "archive size is less than min" lead to system > degradation? Actually, the described case is a normal situation for > brand new clusters. > > I'm okay with the proposed minWalArchiveSize property. Looks like a > relatively understandable property. > > On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov > wrote: >> >> Discussed this with Kirill verbally. >> >> Kirill showed me that having the min threshold doesn't quite work. >> It doesn't work because we no longer know how much WAL we should remove if >> we reach getMaxWalArchiveSize. >> >> For example, say we have minWalArchiveTimespan=2 hours and >> maxWalArchiveSize=2GB. >> Say, under normal load on stable topology 2 hours of WAL use 1 GB of space. >> Now, say we're doing historical rebalance and reserve the WAL archive. >> The WAL archive starts growing and soon it occupies 2 GB. >> Now what? >> We're supposed to give up WAL reservations and start aggressively removing >> WAL archive. >> But it is not clear when we can stop removing WAL archive - since the last 2 >> hours of WAL are larger than our maxWalArchiveSize >> there is no meaningful point the system can use as a "minimum" WAL size. >> >> I understand the description above is a bit messy but I believe that whoever >> is interested in this will understand it >> after drawing this on paper. >> >> >> I'm giving up on my latest suggestion about a time-based minimum. Let's keep >> it simple. 
>> >> I suggest the minWalArchiveSize and maxWalArchiveSize properties as the >> solution, >> with the behavior as initially described by Kirill. >> >> Stan >> >> >>> On 7 May 2021, at 15:09, ткаленко кирилл wrote: >>> >>> Stas hello! >>> >>> I didn't quite get your last idea. >>> What will we do if we reach getMaxWalArchiveSize? Shall we not delete the >>> segment until minWalArchiveTimespan? >>> >>> 06.05.2021, 20:00, "Stanislav Lukyanov" : >>>> An interesting suggestion I heard today. >>>> >>>> The minWalArchiveSize property might actually be minWalArchiveTimespan - >>>> i.e. be a number of seconds instead of a number of bytes! >>>> >>>> I think this makes perfect sense from the user point of view. >>>> "I want to have WAL archive for at least N hours but I have a limit of M >>>> gigabytes to store it". >>>> >>>> Do we have checkpoint timestamp stored anywhere? (cp start markers?) >>>> Perhaps we can actually implement this? >>>> >>>> Thanks, >>>> Stan >>>> >>>>> On 6 May 2021, at 14:13, Stanislav Lukyanov >>>>> wrote: >>>>> >>>>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize >>>>> +1 to add a public property to replace >>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >>>>> >>>>> I don't like the name getWalArchiveSize - I think it's a bit confusing >>>>> (is it the current size? the minimal size? the target size?) >>>>> I suggest naming the property getMinWalArchiveSize. I think that this is >>>>> exactly what it is - the minimal size of the archive that we want to have. >>>>> The archive size at all times should be between min and max. >>>>> If archive size is less than min or more than max then the system >>>>> functionality can degrade (e.g. historical rebalance may not work as >>>>> expected). >>>>> I think these rules are intuitively understood from the "min" and "max" >>>>> names. >>>>> >>>>> Ilya's suggestion about throttling is great although I'd do this in a >>>>> different ticket. >>>>> >>>>> Thanks, >>>>> Stan >
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Stan > If archive size is less than min or more than max then the system > functionality can degrade (e.g. historical rebalance may not work as > expected). Why does the condition "archive size is less than min" lead to system degradation? Actually, the described case is a normal situation for brand new clusters. I'm okay with the proposed minWalArchiveSize property. Looks like a relatively understandable property. On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov wrote: > > Discussed this with Kirill verbally. > > Kirill showed me that having the min threshold doesn't quite work. > It doesn't work because we no longer know how much WAL we should remove if we > reach getMaxWalArchiveSize. > > For example, say we have minWalArchiveTimespan=2 hours and > maxWalArchiveSize=2GB. > Say, under normal load on stable topology 2 hours of WAL use 1 GB of space. > Now, say we're doing historical rebalance and reserve the WAL archive. > The WAL archive starts growing and soon it occupies 2 GB. > Now what? > We're supposed to give up WAL reservations and start aggressively removing WAL > archive. > But it is not clear when we can stop removing WAL archive - since the last 2 > hours of WAL are larger than our maxWalArchiveSize > there is no meaningful point the system can use as a "minimum" WAL size. > > I understand the description above is a bit messy but I believe that whoever > is interested in this will understand it > after drawing this on paper. > > > I'm giving up on my latest suggestion about a time-based minimum. Let's keep it > simple. > > I suggest the minWalArchiveSize and maxWalArchiveSize properties as the > solution, > with the behavior as initially described by Kirill. > > Stan > > > > On 7 May 2021, at 15:09, ткаленко кирилл wrote: > > > > Stas hello! > > > > I didn't quite get your last idea. > > What will we do if we reach getMaxWalArchiveSize? Shall we not delete the > > segment until minWalArchiveTimespan? 
> > > > 06.05.2021, 20:00, "Stanislav Lukyanov" : > >> An interesting suggestion I heard today. > >> > >> The minWalArchiveSize property might actually be minWalArchiveTimespan - > >> i.e. be a number of seconds instead of a number of bytes! > >> > >> I think this makes perfect sense from the user point of view. > >> "I want to have WAL archive for at least N hours but I have a limit of M > >> gigabytes to store it". > >> > >> Do we have checkpoint timestamp stored anywhere? (cp start markers?) > >> Perhaps we can actually implement this? > >> > >> Thanks, > >> Stan > >> > >>> On 6 May 2021, at 14:13, Stanislav Lukyanov > >>> wrote: > >>> > >>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize > >>> +1 to add a public property to replace > >>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE > >>> > >>> I don't like the name getWalArchiveSize - I think it's a bit confusing > >>> (is it the current size? the minimal size? the target size?) > >>> I suggest naming the property getMinWalArchiveSize. I think that this > >>> is exactly what it is - the minimal size of the archive that we want to > >>> have. > >>> The archive size at all times should be between min and max. > >>> If archive size is less than min or more than max then the system > >>> functionality can degrade (e.g. historical rebalance may not work as > >>> expected). > >>> I think these rules are intuitively understood from the "min" and "max" > >>> names. > >>> > >>> Ilya's suggestion about throttling is great although I'd do this in a > >>> different ticket. > >>> > >>> Thanks, > >>> Stan > >>> > >>>> On 5 May 2021, at 19:25, Maxim Muzafarov wrote: > >>>> > >>>> Hello, Kirill > >>>> > >>>> +1 for this change, however, there are too many configuration settings > >>>> that exist for the user to configure an Ignite cluster. It is better to > >>>> keep the options that we already have and fix the behaviour of the > >>>> rebalance process as you suggested. 
> >>>> > >>>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл > >>>> wrote: > >>>>> Hi Ilya! > >>>>> > >>>>> Then we can greatly reduce the user load on the clu
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Discussed this with Kirill verbally. Kirill showed me that having the min threshold doesn't quite work. It doesn't work because we no longer know how much WAL we should remove if we reach getMaxWalArchiveSize. For example, say we have minWalArchiveTimespan=2 hours and maxWalArchiveSize=2GB. Say, under normal load on stable topology 2 hours of WAL use 1 GB of space. Now, say we're doing historical rebalance and reserve the WAL archive. The WAL archive starts growing and soon it occupies 2 GB. Now what? We're supposed to give up WAL reservations and start aggressively removing WAL archive. But it is not clear when we can stop removing WAL archive - since the last 2 hours of WAL are larger than our maxWalArchiveSize there is no meaningful point the system can use as a "minimum" WAL size. I understand the description above is a bit messy but I believe that whoever is interested in this will understand it after drawing this on paper. I'm giving up on my latest suggestion about a time-based minimum. Let's keep it simple. I suggest the minWalArchiveSize and maxWalArchiveSize properties as the solution, with the behavior as initially described by Kirill. Stan > On 7 May 2021, at 15:09, ткаленко кирилл wrote: > > Stas hello! > > I didn't quite get your last idea. > What will we do if we reach getMaxWalArchiveSize? Shall we not delete the > segment until minWalArchiveTimespan? > > 06.05.2021, 20:00, "Stanislav Lukyanov" : >> An interesting suggestion I heard today. >> >> The minWalArchiveSize property might actually be minWalArchiveTimespan - >> i.e. be a number of seconds instead of a number of bytes! >> >> I think this makes perfect sense from the user point of view. >> "I want to have WAL archive for at least N hours but I have a limit of M >> gigabytes to store it". >> >> Do we have checkpoint timestamp stored anywhere? (cp start markers?) >> Perhaps we can actually implement this? 
>> >> Thanks, >> Stan >> >>> On 6 May 2021, at 14:13, Stanislav Lukyanov wrote: >>> >>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize >>> +1 to add a public property to replace >>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >>> >>> I don't like the name getWalArchiveSize - I think it's a bit confusing (is >>> it the current size? the minimal size? the target size?) >>> I suggest naming the property getMinWalArchiveSize. I think that this is >>> exactly what it is - the minimal size of the archive that we want to have. >>> The archive size at all times should be between min and max. >>> If archive size is less than min or more than max then the system >>> functionality can degrade (e.g. historical rebalance may not work as >>> expected). >>> I think these rules are intuitively understood from the "min" and "max" >>> names. >>> >>> Ilya's suggestion about throttling is great although I'd do this in a >>> different ticket. >>> >>> Thanks, >>> Stan >>> >>>> On 5 May 2021, at 19:25, Maxim Muzafarov wrote: >>>> >>>> Hello, Kirill >>>> >>>> +1 for this change, however, there are too many configuration settings >>>> that exist for the user to configure an Ignite cluster. It is better to >>>> keep the options that we already have and fix the behaviour of the >>>> rebalance process as you suggested. >>>> >>>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл wrote: >>>>> Hi Ilya! >>>>> >>>>> Then we can greatly reduce the user load on the cluster until the >>>>> rebalance is over, which can be critical for the user. >>>>> >>>>> 04.05.2021, 18:43, "Ilya Kasnacheev" : >>>>>> Hello! >>>>>> >>>>>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based >>>>>> write throttling? >>>>>> >>>>>> So we will be throttling for both checkpoint page buffer and WAL limit. >>>>>> >>>>>> Regards, >>>>>> -- >>>>>> Ilya Kasnacheev >>>>>> >>>>>> вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : >>>>>> >>>>>>> Hello everybody! 
>>>>>>> >>>>>>> At the moment, if there are partitions for the rebalance for which the >>>>>>> historical rebal
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Stas hello! I didn't quite get your last idea. What will we do if we reach getMaxWalArchiveSize? Shall we not delete the segment until minWalArchiveTimespan? 06.05.2021, 20:00, "Stanislav Lukyanov" : > An interesting suggestion I heard today. > > The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. > be a number of seconds instead of a number of bytes! > > I think this makes perfect sense from the user point of view. > "I want to have WAL archive for at least N hours but I have a limit of M > gigabytes to store it". > > Do we have checkpoint timestamp stored anywhere? (cp start markers?) > Perhaps we can actually implement this? > > Thanks, > Stan > >> On 6 May 2021, at 14:13, Stanislav Lukyanov wrote: >> >> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize >> +1 to add a public property to replace >> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >> >> I don't like the name getWalArchiveSize - I think it's a bit confusing (is >> it the current size? the minimal size? the target size?) >> I suggest naming the property getMinWalArchiveSize. I think that this is >> exactly what it is - the minimal size of the archive that we want to have. >> The archive size at all times should be between min and max. >> If archive size is less than min or more than max then the system >> functionality can degrade (e.g. historical rebalance may not work as >> expected). >> I think these rules are intuitively understood from the "min" and "max" >> names. >> >> Ilya's suggestion about throttling is great although I'd do this in a >> different ticket. >> >> Thanks, >> Stan >> >>> On 5 May 2021, at 19:25, Maxim Muzafarov wrote: >>> >>> Hello, Kirill >>> >>> +1 for this change, however, there are too many configuration settings >>> that exist for the user to configure an Ignite cluster. It is better to >>> keep the options that we already have and fix the behaviour of the >>> rebalance process as you suggested. 
>>> >>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл wrote: >>>> Hi Ilya! >>>> >>>> Then we can greatly reduce the user load on the cluster until the >>>> rebalance is over, which can be critical for the user. >>>> >>>> 04.05.2021, 18:43, "Ilya Kasnacheev" : >>>>> Hello! >>>>> >>>>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based >>>>> write throttling? >>>>> >>>>> So we will be throttling for both checkpoint page buffer and WAL limit. >>>>> >>>>> Regards, >>>>> -- >>>>> Ilya Kasnacheev >>>>> >>>>> вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : >>>>> >>>>>> Hello everybody! >>>>>> >>>>>> At the moment, if there are partitions for the rebalance for which the >>>>>> historical rebalance will be used, then we reserve segments in the WAL >>>>>> archive (we do not allow cleaning the WAL archive) until the rebalance >>>>>> for >>>>>> all cache groups is over. >>>>>> >>>>>> If a cluster is under load during the rebalance, WAL archive size may >>>>>> significantly exceed the limit set in >>>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is >>>>>> complete. This may lead to user issues and nodes may crash with the "No >>>>>> space left on device" error. >>>>>> >>>>>> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE >>>>>> by >>>>>> default 0.5, which sets the threshold (multiplied by >>>>>> getMaxWalArchiveSize) >>>>>> from which and up to which the WAL archive will be cleared, i.e. sets >>>>>> the >>>>>> size of the WAL archive that will always be on the node. I propose to >>>>>> replace this system property with the >>>>>> DataStorageConfiguration#getWalArchiveSize in bytes, the default is >>>>>> (getMaxWalArchiveSize * 0.5) as it is now. >>>>>> >>>>>> Main proposal: >>>>>> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel >>>>>> and do not allow the reservation of the WAL segments until we reach >>>>>> DataStorageConfiguration#getWalArchiveSize. 
In this case, if there is no >>>>>> segment for historical rebalance, we will automatically switch to full >>>>>> rebalance.
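The fallback in the quoted proposal can be sketched as follows. This is an illustrative model only, with hypothetical names, not actual Ignite internals: historical rebalance needs to replay WAL from some segment, and if cleanup triggered at getMaxWalArchiveSize has already deleted that segment, the node switches to full rebalance.

```java
// Illustrative sketch of the proposed fallback; names are hypothetical,
// not actual Ignite internals.
public class RebalanceModeSketch {
    enum Mode { HISTORICAL, FULL }

    /**
     * Historical rebalance replays WAL starting from segment index neededFrom.
     * If cleanup at maxWalArchiveSize removed segments before
     * earliestAvailable, fall back to full rebalance automatically.
     */
    static Mode chooseMode(long neededFrom, long earliestAvailable) {
        return neededFrom >= earliestAvailable ? Mode.HISTORICAL : Mode.FULL;
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(10, 5)); // HISTORICAL: needed segments still archived
        System.out.println(chooseMode(3, 5));  // FULL: needed segments were cleaned up
    }
}
```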
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
An interesting suggestion I heard today. The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. be a number of seconds instead of a number of bytes! I think this makes perfect sense from the user point of view. "I want to have WAL archive for at least N hours but I have a limit of M gigabytes to store it". Do we have checkpoint timestamp stored anywhere? (cp start markers?) Perhaps we can actually implement this? Thanks, Stan > On 6 May 2021, at 14:13, Stanislav Lukyanov wrote: > > +1 to cancel WAL reservation on reaching getMaxWalArchiveSize > +1 to add a public property to replace > IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE > > I don't like the name getWalArchiveSize - I think it's a bit confusing (is it > the current size? the minimal size? the target size?) > I suggest naming the property getMinWalArchiveSize. I think that this is > exactly what it is - the minimal size of the archive that we want to have. > The archive size at all times should be between min and max. > If archive size is less than min or more than max then the system > functionality can degrade (e.g. historical rebalance may not work as > expected). > I think these rules are intuitively understood from the "min" and "max" names. > > Ilya's suggestion about throttling is great although I'd do this in a > different ticket. > > Thanks, > Stan > >> On 5 May 2021, at 19:25, Maxim Muzafarov wrote: >> >> Hello, Kirill >> >> +1 for this change, however, there are too many configuration settings >> that exist for the user to configure an Ignite cluster. It is better to >> keep the options that we already have and fix the behaviour of the >> rebalance process as you suggested. >> >> On Tue, 4 May 2021 at 19:01, ткаленко кирилл wrote: >>> >>> Hi Ilya! >>> >>> Then we can greatly reduce the user load on the cluster until the rebalance >>> is over, which can be critical for the user. >>> >>> 04.05.2021, 18:43, "Ilya Kasnacheev" : >>>> Hello! 
>>>> >>>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based >>>> write throttling? >>>> >>>> So we will be throttling for both checkpoint page buffer and WAL limit. >>>> >>>> Regards, >>>> -- >>>> Ilya Kasnacheev >>>> >>>> вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : >>>> >>>>> Hello everybody! >>>>> >>>>> At the moment, if there are partitions for the rebalance for which the >>>>> historical rebalance will be used, then we reserve segments in the WAL >>>>> archive (we do not allow cleaning the WAL archive) until the rebalance for >>>>> all cache groups is over. >>>>> >>>>> If a cluster is under load during the rebalance, WAL archive size may >>>>> significantly exceed the limit set in >>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is >>>>> complete. This may lead to user issues and nodes may crash with the "No >>>>> space left on device" error. >>>>> >>>>> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by >>>>> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize) >>>>> from which and up to which the WAL archive will be cleared, i.e. sets the >>>>> size of the WAL archive that will always be on the node. I propose to >>>>> replace this system property with the >>>>> DataStorageConfiguration#getWalArchiveSize in bytes, the default is >>>>> (getMaxWalArchiveSize * 0.5) as it is now. >>>>> >>>>> Main proposal: >>>>> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel >>>>> and do not allow the reservation of the WAL segments until we reach >>>>> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no >>>>> segment for historical rebalance, we will automatically switch to full >>>>> rebalance. >
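The size/time interaction behind the minWalArchiveTimespan idea can be checked with a quick calculation (hypothetical helper and numbers, not project code). A time-based minimum is only satisfiable while the WAL produced over that timespan fits under the size cap; under rebalance load it can stop fitting, which is the conflict raised elsewhere in the thread.

```java
// Hypothetical feasibility check for a time-based minimum: it is only
// satisfiable while the WAL written over that timespan fits under the cap.
public class WalTimespanCheck {
    static boolean fitsUnderCap(long walBytesPerHour, long minHours, long maxArchiveBytes) {
        return walBytesPerHour * minHours <= maxArchiveBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        // Steady state: ~0.5 GB/hour, 2-hour minimum, 2 GB cap -> satisfiable.
        System.out.println(fitsUnderCap(gb / 2, 2, 2 * gb)); // true
        // Rebalance load: ~2 GB/hour -> the 2-hour minimum cannot fit under
        // the cap, so there is no meaningful cleanup point.
        System.out.println(fitsUnderCap(2 * gb, 2, 2 * gb)); // false
    }
}
```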
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
+1 to cancel WAL reservation on reaching getMaxWalArchiveSize +1 to add a public property to replace IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE I don't like the name getWalArchiveSize - I think it's a bit confusing (is it the current size? the minimal size? the target size?) I suggest naming the property getMinWalArchiveSize. I think that this is exactly what it is - the minimal size of the archive that we want to have. The archive size at all times should be between min and max. If archive size is less than min or more than max then the system functionality can degrade (e.g. historical rebalance may not work as expected). I think these rules are intuitively understood from the "min" and "max" names. Ilya's suggestion about throttling is great although I'd do this in a different ticket. Thanks, Stan > On 5 May 2021, at 19:25, Maxim Muzafarov wrote: > > Hello, Kirill > > +1 for this change, however, there are too many configuration settings > that exist for the user to configure an Ignite cluster. It is better to > keep the options that we already have and fix the behaviour of the > rebalance process as you suggested. > > On Tue, 4 May 2021 at 19:01, ткаленко кирилл wrote: >> >> Hi Ilya! >> >> Then we can greatly reduce the user load on the cluster until the rebalance >> is over, which can be critical for the user. >> >> 04.05.2021, 18:43, "Ilya Kasnacheev" : >>> Hello! >>> >>> Maybe we can have a mechanism here similar (or equal) to checkpoint-based >>> write throttling? >>> >>> So we will be throttling for both checkpoint page buffer and WAL limit. >>> >>> Regards, >>> -- >>> Ilya Kasnacheev >>> >>> вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : >>> >>>> Hello everybody! >>>> >>>> At the moment, if there are partitions for the rebalance for which the >>>> historical rebalance will be used, then we reserve segments in the WAL >>>> archive (we do not allow cleaning the WAL archive) until the rebalance for >>>> all cache groups is over. 
>>>> >>>> If a cluster is under load during the rebalance, WAL archive size may >>>> significantly exceed the limit set in >>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is >>>> complete. This may lead to user issues and nodes may crash with the "No >>>> space left on device" error. >>>> >>>> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by >>>> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize) >>>> from which and up to which the WAL archive will be cleared, i.e. sets the >>>> size of the WAL archive that will always be on the node. I propose to >>>> replace this system property with the >>>> DataStorageConfiguration#getWalArchiveSize in bytes, the default is >>>> (getMaxWalArchiveSize * 0.5) as it is now. >>>> >>>> Main proposal: >>>> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel >>>> and do not allow the reservation of the WAL segments until we reach >>>> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no >>>> segment for historical rebalance, we will automatically switch to full >>>> rebalance.
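The min/max pair discussed in this message could look roughly like the following. This is a hedged sketch that follows the naming suggested in the thread; the real DataStorageConfiguration API may differ, and the derived default mirrors the old 0.5 threshold.

```java
// Hedged sketch of the discussed min/max pair (names follow the thread's
// suggestion; the real DataStorageConfiguration API may differ).
public class WalArchiveConfigSketch {
    private final long maxWalArchiveSize;
    private long minWalArchiveSize = -1; // unset: derive from max

    WalArchiveConfigSketch(long maxWalArchiveSize) {
        this.maxWalArchiveSize = maxWalArchiveSize;
    }

    /** Default mirrors the old 0.5 threshold: half of the maximum. */
    long getMinWalArchiveSize() {
        return minWalArchiveSize >= 0 ? minWalArchiveSize : maxWalArchiveSize / 2;
    }

    /** In steady state the archive size stays within [min, max]. */
    boolean inSteadyState(long archiveSize) {
        return getMinWalArchiveSize() <= archiveSize && archiveSize <= maxWalArchiveSize;
    }
}
```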
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Hello, Kirill +1 for this change, however, there are too many configuration settings that exist for the user to configure an Ignite cluster. It is better to keep the options that we already have and fix the behaviour of the rebalance process as you suggested. On Tue, 4 May 2021 at 19:01, ткаленко кирилл wrote: > > Hi Ilya! > > Then we can greatly reduce the user load on the cluster until the rebalance > is over, which can be critical for the user. > > 04.05.2021, 18:43, "Ilya Kasnacheev" : > > Hello! > > > > Maybe we can have a mechanism here similar (or equal) to checkpoint-based > > write throttling? > > > > So we will be throttling for both checkpoint page buffer and WAL limit. > > > > Regards, > > -- > > Ilya Kasnacheev > > > > вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : > > > >> Hello everybody! > >> > >> At the moment, if there are partitions for the rebalance for which the > >> historical rebalance will be used, then we reserve segments in the WAL > >> archive (we do not allow cleaning the WAL archive) until the rebalance for > >> all cache groups is over. > >> > >> If a cluster is under load during the rebalance, WAL archive size may > >> significantly exceed the limit set in > >> DataStorageConfiguration#getMaxWalArchiveSize until the process is > >> complete. This may lead to user issues and nodes may crash with the "No > >> space left on device" error. > >> > >> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by > >> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize) > >> from which and up to which the WAL archive will be cleared, i.e. sets the > >> size of the WAL archive that will always be on the node. I propose to > >> replace this system property with the > >> DataStorageConfiguration#getWalArchiveSize in bytes, the default is > >> (getMaxWalArchiveSize * 0.5) as it is now. 
> >> > >> Main proposal: > >> When theDataStorageConfiguration#getMaxWalArchiveSize is reached, cancel > >> and do not give the reservation of the WAL segments until we reach > >> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no > >> segment for historical rebalance, we will automatically switch to full > >> rebalance.
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Hi Ilya! Then we can greatly reduce the user load on the cluster until the rebalance is over. Which can be critical for the user. 04.05.2021, 18:43, "Ilya Kasnacheev" : > Hello! > > Maybe we can have a mechanic here similar (or equal) to checkpoint based > write throttling? > > So we will be throttling for both checkpoint page buffer and WAL limit. > > Regards, > -- > Ilya Kasnacheev > > вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : > >> Hello everybody! >> >> At the moment, if there are partitions for the rebalance for which the >> historical rebalance will be used, then we reserve segments in the WAL >> archive (we do not allow cleaning the WAL archive) until the rebalance for >> all cache groups is over. >> >> If a cluster is under load during the rebalance, WAL archive size may >> significantly exceed limits set in >> DataStorageConfiguration#getMaxWalArchiveSize until the process is >> complete. This may lead to user issues and nodes may crash with the "No >> space left on device" error. >> >> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by >> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize) >> from which and up to which the WAL archive will be cleared, i.e. sets the >> size of the WAL archive that will always be on the node. I propose to >> replace this system property with the >> DataStorageConfiguration#getWalArchiveSize in bytes, the default is >> (getMaxWalArchiveSize * 0.5) as it is now. >> >> Main proposal: >> When theDataStorageConfiguration#getMaxWalArchiveSize is reached, cancel >> and do not give the reservation of the WAL segments until we reach >> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no >> segment for historical rebalance, we will automatically switch to full >> rebalance.
Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Hello! Maybe we can have a mechanic here similar (or equal) to checkpoint based write throttling? So we will be throttling for both checkpoint page buffer and WAL limit. Regards, -- Ilya Kasnacheev вт, 4 мая 2021 г. в 11:29, ткаленко кирилл : > Hello everybody! > > At the moment, if there are partitions for the rebalance for which the > historical rebalance will be used, then we reserve segments in the WAL > archive (we do not allow cleaning the WAL archive) until the rebalance for > all cache groups is over. > > If a cluster is under load during the rebalance, WAL archive size may > significantly exceed limits set in > DataStorageConfiguration#getMaxWalArchiveSize until the process is > complete. This may lead to user issues and nodes may crash with the "No > space left on device" error. > > We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by > default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize) > from which and up to which the WAL archive will be cleared, i.e. sets the > size of the WAL archive that will always be on the node. I propose to > replace this system property with the > DataStorageConfiguration#getWalArchiveSize in bytes, the default is > (getMaxWalArchiveSize * 0.5) as it is now. > > Main proposal: > When theDataStorageConfiguration#getMaxWalArchiveSize is reached, cancel > and do not give the reservation of the WAL segments until we reach > DataStorageConfiguration#getWalArchiveSize. In this case, if there is no > segment for historical rebalance, we will automatically switch to full > rebalance. >
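Ilya's throttling idea above can be sketched roughly as below. Everything here is an illustrative assumption modeled on Ignite's exponential checkpoint-buffer write throttling: the class name, the 0.75 start threshold, and the backoff formula are not the actual implementation, only a shape the mechanic could take.

```java
/**
 * Illustrative sketch of WAL-limit write throttling, modeled on exponential
 * checkpoint-buffer throttling. All names and the backoff formula are
 * assumptions, not Ignite internals.
 */
class WalThrottleSketch {
    /** Start throttling once the archive passes this fraction of the limit (assumed). */
    private static final double THROTTLE_START = 0.75;

    private final long maxWalArchiveSize;

    WalThrottleSketch(long maxWalArchiveSize) {
        this.maxWalArchiveSize = maxWalArchiveSize;
    }

    /** Returns the park time (ns) for a writer thread given the current archive size. */
    long parkTimeNanos(long currentArchiveSize) {
        double fill = (double)currentArchiveSize / maxWalArchiveSize;

        if (fill < THROTTLE_START)
            return 0; // Plenty of headroom: no throttling.

        // Exponential backoff in the fraction above the start threshold, up to ~1s.
        double overshoot = (fill - THROTTLE_START) / (1.0 - THROTTLE_START);

        return (long)(1_000_000L * Math.pow(2, 10 * overshoot));
    }
}
```

The point of the exponential shape, as with checkpoint throttling, is that writers are slowed gently at first and drastically near the limit, so the archive approaches the cap asymptotically instead of blowing past it.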
Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance
Hello everybody!

At the moment, if there are partitions to rebalance for which historical rebalance will be used, we reserve segments in the WAL archive (i.e. we do not allow the WAL archive to be cleaned) until the rebalance of all cache groups is over.

If the cluster is under load during the rebalance, the WAL archive size may significantly exceed the limit set in DataStorageConfiguration#getMaxWalArchiveSize until the process is complete. This may lead to user issues, and nodes may crash with a "No space left on device" error.

We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE (default 0.5), which sets the threshold (multiplied by getMaxWalArchiveSize) down to which the WAL archive will be cleared, i.e. the size of the WAL archive that will always remain on the node. I propose to replace this system property with DataStorageConfiguration#getWalArchiveSize, in bytes, with the default (getMaxWalArchiveSize * 0.5) as it is now.

Main proposal: when DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel existing reservations of WAL segments and do not grant new ones until we get back down to DataStorageConfiguration#getWalArchiveSize. In this case, if a segment needed for historical rebalance is no longer available, we will automatically switch to full rebalance.
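The proposed two-threshold cleanup (on reaching the max, drop the oldest segments until the archive shrinks to the lower bound, 0.5 * max by default) could look roughly like the sketch below. Only the thresholds come from the proposal; the class and method names are hypothetical.

```java
import java.util.Deque;

/** Sketch of the proposed two-threshold WAL archive cleanup (names are illustrative). */
class WalArchiveCleaner {
    private final long maxWalArchiveSize;
    private final long minWalArchiveSize; // Proposed getWalArchiveSize(), default max * 0.5.

    WalArchiveCleaner(long maxWalArchiveSize) {
        this.maxWalArchiveSize = maxWalArchiveSize;
        this.minWalArchiveSize = (long)(maxWalArchiveSize * 0.5);
    }

    /**
     * When the total archive size reaches the max, drop the oldest segments
     * (releasing any reservations on them) until the lower threshold is reached.
     *
     * @param segments Oldest-first segment sizes in bytes; trimmed in place.
     * @return Number of segments removed.
     */
    int cleanupIfNeeded(Deque<Long> segments) {
        long total = segments.stream().mapToLong(Long::longValue).sum();

        if (total < maxWalArchiveSize)
            return 0; // Limit not reached: keep everything, including reserved segments.

        int removed = 0;

        while (total > minWalArchiveSize && !segments.isEmpty()) {
            total -= segments.pollFirst(); // Oldest segment goes first.
            removed++;
        }

        return removed;
    }
}
```

For example, with a 10-byte limit and segments [4, 4, 4] (total 12), cleanup removes two segments to get below the 5-byte lower bound.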
[jira] [Created] (IGNITE-14524) Historical rebalance doesn't work if cache has configured rebalanceDelay
Dmitry Lazurkin created IGNITE-14524:

Summary: Historical rebalance doesn't work if cache has configured rebalanceDelay
Key: IGNITE-14524
URL: https://issues.apache.org/jira/browse/IGNITE-14524
Project: Ignite
Issue Type: Bug
Affects Versions: 2.10
Reporter: Dmitry Lazurkin

I have a big cache configured with rebalanceMode = ASYNC and rebalanceDelay = 10_000ms. Persistence is enabled, maxWalArchiveSize = 10GB. I passed -DIGNITE_PREFER_WAL_REBALANCE=true and -DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=1 to Ignite, so the node should use historical rebalance if there is enough WAL. But it doesn't. After investigation I found that GridDhtPreloader#generateAssignments is always called with exchFut = null, and this method can't set histPartitions without exchFut. I think the problem is in GridCachePartitionExchangeManager (https://github.com/apache/ignite/blob/bc24f6baf3e9b4f98cf98cc5df67fb5deb5ceb6c/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L3486): it doesn't call generateAssignments without forcePreload if rebalanceDelay is configured. Historical rebalance works after removing rebalanceDelay.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14138) Historical rebalance kills cluster
Vladislav Pyatkov created IGNITE-14138: -- Summary: Historical rebalance kills cluster Key: IGNITE-14138 URL: https://issues.apache.org/jira/browse/IGNITE-14138 Project: Ignite Issue Type: Bug Reporter: Vladislav Pyatkov {noformat} [2021-01-12T05:11:02,142][ERROR][rebalance-#508%---%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.IgniteCheckedException: Failed to continue supplying [grp=SQL_USAGES_EPE, demander=48254935-7aa9-4ab5-b398-fdaec334fab7, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1 org.apache.ignite.IgniteCheckedException: Failed to continue supplying [grp=SQL_1, demander=48254935-7aa9-4ab5-b398-fdaec334fab7, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:571) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:398) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:489) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:474) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591) [ignite-core.jar] at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109) [ignite-core.jar] at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1707) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager.access$4300(GridIoManager.java:157) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:3011) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1662) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager.access$4900(GridIoManager.java:157) [ignite-core.jar] at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1629) [ignite-core.jar] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?] at java.lang.Thread.run(Thread.java:834) [?:?] Caused by: org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition [part=4, partCntrSince=1115] at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointHistory.searchEarliestWalPointer(CheckpointHistory.java:557) ~[ignite-core.jar] at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:1121) ~[ignite-core.jar] at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:1195) ~[ignite-core.jar] at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:322) ~[ignite-core.jar] ... 
16 more {noformat} I believe it should throw IgniteHistoricalIteratorException instead of IgniteCheckedException, so that the error can be properly handled and the rebalance can fall back to full rebalance instead of killing nodes.
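The fix suggested in the ticket, throwing a dedicated recoverable exception so the demander falls back to full rebalance rather than invoking the failure handler, might look like the sketch below. The inner exception here is a simplified stand-in for IgniteHistoricalIteratorException; everything else about the structure is an assumption, not Ignite's actual code path.

```java
/**
 * Hypothetical sketch of falling back to full rebalance when the historical
 * iterator cannot find a WAL start pointer, instead of failing the node.
 */
class RebalanceFallbackSketch {
    /** Simplified stand-in for IgniteHistoricalIteratorException from the ticket. */
    static class HistoricalIteratorException extends Exception {
        HistoricalIteratorException(String msg) {
            super(msg);
        }
    }

    /** Simulates building a historical iterator; fails when WAL history is missing. */
    static String rebalance(boolean walHistoryAvailable) {
        try {
            if (!walHistoryAvailable)
                throw new HistoricalIteratorException("Could not find start pointer for partition");

            return "historical"; // Stream only the missed updates from WAL.
        }
        catch (HistoricalIteratorException e) {
            // Recoverable: switch to full rebalance instead of stopping the node.
            return "full";
        }
    }
}
```

The design point is that a missing WAL pointer is an expected, recoverable condition with a well-defined fallback, so it deserves a specific checked exception rather than the generic one that trips the critical failure handler.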
Re: Choosing historical rebalance heuristics
Hi.

The proposed approach to improving rebalancing efficiency looks a little bit naive to me. A mature solution to this problem should include a computation of a "cost" for each rebalancing type. The cost formula should take various aspects into consideration, such as the number of pages in the memory cache, page replacement, disk read speed, the number of indexes, etc. As a short-term solution the proposed heuristic looks acceptable.

My suggestions for an implementation:
1. Implement the cost framework right now. The rebalance cost is measured as the sum of the costs for each rebalancing group, and this sum should be minimized. Rebalancing behavior should depend only on the cost formula's outcome, without the need to modify other code.
2. Avoid unnecessary WAL scanning. Ideally we should scan only the segments containing the required updates. We can build a tracking data structure for each partition, adjust it on each checkpoint and WAL archive removal, and then use it for efficient scanning during WAL iteration.
3. Avoid any calculations during the PME sync phase. The coordinator should only collect the available WAL history, switch lagging OWNING partitions to MOVING, and send all this to the nodes. After PME, the nodes should calculate the cost and choose the most efficient rebalancing for their local data.
4. We should log the reason for choosing a specific rebalancing type due to its cost.

I'm planning to implement tombstones in the near future. This should remove the requirement to clear a partition before full rebalancing in most cases, improving its cost. This can be further enhanced later using some kind of merkle trees, file-based rebalancing, or something else.

On Thu, 16 Jul 2020 at 18:33, Ivan Rakov wrote: > > > > I think we can modify the heuristic so > > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD - > > reduce it to 500) > > 2) Select only that partition for historical rebalance where difference > > between counters less that partition size.
> > Agreed, let's go this way. > > On Thu, Jul 16, 2020 at 11:03 AM Vladislav Pyatkov > wrote: > > > I completely forget about another promise to favor of using historical > > rebalance where it is possible. When cluster decided to use a full > balance, > > demander nodes should clear not empty partitions. > > This can to consume a long time, in some cases that may be compared with > a > > time of rebalance. > > It also accepts a side of heuristics above. > > > > On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov > > > wrote: > > > > > Ivan, > > > > > > I agree with a combined approach: threshold for small partitions and > > count > > > of update for partition that outgrew it. > > > This helps to avoid partitions that update not frequently. > > > > > > Reading of a big WAL piece (more than 100Gb) it can happen, when a > client > > > configured it intentionally. > > > There are no doubts we can to read it, otherwise WAL space was not > > > configured that too large. > > > > > > I don't see a connection optimization of iterator and issue in atomic > > > protocol. > > > Reordering in WAL, that happened in checkpoint where counter was not > > > changing, is an extremely rare case and the issue will not solve for > > > generic case, this should be fixed in bound of protocol. > > > > > > I think we can modify the heuristic so > > > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD > - > > > reduce it to 500) > > > 2) Select only that partition for historical rebalance where difference > > > between counters less that partition size. > > > > > > Also implement mentioned optimization for historical iterator, that may > > > reduce a time on reading large WAL interval. > > > > > > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov > > wrote: > > > > > >> Hi Vladislav, > > >> > > >> Thanks for raising this topic. > > >> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is > > 500_000) > > >> is controversial. 
Assuming that the default number of partitions is > > 1024, > > >> cache should contain a really huge amount of data in order to make WAL > > >> delta rebalancing possible. In fact, it's currently disabled for most > > >> production cases, which makes rebalancing of persistent caches > > >> unreasonably > > >> long. > > >> > > >> I think, your approach [1] makes much more sense than the current > > >> heuristic, let's move forward with the proposed solution. > &
Re: Choosing historical rebalance heuristics
> > I think we can modify the heuristic so > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD - > reduce it to 500) > 2) Select only that partition for historical rebalance where difference > between counters less that partition size. Agreed, let's go this way. On Thu, Jul 16, 2020 at 11:03 AM Vladislav Pyatkov wrote: > I completely forget about another promise to favor of using historical > rebalance where it is possible. When cluster decided to use a full balance, > demander nodes should clear not empty partitions. > This can to consume a long time, in some cases that may be compared with a > time of rebalance. > It also accepts a side of heuristics above. > > On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov > wrote: > > > Ivan, > > > > I agree with a combined approach: threshold for small partitions and > count > > of update for partition that outgrew it. > > This helps to avoid partitions that update not frequently. > > > > Reading of a big WAL piece (more than 100Gb) it can happen, when a client > > configured it intentionally. > > There are no doubts we can to read it, otherwise WAL space was not > > configured that too large. > > > > I don't see a connection optimization of iterator and issue in atomic > > protocol. > > Reordering in WAL, that happened in checkpoint where counter was not > > changing, is an extremely rare case and the issue will not solve for > > generic case, this should be fixed in bound of protocol. > > > > I think we can modify the heuristic so > > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD - > > reduce it to 500) > > 2) Select only that partition for historical rebalance where difference > > between counters less that partition size. > > > > Also implement mentioned optimization for historical iterator, that may > > reduce a time on reading large WAL interval. > > > > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov > wrote: > > > >> Hi Vladislav, > >> > >> Thanks for raising this topic. 
> >> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is > 500_000) > >> is controversial. Assuming that the default number of partitions is > 1024, > >> cache should contain a really huge amount of data in order to make WAL > >> delta rebalancing possible. In fact, it's currently disabled for most > >> production cases, which makes rebalancing of persistent caches > >> unreasonably > >> long. > >> > >> I think, your approach [1] makes much more sense than the current > >> heuristic, let's move forward with the proposed solution. > >> > >> Though, there are some other corner cases, e.g. this one: > >> - Configured size of WAL archive is big (>100 GB) > >> - Cache has small partitions (e.g. 1000 entries) > >> - Infrequent updates (e.g. ~100 in the whole WAL history of any node) > >> - There is another cache with very frequent updates which allocate >99% > of > >> WAL > >> In such scenario we may need to iterate over >100 GB of WAL in order to > >> fetch <1% of needed updates. Even though the amount of network traffic > is > >> still optimized, it would be more effective to transfer partitions with > >> ~1000 entries fully instead of reading >100 GB of WAL. > >> > >> I want to highlight that your heuristic definitely makes the situation > >> better, but due to possible corner cases we should keep the fallback > lever > >> to restrict or limit historical rebalance as before. Probably, it would > be > >> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low > >> default value (1000, 500 or even 0) and apply your heuristic only for > >> partitions with bigger size. > >> > >> Regarding case [2]: it looks like an improvement that can mitigate some > >> corner cases (including the one that I have described). I'm ok with it > as > >> long as it takes data updates reordering on backup nodes into account. > We > >> don't track skipped updates for atomic caches. 
As a result, detection of > >> the absence of updates between two checkpoint markers with the same > >> partition counter can be false positive. > >> > >> -- > >> Best Regards, > >> Ivan Rakov > >> > >> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov > > >> wrote: > >> > >> > Hi guys, > >> > > >> > I want to implement a more honest heuristic for historical re
Re: Choosing historical rebalance heuristics
I completely forgot about another argument in favor of using historical rebalance where possible: when the cluster decides to use full rebalance, the demander nodes have to clear their non-empty partitions first. This can take a long time, in some cases comparable to the time of the rebalance itself. This also supports the heuristics above. On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov wrote: > Ivan, > > I agree with a combined approach: threshold for small partitions and count > of update for partition that outgrew it. > This helps to avoid partitions that update not frequently. > > Reading of a big WAL piece (more than 100Gb) it can happen, when a client > configured it intentionally. > There are no doubts we can to read it, otherwise WAL space was not > configured that too large. > > I don't see a connection optimization of iterator and issue in atomic > protocol. > Reordering in WAL, that happened in checkpoint where counter was not > changing, is an extremely rare case and the issue will not solve for > generic case, this should be fixed in bound of protocol. > > I think we can modify the heuristic so > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD - > reduce it to 500) > 2) Select only that partition for historical rebalance where difference > between counters less that partition size. > > Also implement mentioned optimization for historical iterator, that may > reduce a time on reading large WAL interval. > > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov wrote: > >> Hi Vladislav, >> >> Thanks for raising this topic. >> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000) >> is controversial. Assuming that the default number of partitions is 1024, >> cache should contain a really huge amount of data in order to make WAL >> delta rebalancing possible. In fact, it's currently disabled for most >> production cases, which makes rebalancing of persistent caches >> unreasonably >> long.
>> >> I think, your approach [1] makes much more sense than the current >> heuristic, let's move forward with the proposed solution. >> >> Though, there are some other corner cases, e.g. this one: >> - Configured size of WAL archive is big (>100 GB) >> - Cache has small partitions (e.g. 1000 entries) >> - Infrequent updates (e.g. ~100 in the whole WAL history of any node) >> - There is another cache with very frequent updates which allocate >99% of >> WAL >> In such scenario we may need to iterate over >100 GB of WAL in order to >> fetch <1% of needed updates. Even though the amount of network traffic is >> still optimized, it would be more effective to transfer partitions with >> ~1000 entries fully instead of reading >100 GB of WAL. >> >> I want to highlight that your heuristic definitely makes the situation >> better, but due to possible corner cases we should keep the fallback lever >> to restrict or limit historical rebalance as before. Probably, it would be >> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low >> default value (1000, 500 or even 0) and apply your heuristic only for >> partitions with bigger size. >> >> Regarding case [2]: it looks like an improvement that can mitigate some >> corner cases (including the one that I have described). I'm ok with it as >> long as it takes data updates reordering on backup nodes into account. We >> don't track skipped updates for atomic caches. As a result, detection of >> the absence of updates between two checkpoint markers with the same >> partition counter can be false positive. >> >> -- >> Best Regards, >> Ivan Rakov >> >> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov >> wrote: >> >> > Hi guys, >> > >> > I want to implement a more honest heuristic for historical rebalance. >> > Before, a cluster makes a choice between the historical rebalance or >> not it >> > only from a partition size. This threshold more known by a name of >> property >> > IGNITE_PDS_WAL_REBALANCE_THRESHOLD. 
>> > It might prevent a historical rebalance when a partition is too small, >> but >> > not if WAL contains more updates than a size of partition, historical >> > rebalance still can be chosen. >> > There is a ticket where need to implement more fair heuristic[1]. >> > >> > My idea for implementation is need to estimate a size of data which >> will be >> > transferred owe network. In other word if need to rebalance a part of >> WAL >> > that contains N updates, for recover a partition on another node,
Re: Choosing historical rebalance heuristics
Ivan,

I agree with a combined approach: a threshold for small partitions, plus an update count for partitions that outgrow it. This helps to avoid partitions that are updated infrequently.

Reading a big piece of WAL (more than 100Gb) can happen when a client has configured it that way intentionally. There is no doubt we can read it; otherwise the WAL space would not have been configured that large.

I don't see a connection between the iterator optimization and the issue in the atomic protocol. Reordering in the WAL that happened within a checkpoint where the counter did not change is an extremely rare case, and the issue would not be solved for the generic case anyway; it should be fixed within the bounds of the protocol.

I think we can modify the heuristic so:
1) Exclude partitions by threshold (reduce IGNITE_PDS_WAL_REBALANCE_THRESHOLD to 500).
2) Select for historical rebalance only those partitions where the difference between counters is less than the partition size.

Also, implement the mentioned optimization for the historical iterator, which may reduce the time of reading a large WAL interval.

On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov wrote: > Hi Vladislav, > > Thanks for raising this topic. > Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000) > is controversial. Assuming that the default number of partitions is 1024, > cache should contain a really huge amount of data in order to make WAL > delta rebalancing possible. In fact, it's currently disabled for most > production cases, which makes rebalancing of persistent caches unreasonably > long. > > I think, your approach [1] makes much more sense than the current > heuristic, let's move forward with the proposed solution. > > Though, there are some other corner cases, e.g. this one: > - Configured size of WAL archive is big (>100 GB) > - Cache has small partitions (e.g. 1000 entries) > - Infrequent updates (e.g.
~100 in the whole WAL history of any node) > - There is another cache with very frequent updates which allocate >99% of > WAL > In such scenario we may need to iterate over >100 GB of WAL in order to > fetch <1% of needed updates. Even though the amount of network traffic is > still optimized, it would be more effective to transfer partitions with > ~1000 entries fully instead of reading >100 GB of WAL. > > I want to highlight that your heuristic definitely makes the situation > better, but due to possible corner cases we should keep the fallback lever > to restrict or limit historical rebalance as before. Probably, it would be > handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low > default value (1000, 500 or even 0) and apply your heuristic only for > partitions with bigger size. > > Regarding case [2]: it looks like an improvement that can mitigate some > corner cases (including the one that I have described). I'm ok with it as > long as it takes data updates reordering on backup nodes into account. We > don't track skipped updates for atomic caches. As a result, detection of > the absence of updates between two checkpoint markers with the same > partition counter can be false positive. > > -- > Best Regards, > Ivan Rakov > > On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov > wrote: > > > Hi guys, > > > > I want to implement a more honest heuristic for historical rebalance. > > Before, a cluster makes a choice between the historical rebalance or not > it > > only from a partition size. This threshold more known by a name of > property > > IGNITE_PDS_WAL_REBALANCE_THRESHOLD. > > It might prevent a historical rebalance when a partition is too small, > but > > not if WAL contains more updates than a size of partition, historical > > rebalance still can be chosen. > > There is a ticket where need to implement more fair heuristic[1]. > > > > My idea for implementation is need to estimate a size of data which will > be > > transferred owe network. 
In other word if need to rebalance a part of WAL > > that contains N updates, for recover a partition on another node, which > > have to contain M rows at all, need chooses a historical rebalance on the > > case where N < M (WAL history should be presented as well). > > > > This approach is easy implemented, because a coordinator node has the > size > > of partitions and counters' interval. But in this case cluster still can > > find not many updates in too long WAL history. I assume a possibility to > > work around it, if rebalance historical iterator will not handle > > checkpoints where not contains updates of particular cache. Checkpoints > can > > skip if counters for the cache (maybe even a specific partitions) was not > > changed between it and next one. > > > > Ticket for improvement rebalance historical iterator[2] > > > > I want to hear a view of community on the thought above. > > Maybe anyone has another opinion? > > > > [1]: https://issues.apache.org/jira/browse/IGNITE-13253 > > [2]: https://issues.apache.org/jira/browse/IGNITE-13254 > > > > -- > > Vladislav Pyatkov > > > -- Vladislav Pyatkov
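The two-rule heuristic agreed in this thread (a size-threshold filter plus the counter-gap vs. partition-size comparison) can be expressed as a small decision function. The reduced default of 500 and the two rules come from the discussion; the class, method, and parameter names are assumptions for illustration.

```java
/**
 * Sketch of the agreed heuristic (names are illustrative): use historical
 * rebalance only for partitions above the size threshold whose update-counter
 * gap is smaller than the partition size.
 */
class RebalanceHeuristic {
    /** Proposed reduced default for IGNITE_PDS_WAL_REBALANCE_THRESHOLD. */
    static final long WAL_REBALANCE_THRESHOLD = 500;

    /**
     * @param partSize     Number of rows currently in the partition (M).
     * @param cntrGap      Updates missed by the demander, i.e. WAL updates to replay (N).
     * @param walAvailable Whether the needed WAL history is still in the archive.
     * @return {@code true} to use historical rebalance, {@code false} for full.
     */
    static boolean useHistorical(long partSize, long cntrGap, boolean walAvailable) {
        if (!walAvailable)
            return false;                      // No history: full rebalance is the only option.

        if (partSize < WAL_REBALANCE_THRESHOLD)
            return false;                      // Small partition: cheaper to send it fully.

        return cntrGap < partSize;             // N < M: replaying WAL moves less data.
    }
}
```

Keeping the threshold check first preserves the fallback lever Ivan asked for: tiny partitions never go historical regardless of the counter comparison.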
Re: Choosing historical rebalance heuristics
Hi Vladislav,

Thanks for raising this topic.

Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000) is controversial. Assuming that the default number of partitions is 1024, cache should contain a really huge amount of data in order to make WAL delta rebalancing possible. In fact, it's currently disabled for most production cases, which makes rebalancing of persistent caches unreasonably long.

I think, your approach [1] makes much more sense than the current heuristic, let's move forward with the proposed solution.

Though, there are some other corner cases, e.g. this one:
- Configured size of WAL archive is big (>100 GB)
- Cache has small partitions (e.g. 1000 entries)
- Infrequent updates (e.g. ~100 in the whole WAL history of any node)
- There is another cache with very frequent updates which allocate >99% of WAL

In such scenario we may need to iterate over >100 GB of WAL in order to fetch <1% of needed updates. Even though the amount of network traffic is still optimized, it would be more effective to transfer partitions with ~1000 entries fully instead of reading >100 GB of WAL.

I want to highlight that your heuristic definitely makes the situation better, but due to possible corner cases we should keep the fallback lever to restrict or limit historical rebalance as before. Probably, it would be handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low default value (1000, 500 or even 0) and apply your heuristic only for partitions with bigger size.

Regarding case [2]: it looks like an improvement that can mitigate some corner cases (including the one that I have described). I'm ok with it as long as it takes data updates reordering on backup nodes into account. We don't track skipped updates for atomic caches. As a result, detection of the absence of updates between two checkpoint markers with the same partition counter can be false positive.
-- Best Regards, Ivan Rakov On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov wrote: > Hi guys, > > I want to implement a more honest heuristic for historical rebalance. > Before, a cluster makes a choice between the historical rebalance or not it > only from a partition size. This threshold more known by a name of property > IGNITE_PDS_WAL_REBALANCE_THRESHOLD. > It might prevent a historical rebalance when a partition is too small, but > not if WAL contains more updates than a size of partition, historical > rebalance still can be chosen. > There is a ticket where need to implement more fair heuristic[1]. > > My idea for implementation is need to estimate a size of data which will be > transferred owe network. In other word if need to rebalance a part of WAL > that contains N updates, for recover a partition on another node, which > have to contain M rows at all, need chooses a historical rebalance on the > case where N < M (WAL history should be presented as well). > > This approach is easy implemented, because a coordinator node has the size > of partitions and counters' interval. But in this case cluster still can > find not many updates in too long WAL history. I assume a possibility to > work around it, if rebalance historical iterator will not handle > checkpoints where not contains updates of particular cache. Checkpoints can > skip if counters for the cache (maybe even a specific partitions) was not > changed between it and next one. > > Ticket for improvement rebalance historical iterator[2] > > I want to hear a view of community on the thought above. > Maybe anyone has another opinion? > > [1]: https://issues.apache.org/jira/browse/IGNITE-13253 > [2]: https://issues.apache.org/jira/browse/IGNITE-13254 > > -- > Vladislav Pyatkov >
Choosing historical rebalance heuristics
Hi guys,

I want to implement a more honest heuristic for historical rebalance. Currently, a cluster chooses whether to use historical rebalance based only on partition size. This threshold is better known by the name of the property IGNITE_PDS_WAL_REBALANCE_THRESHOLD. It can prevent historical rebalance when a partition is too small, but if the WAL contains more updates than the size of the partition, historical rebalance can still be chosen. There is a ticket to implement a fairer heuristic [1].

My idea is to estimate the size of the data that will be transferred over the network. In other words, if we need to rebalance the part of the WAL that contains N updates in order to recover a partition that contains M rows in total, we should choose historical rebalance when N < M (provided the WAL history is present as well).

This approach is easy to implement, because the coordinator node knows the partition sizes and the counter intervals. But in this case the cluster can still find only a few updates in a very long WAL history. I assume we can work around this if the rebalance historical iterator skips checkpoints that contain no updates of the particular cache. A checkpoint can be skipped if the counters for the cache (maybe even specific partitions) did not change between it and the next one.

Ticket for improving the rebalance historical iterator: [2]

I want to hear the community's view on the thoughts above. Maybe anyone has another opinion?

[1]: https://issues.apache.org/jira/browse/IGNITE-13253
[2]: https://issues.apache.org/jira/browse/IGNITE-13254

--
Vladislav Pyatkov
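The proposed decision rule can be sketched from the coordinator's point of view, since the message says the coordinator already knows partition sizes and counter intervals. The names below are hypothetical, invented for illustration, and not actual Ignite code:

```java
// Hypothetical sketch: the coordinator estimates N (missing updates) from the
// partition counter interval and compares it with M (partition row count).
final class RebalanceTypeChooser {
    /** N: updates the demander is missing, taken from the counter interval. */
    static long missingUpdates(long supplierCntr, long demanderCntr) {
        return supplierCntr - demanderCntr;
    }

    /**
     * Choose historical rebalance when the WAL delta is smaller than the
     * partition itself (N < M) and the WAL history is still available.
     */
    static boolean useHistorical(long supplierCntr, long demanderCntr, long partRows, boolean histAvailable) {
        return histAvailable && missingUpdates(supplierCntr, demanderCntr) < partRows;
    }
}
```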
[jira] [Created] (IGNITE-13254) Historical rebalance iterator may skip checkpoint if it does not contain updates
Vladislav Pyatkov created IGNITE-13254:
--
Summary: Historical rebalance iterator may skip checkpoint if it does not contain updates
Key: IGNITE-13254
URL: https://issues.apache.org/jira/browse/IGNITE-13254
Project: Ignite
Issue Type: Improvement
Reporter: Vladislav Pyatkov
--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13253) Advanced heuristics for historical rebalance
Vladislav Pyatkov created IGNITE-13253:
--
Summary: Advanced heuristics for historical rebalance
Key: IGNITE-13253
URL: https://issues.apache.org/jira/browse/IGNITE-13253
Project: Ignite
Issue Type: Improvement
Reporter: Vladislav Pyatkov

Currently, the cluster detects partitions that should not be rebalanced by history based on their size. This threshold can be set through the system property IGNITE_PDS_WAL_REBALANCE_THRESHOLD. But deciding which partitions will be rebalanced by WAL only by their size is not fair: the WAL can contain many more records than the size of a partition (many updates of one key), so such a rebalance would require more data than a full transfer over the network. We need to implement a heuristic that can estimate the data size.
[jira] [Created] (IGNITE-13168) Retrigger historical rebalance if it was cancelled in case WAL history is still available
Vladislav Pyatkov created IGNITE-13168:
--
Summary: Retrigger historical rebalance if it was cancelled in case WAL history is still available
Key: IGNITE-13168
URL: https://issues.apache.org/jira/browse/IGNITE-13168
Project: Ignite
Issue Type: Improvement
Reporter: Vladislav Pyatkov

If historical rebalance is cancelled, full rebalance will be unconditionally triggered on the PME that caused the cancellation (only outdated OWNING partitions can be rebalanced by history in the current implementation). We have to allow MOVING partitions to be historically rebalanced as well.
[jira] [Created] (IGNITE-12935) Disadvantages in log of historical rebalance
Vladislav Pyatkov created IGNITE-12935:
--
Summary: Disadvantages in log of historical rebalance
Key: IGNITE-12935
URL: https://issues.apache.org/jira/browse/IGNITE-12935
Project: Ignite
Issue Type: Improvement
Reporter: Vladislav Pyatkov

# Mention in the log only partitions for which there are no nodes that suit as a historical supplier. For these partitions, print the minimal counter (since which we should perform historical rebalancing) with the corresponding node and the maximum reserved counter (since which the cluster can perform historical rebalancing) with the corresponding node. This will let us know:
## Whether history was reserved at all
## How much reserved history we lack to perform a historical rebalancing
## I see the resulting output like this:
Historical rebalancing wasn't scheduled for some partitions:
History wasn't reserved for: [list of partitions and groups]
History was reserved, but minimum present counter is less than maximum reserved: [[grp=GRP, part=ID, minCntr=cntr, minNodeId=ID, maxReserved=cntr, maxReservedNodeId=ID], ...]
## We can also aggregate the previous message by (minNodeId) to easily find the exact node (or nodes) which caused the full rebalance.
# Log results of reserveHistoryForExchange(). They can be compactly represented as mappings: (grpId -> checkpoint (id, timestamp)). For every group, also log a message about why the previous checkpoint wasn't successfully reserved. There can be three reasons:
## The previous checkpoint simply isn't present in the history (the oldest is reserved)
## WAL reservation failure (the call below returned false)
{code:java}
chpEntry = entry(cpTs);

boolean reserved = cctx.wal().reserve(chpEntry.checkpointMark());

// If checkpoint WAL history can't be reserved, stop searching.
if (!reserved)
    break;
{code}
## The checkpoint was marked as inapplicable for historical rebalancing
{code:java}
for (Integer grpId : new HashSet<>(groupsAndPartitions.keySet()))
    if (!isCheckpointApplicableForGroup(grpId, chpEntry))
        groupsAndPartitions.remove(grpId);
{code}
[jira] [Created] (IGNITE-12495) Make historical rebalance work at the per-partition level
Maxim Muzafarov created IGNITE-12495:
--
Summary: Make historical rebalance work at the per-partition level
Key: IGNITE-12495
URL: https://issues.apache.org/jira/browse/IGNITE-12495
Project: Ignite
Issue Type: Sub-task
Reporter: Maxim Muzafarov

Currently, historical rebalance runs after all cache group partition files have been preloaded and initialized. To make file rebalance faster, historical rebalance can be started at the per-partition level, right after each cache group partition file is loaded and initialized.
[jira] [Created] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable
Ivan Rakov created IGNITE-12429:
--
Summary: Rework bytes-based WAL archive size management logic to make historical rebalance more predictable
Key: IGNITE-12429
URL: https://issues.apache.org/jira/browse/IGNITE-12429
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Rakov

Since 2.7, DataStorageConfiguration allows specifying the size of the WAL archive in bytes (see DataStorageConfiguration#maxWalArchiveSize), which is much more transparent to the user. Unfortunately, the new logic may be unpredictable when it comes to historical rebalance. The WAL archive is truncated when one of the following conditions occurs:
1. The total number of checkpoints in the WAL archive is bigger than DataStorageConfiguration#walHistSize
2. The total size of the WAL archive is bigger than DataStorageConfiguration#maxWalArchiveSize
Independently, the in-memory checkpoint history contains only a fixed number of the last checkpoints (can be changed with IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
All these particular qualities make it hard for the user to control usage of historical rebalance. Imagine the case when the user has a slight load (the WAL gets rotated very slowly) and the default checkpoint frequency. After 100 * 3 = 300 minutes, all updates in the WAL will be impossible to receive via historical rebalance even if:
1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
2. The user has configured a large DataStorageConfiguration#walHistSize
At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE will help (only combined with the previous two points), but Ignite node heap usage may increase dramatically.
I propose to change the WAL history management logic in the following way:
1. *Don't* cut the WAL archive when the number of checkpoints exceeds DataStorageConfiguration#walHistSize. WAL history should be managed only based on DataStorageConfiguration#maxWalArchiveSize.
2. Checkpoint history should contain a fixed number of entries, but should cover the whole stored WAL archive (not only its more recent part with the IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be achieved by making the checkpoint history sparse: some intermediate checkpoints *may not be present in the history*, and the fixed number of checkpoints can be positioned either uniformly (trying to keep a fixed number of bytes between two neighbour checkpoints) or exponentially (trying to keep a fixed ratio between (size of WAL from checkpoint(N-1) to the current write pointer) and (size of WAL from checkpoint(N) to the current write pointer)).
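The sparse history from point 2 could be sketched roughly as below. This is a hypothetical illustration of the uniform-spacing variant only, with invented names, not the actual Ignite implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sparse checkpoint history: bounded capacity, but
// covering the whole archive by thinning intermediate entries (keeping the
// spacing between neighbours roughly uniform) instead of evicting the oldest.
final class SparseCheckpointHistory {
    private final int capacity;

    /** WAL offsets of retained checkpoints, ascending. */
    private final List<Long> cpOffsets = new ArrayList<>();

    SparseCheckpointHistory(int capacity) {
        this.capacity = capacity;
    }

    void onCheckpoint(long walOffset) {
        cpOffsets.add(walOffset);

        if (cpOffsets.size() <= capacity)
            return;

        // Drop the intermediate checkpoint whose removal creates the smallest
        // gap between its neighbours; never drop the oldest or the newest.
        int victim = 1;
        long bestGap = Long.MAX_VALUE;

        for (int i = 1; i < cpOffsets.size() - 1; i++) {
            long gap = cpOffsets.get(i + 1) - cpOffsets.get(i - 1);

            if (gap < bestGap) {
                bestGap = gap;
                victim = i;
            }
        }

        cpOffsets.remove(victim);
    }

    List<Long> offsets() {
        return List.copyOf(cpOffsets);
    }
}
```

The exponential variant would only change the victim-selection rule (keep a fixed ratio of distances to the write pointer instead of uniform gaps).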
[jira] [Created] (IGNITE-12117) Historical rebalance should not be processed in striped way
Anton Vinogradov created IGNITE-12117:
--
Summary: Historical rebalance should not be processed in striped way
Key: IGNITE-12117
URL: https://issues.apache.org/jira/browse/IGNITE-12117
Project: Ignite
Issue Type: Task
Reporter: Anton Vinogradov

We have to investigate whether it is possible to process historical rebalance like regular rebalance (using an unstriped pool) and, if possible, refactor it.
--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (IGNITE-11607) Historical rebalance is not possible from partition which was recently rebalanced itself
Alexei Scherbakov created IGNITE-11607:
--
Summary: Historical rebalance is not possible from partition which was recently rebalanced itself
Key: IGNITE-11607
URL: https://issues.apache.org/jira/browse/IGNITE-11607
Project: Ignite
Issue Type: Improvement
Reporter: Alexei Scherbakov
Fix For: 2.8
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: Historical rebalance
Roman,

What is the advantage of your algorithm compared to the previous one? The previous algorithm does almost the same, but without updating two separate counters, and looks simpler to me. Only one update is sufficient - at transaction commit. When a transaction starts we just read the currently active update counter (LWM), which is enough for us to know where to start from. Moreover, we do not need to learn any kind of WAL pointers or write additional WAL records. Please note that we are trying to solve a more difficult problem - how to rebalance as little WAL as possible in the case of long-running transactions.

On Mon, Dec 3, 2018 at 2:29 PM Roman Kondakov wrote: > Vladimir, > > the difference between per-transaction basis and update-counters basis > is the fact that at the first update we don't know the actual update > counter of this update - we just count deltas on enlist phase. Actual > update counter of this update will be assigned on transaction commit. > But for per-transaction based the actual HWM is known for each > transaction from the very beginning and this value is the same for > primary and backups. Having this number it is very easy to find where > transaction begins on any node. > > > -- > Kind Regards > Roman Kondakov > > On 03.12.2018 13:46, Vladimir Ozerov wrote: > > Roman, > > > > We already track updates on per-transaction basis. The only difference is > > that instead of doing a single "increment(1)" for transaction we do > > "increment(X)" where X is number of updates in the given transaction. > > > > On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov > > > wrote: > > > >> Igor, Vladimir, Ivan, > >> > >> perhaps, we are focused too much on update counters. This feature was > >> designed for the continuous queries and it may not be suited well for > >> the historical rebalance. What if we would track updates on > >> per-transaction basis instead of per-update basis?
Let's consider two > >> counters: low-water mark (LWM) and high-water mark (HWM) which should be > >> added to each partition. They have the following properties: > >> > >> * HWM - is a plane atomic counter. When Tx makes its first write on > >> primary node it does incrementAndGet for this counter and remembers > >> obtained value within its context. This counter can be considered as tx > >> id within current partition - transactions should maintain per-partition > >> map of their HWM ids. WAL pointer to the first record should remembered > >> in this map. Also this id should be recorded to WAL data records. > >> > >> When Tx sends updates to backups it sends Tx HWM too. When backup > >> receives this message from the primary node it takes HWM and do > >> setIfGreater on the local HWM counter. > >> > >> * LWM - is a plane atomic counter. When Tx terminates (either with > >> commit or rollback) it updates its local LWM in the same manner as > >> update counters do it using holes tracking. For example, if partition's > >> LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update > >> partition LWM until tx with id = 11 is committed. When id = 11 is > >> committed, LWM is set to 12. If we have LWM == N, this means that all > >> transactions with id <= N have been terminated for the current partition > >> and all data is already recorded in the local partition. > >> > >> Brief summary for both counters: HWM - means that partition has already > >> seen at least one update of transactions with id <= HWM. LWM means that > >> partition has all updates made by transactions wth id <= LWM. > >> > >> LWM is always <= HWM. > >> > >> On checkpoint we should store only these two counters in checkpoint > >> record. As optimization we can also store list of pending LWMs - ids > >> which haven't been merged to LWM because of the holes in sequence. > >> > >> Historical rebalance: > >> > >> 1. Demander knows its LWM - all updates before it has been applied. 
> >> Demander sends LWM to supplier. > >> > >> 2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM > >> (demander) > >> > >> 3. Supplier starts moving forward on WAL until it finds first data > >> record with HWM id = LWM (demander). From this point WAL can be > >> rebalanced to demander. > >> > >> In this approach updates and checkpoints on primary and backup can be > >> reordered in any wa
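The hole-tracking LWM quoted above (the LWM advances only when the sequence of terminated transaction ids is contiguous) can be sketched as a tiny counter. A hypothetical illustration with invented names, not Ignite code:

```java
import java.util.TreeSet;

// Hypothetical sketch of the LWM described above: commits arriving out of
// order are parked as "holes", and the LWM advances only over a contiguous
// prefix of terminated transaction ids.
final class LowWaterMark {
    private long lwm; // All txs with id <= lwm have terminated.

    /** Terminated ids above lwm that are not yet contiguous with it. */
    private final TreeSet<Long> pending = new TreeSet<>();

    synchronized void onTxTerminated(long txId) {
        pending.add(txId);

        // Merge the contiguous prefix into the LWM.
        while (!pending.isEmpty() && pending.first() == lwm + 1)
            lwm = pending.pollFirst();
    }

    synchronized long get() {
        return lwm;
    }
}
```

This mirrors the example in the quote: with LWM = 10, a commit of id 12 is parked until id 11 commits, at which point the LWM jumps to 12.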
Re: Historical rebalance
Vladimir, the difference between per-transaction basis and update-counters basis is the fact that at the first update we don't know the actual update counter of this update - we just count deltas on enlist phase. Actual update counter of this update will be assigned on transaction commit. But for per-transaction based the actual HWM is known for each transaction from the very beginning and this value is the same for primary and backups. Having this number it is very easy to find where transaction begins on any node. -- Kind Regards Roman Kondakov On 03.12.2018 13:46, Vladimir Ozerov wrote: Roman, We already track updates on per-transaction basis. The only difference is that instead of doing a single "increment(1)" for transaction we do "increment(X)" where X is number of updates in the given transaction. On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov wrote: Igor, Vladimir, Ivan, perhaps, we are focused too much on update counters. This feature was designed for the continuous queries and it may not be suited well for the historical rebalance. What if we would track updates on per-transaction basis instead of per-update basis? Let's consider two counters: low-water mark (LWM) and high-water mark (HWM) which should be added to each partition. They have the following properties: * HWM - is a plane atomic counter. When Tx makes its first write on primary node it does incrementAndGet for this counter and remembers obtained value within its context. This counter can be considered as tx id within current partition - transactions should maintain per-partition map of their HWM ids. WAL pointer to the first record should remembered in this map. Also this id should be recorded to WAL data records. When Tx sends updates to backups it sends Tx HWM too. When backup receives this message from the primary node it takes HWM and do setIfGreater on the local HWM counter. * LWM - is a plane atomic counter. 
When Tx terminates (either with commit or rollback) it updates its local LWM in the same manner as update counters do it using holes tracking. For example, if partition's LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update partition LWM until tx with id = 11 is committed. When id = 11 is committed, LWM is set to 12. If we have LWM == N, this means that all transactions with id <= N have been terminated for the current partition and all data is already recorded in the local partition. Brief summary for both counters: HWM - means that partition has already seen at least one update of transactions with id <= HWM. LWM means that partition has all updates made by transactions wth id <= LWM. LWM is always <= HWM. On checkpoint we should store only these two counters in checkpoint record. As optimization we can also store list of pending LWMs - ids which haven't been merged to LWM because of the holes in sequence. Historical rebalance: 1. Demander knows its LWM - all updates before it has been applied. Demander sends LWM to supplier. 2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM (demander) 3. Supplier starts moving forward on WAL until it finds first data record with HWM id = LWM (demander). From this point WAL can be rebalanced to demander. In this approach updates and checkpoints on primary and backup can be reordered in any way, but we can always find a proper point to read WAL from. Let's consider a couple of examples. In this examples transaction updates marked as w1(a) - transaction 1 updates key=a, c1 - transaction 1 is committed, cp(1, 0) - checkpoint with HWM=1 and LWM=0. (HWM,LWM) - current counters after operation. (HWM,LWM[hole1, hole2]) - counters with holes in LWM. 1. 
Simple case with no reordering: PRIMARY -w1(a)---cp(1,0)---w2(b)w1(c)--c1c2-cp(2,2) (HWM,LWM)(1,0) (2,0)(2,0) (2,1) (2,2) | || || BACKUP --w1(a)-w2(b)w1(c)---cp(2,0)c1c2-cp(2,2) (HWM,LWM)(1,0) (2,0)(2,0) (2,1) (2,2) In this case if backup failed before c1 it will receive all updates from the beginning (HWM=0). If it fails between c1 and c2, it will receive WAL from primary's cp(1,0), because tx with id=1 is fully processed on backup: HWM(supplier cp(1,0))=1 == LWM(demander)=1 if backup fails after c2, it will receive nothing because it has all updates HWM(supplier)=2 == LWM(demander)=2 2. Case with reordering PRIMARY -w1(a)---cp(1,0)---w2(b)--cp(2,0)--w1(c)--c1-c2---cp(2,2) (HWM,LWM)(1,0) (2,0)
Re: Historical rebalance
Roman, We already track updates on per-transaction basis. The only difference is that instead of doing a single "increment(1)" for transaction we do "increment(X)" where X is number of updates in the given transaction. On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov wrote: > Igor, Vladimir, Ivan, > > perhaps, we are focused too much on update counters. This feature was > designed for the continuous queries and it may not be suited well for > the historical rebalance. What if we would track updates on > per-transaction basis instead of per-update basis? Let's consider two > counters: low-water mark (LWM) and high-water mark (HWM) which should be > added to each partition. They have the following properties: > > * HWM - is a plane atomic counter. When Tx makes its first write on > primary node it does incrementAndGet for this counter and remembers > obtained value within its context. This counter can be considered as tx > id within current partition - transactions should maintain per-partition > map of their HWM ids. WAL pointer to the first record should remembered > in this map. Also this id should be recorded to WAL data records. > > When Tx sends updates to backups it sends Tx HWM too. When backup > receives this message from the primary node it takes HWM and do > setIfGreater on the local HWM counter. > > * LWM - is a plane atomic counter. When Tx terminates (either with > commit or rollback) it updates its local LWM in the same manner as > update counters do it using holes tracking. For example, if partition's > LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update > partition LWM until tx with id = 11 is committed. When id = 11 is > committed, LWM is set to 12. If we have LWM == N, this means that all > transactions with id <= N have been terminated for the current partition > and all data is already recorded in the local partition. 
> > Brief summary for both counters: HWM - means that partition has already > seen at least one update of transactions with id <= HWM. LWM means that > partition has all updates made by transactions wth id <= LWM. > > LWM is always <= HWM. > > On checkpoint we should store only these two counters in checkpoint > record. As optimization we can also store list of pending LWMs - ids > which haven't been merged to LWM because of the holes in sequence. > > Historical rebalance: > > 1. Demander knows its LWM - all updates before it has been applied. > Demander sends LWM to supplier. > > 2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM > (demander) > > 3. Supplier starts moving forward on WAL until it finds first data > record with HWM id = LWM (demander). From this point WAL can be > rebalanced to demander. > > In this approach updates and checkpoints on primary and backup can be > reordered in any way, but we can always find a proper point to read WAL > from. > > Let's consider a couple of examples. In this examples transaction > updates marked as w1(a) - transaction 1 updates key=a, c1 - transaction > 1 is committed, cp(1, 0) - checkpoint with HWM=1 and LWM=0. (HWM,LWM) - > current counters after operation. (HWM,LWM[hole1, hole2]) - counters > with holes in LWM. > > > 1. Simple case with no reordering: > > PRIMARY > -w1(a)---cp(1,0)---w2(b)w1(c)--c1c2-cp(2,2) > (HWM,LWM)(1,0) (2,0)(2,0) (2,1) (2,2) >| || || > BACKUP > --w1(a)-w2(b)w1(c)---cp(2,0)c1c2-cp(2,2) > (HWM,LWM)(1,0) (2,0)(2,0) (2,1) (2,2) > > > In this case if backup failed before c1 it will receive all updates from > the beginning (HWM=0). > If it fails between c1 and c2, it will receive WAL from primary's cp(1,0), > because tx with id=1 is fully processed on backup: HWM(supplier cp(1,0))=1 > == LWM(demander)=1 > if backup fails after c2, it will receive nothing because it has all > updates HWM(supplier)=2 == LWM(demander)=2 > > > > 2. 
Case with reordering > > PRIMARY > -w1(a)---cp(1,0)---w2(b)--cp(2,0)--w1(c)--c1-c2---cp(2,2) > (HWM,LWM)(1,0) (2,0) (2,0) > (2,1) (2,2) > \_ | | > \ | >\___ | | > \__|___ >\__|__ | > | \ > | \| > |\ > BACKUP > -w2(b)---w1(a)cp(2,0)---w1(c)c2---c1-
Re: Historical rebalance
Igor, Vladimir, Ivan,

Perhaps we are focused too much on update counters. This feature was designed for continuous queries and may not be suited well for historical rebalance. What if we tracked updates on a per-transaction basis instead of a per-update basis? Let's consider two counters: a low-water mark (LWM) and a high-water mark (HWM), which should be added to each partition. They have the following properties:

* HWM is a plain atomic counter. When a Tx makes its first write on the primary node it does incrementAndGet on this counter and remembers the obtained value within its context. This counter can be considered a tx id within the current partition - transactions should maintain a per-partition map of their HWM ids. The WAL pointer to the first record should be remembered in this map. Also, this id should be recorded in WAL data records.

When a Tx sends updates to backups it sends the Tx HWM too. When a backup receives this message from the primary node it takes the HWM and does setIfGreater on the local HWM counter.

* LWM is a plain atomic counter. When a Tx terminates (either with commit or rollback) it updates its local LWM in the same manner as update counters do, using hole tracking. For example, if the partition's LWM = 10 now, and a tx with id (HWM id) = 12 commits, we do not update the partition LWM until the tx with id = 11 is committed. When id = 11 is committed, the LWM is set to 12. If we have LWM == N, this means that all transactions with id <= N have been terminated for the current partition and all their data is already recorded in the local partition.

Brief summary for both counters: HWM means that the partition has already seen at least one update of transactions with id <= HWM. LWM means that the partition has all updates made by transactions with id <= LWM.

LWM is always <= HWM.

On checkpoint we should store only these two counters in the checkpoint record. As an optimization, we can also store the list of pending LWMs - ids which haven't been merged into LWM because of holes in the sequence.

Historical rebalance:

1. The demander knows its LWM - all updates before it have been applied. The demander sends its LWM to the supplier.

2. The supplier finds the earliest checkpoint where HWM(supplier) <= LWM(demander).

3. The supplier starts moving forward in the WAL until it finds the first data record with HWM id = LWM(demander). From this point the WAL can be rebalanced to the demander.

In this approach updates and checkpoints on the primary and backup can be reordered in any way, but we can always find a proper point to read the WAL from.

Let's consider a couple of examples. In these examples transaction updates are marked as w1(a) - transaction 1 updates key=a, c1 - transaction 1 is committed, cp(1,0) - checkpoint with HWM=1 and LWM=0. (HWM,LWM) - current counters after the operation. (HWM,LWM[hole1, hole2]) - counters with holes in the LWM.

1. Simple case with no reordering:

PRIMARY    -w1(a)---cp(1,0)---w2(b)---w1(c)---c1------c2------cp(2,2)
(HWM,LWM)   (1,0)             (2,0)   (2,0)   (2,1)   (2,2)

BACKUP     --w1(a)---w2(b)---w1(c)---cp(2,0)---c1------c2------cp(2,2)
(HWM,LWM)   (1,0)    (2,0)   (2,0)             (2,1)   (2,2)

In this case, if the backup failed before c1 it will receive all updates from the beginning (HWM=0). If it fails between c1 and c2, it will receive the WAL from the primary's cp(1,0), because the tx with id=1 is fully processed on the backup: HWM(supplier cp(1,0))=1 == LWM(demander)=1. If the backup fails after c2, it will receive nothing because it has all updates: HWM(supplier)=2 == LWM(demander)=2.

2. Case with reordering:

PRIMARY    -w1(a)---cp(1,0)---w2(b)---cp(2,0)---w1(c)---c1------c2------cp(2,2)
(HWM,LWM)   (1,0)             (2,0)             (2,0)   (2,1)   (2,2)

BACKUP     -w2(b)---w1(a)---cp(2,0)---w1(c)---c2---------c1------cp(2,2)
(HWM,LWM)   (2,0)   (2,0)             (2,0)   (2,0[2])   (2,2)

Note here we have a hole on the backup when tx2 has committed earlier than tx1 and the LWM wasn't changed at that moment. In the last case, if the backup failed before c1, the entire WAL will be supplied because LWM=0 until this moment.
If backup fails after c1 - there is nothing to rebalance, because HWM(supplier)=2 == LWM(demander)=2 What do you think? -- Kind Regards Roman Kondakov On 30.11.2018 2:01, Seliverstov Igor wrote: Vladimir, Look at my example: One active transaction (Tx1 which does opX ops) while another tx (Tx2 which does opX' ops) is fi
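Step 2 of the historical rebalance above can be sketched as a search over the supplier's checkpoint history. Since the HWM recorded in checkpoints only grows, the checkpoints satisfying HWM <= LWM(demander) form a prefix, and the newest of them gives the shortest WAL read. The names below are invented for illustration and are not Ignite internals:

```java
// Hypothetical sketch of step 2 above: pick the starting checkpoint on the
// supplier such that HWM(supplier checkpoint) <= LWM(demander). HWM grows
// monotonically, so scanning newest-to-oldest finds the latest safe start.
final class SupplierCheckpointSearch {
    record Checkpoint(long hwm, long lwm, long walOffset) {}

    /**
     * @param history Checkpoints in chronological order.
     * @return WAL offset to start historical rebalance from, or -1 if no
     *         checkpoint qualifies and a full rebalance is needed.
     */
    static long startOffset(Checkpoint[] history, long demanderLwm) {
        for (int i = history.length - 1; i >= 0; i--) {
            if (history[i].hwm() <= demanderLwm)
                return history[i].walOffset();
        }

        return -1; // No safe checkpoint: fall back to full rebalance.
    }
}
```

With the history {cp(1,0), cp(2,2)} from example 1, a demander with LWM=1 starts from cp(1,0), matching the example in the message.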
Re: Historical rebalance
Vladimir,

Look at my example: one active transaction (Tx1, which does opX ops) while another tx (Tx2, which does opX' ops) finishes with uc4:

uc1--op1--op2---uc2--op1'--uc3--uc4---op3--X-   (Node1)
uc1--op1--uc2--op2--uc3--op3--uc4--cp1          (Node2, Tx1)
uc1--uc2--uc3--op1'--uc4--cp1                   (Node2, Tx2)

state on Node2: tx1 -> op3 -> uc2; cp1 [current=uc4, backpointer=uc2]

Here op2 was acknowledged by op3, and op3 was applied before op1' (linearized by the WAL). All nodes having uc4 must have op1', because uc4 cannot be obtained earlier than the prepare stage, while the prepare stage happens after all updates, so *op1' happens before uc4* regardless of whether Tx2 was committed or rolled back. This means *op2 happens before uc4* (uc4 cannot be earlier than op2 on any node, because on Node2 op2 was already finished (acknowledged by op3) when op1' happened).

That was my idea, which is easy to prove. You used a different approach, but yes, it has to work.

Thu, Nov 29, 2018 at 22:19, Vladimir Ozerov : > "If more recent WAL records will contain *ALL* updates of the transaction" > -> "More recent WAL records will contain *ALL* updates of the transaction" > > On Thu, Nov 29, 2018 at 10:15 PM Vladimir Ozerov > wrote: > > > Igor, > > > > Yes, I tried to draw different configurations, and it really seems to > > work, despite of being very hard to proof due to non-inituitive HB edges. > > So let me try to spell the algorithm once again to make sure that we are > on > > the same page here. > > > > 1) There are two nodes - primary (P) and backup (B) > > 2) There are three type of events: small transactions which possibly > > increments update counter (ucX), one long active transaction which is > split > > into multiple operations (opX), and checkpoints (cpX) > > 3) Every node always has current update counter. When transaction commits > > it may or may not shift this counter further depending on whether there > are > > holes behind. But we have a strict rule that it always grow. Higher > > coutners synchrnoizes with smaller. 
Possible cases: > > uc1uc2uc3 > > uc1uc3--- // uc2 missing due to reorder, but is is ok > > > > 4) Operations within a single transaction is always applied sequentially, > > and hence also have HB edge: > > op1op2op3 > > > > 5) When transaction operation happens, we save in memory current update > > counter available at this moment. I.e. we have a map from transaction ID > to > > update counter which was relevant by the time last *completed* operation > > *started*. This is very important thing - we remember the counter when > > operation starts, but update the map only when it finishes. This is > needed > > for situation when update counter is bumber in the middle of a long > > operation. > > uc1op1op2uc2uc3op3 > > | || > >uc1uc1 uc3 > > > > state: tx1 -> op3 -> uc3 > > > > 6) Whenever checkpoint occurs, we save two counters with: "current" and > > "backpointer". The latter is the smallest update counter associated with > > active transactions. If there are no active transactions, current update > > counter is used. > > > > Example 1: no active transactions. > > uc1cp1 > > ^ | > > > > > > state: cp1 [current=uc1, backpointer=uc1] > > > > Example 2: one active transaction: > > --- > > | | > > uc1op1uc2op2op3uc3cp1 > >^ | > >-- > > > > state: tx1 -> op3 -> uc2 > >cp1 [current=uc3, backpointer=uc2] > > > > 7) Historical rebalance: > > 7.1) Demander finds latest checkpoint, get it's backpointer and sends it > > to supplier. > > 7.2) Supplier finds earliest checkpoint where [supplier(current) <= > > demander(backpointer)] > > 7.3) Supplier reads checkpoint backpointer and finds associated WAL > > record. This is where
Re: Historical rebalance
Igor, Yes, I tried to draw different configurations, and it really seems to work, despite of being very hard to proof due to non-inituitive HB edges. So let me try to spell the algorithm once again to make sure that we are on the same page here. 1) There are two nodes - primary (P) and backup (B) 2) There are three type of events: small transactions which possibly increments update counter (ucX), one long active transaction which is split into multiple operations (opX), and checkpoints (cpX) 3) Every node always has current update counter. When transaction commits it may or may not shift this counter further depending on whether there are holes behind. But we have a strict rule that it always grow. Higher coutners synchrnoizes with smaller. Possible cases: uc1uc2uc3 uc1uc3--- // uc2 missing due to reorder, but is is ok 4) Operations within a single transaction is always applied sequentially, and hence also have HB edge: op1op2op3 5) When transaction operation happens, we save in memory current update counter available at this moment. I.e. we have a map from transaction ID to update counter which was relevant by the time last *completed* operation *started*. This is very important thing - we remember the counter when operation starts, but update the map only when it finishes. This is needed for situation when update counter is bumber in the middle of a long operation. uc1op1op2uc2uc3op3 | || uc1uc1 uc3 state: tx1 -> op3 -> uc3 6) Whenever checkpoint occurs, we save two counters with: "current" and "backpointer". The latter is the smallest update counter associated with active transactions. If there are no active transactions, current update counter is used. Example 1: no active transactions. 
uc1 - cp1
 ^     |
 +-----+

state: cp1 [current=uc1, backpointer=uc1]

Example 2: one active transaction:

uc1 - op1 - uc2 - op2 - op3 - uc3 - cp1
             ^                       |
             +-----------------------+

state: tx1 -> op3 -> uc2
       cp1 [current=uc3, backpointer=uc2]

7) Historical rebalance:
7.1) Demander finds the latest checkpoint, gets its backpointer and sends it to the supplier.
7.2) Supplier finds the earliest checkpoint where [supplier(current) <= demander(backpointer)]
7.3) Supplier reads the checkpoint backpointer and finds the associated WAL record. This is where we start.

So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current <= demander_uc_backpointer)] <- demander[uc_backpointer <- cp(last)]

Now the most important - why it works :-)
1) Transaction operations are sequential, so at the time of a crash nodes are *at most one operation ahead* of each other
2) Demander goes to the past and finds the update counter which was current at the time of the last completed TX operation
3) Supplier goes to the closest checkpoint in the past where this update counter either doesn't exist or just appeared
4) Transaction cannot be committed on the supplier at this checkpoint, as it would violate the UC happens-before rule
5) Transaction may have not started yet on the supplier at this point. More recent WAL records will contain *ALL* updates of the transaction
6) Transaction may exist on the supplier at this checkpoint. Thanks to p.1 we must skip at most one operation. A jump back through the supplier's checkpoint backpointer is guaranteed to do this.

Igor, do we have the same understanding here?

Vladimir.

On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor wrote: > Ivan, > > different transactions may be applied in different order on backup nodes. > That's why we need an active tx set > and some sorting by their update times. The idea is to identify a point in > time which starting from we may lose some updates.
> This point: > 1) is the last update on the timeline acknowledged by all backups (including a possible future > demander); > 2) has a specific update counter (aka back-counter) which we are going to > start iteration from. > > After additional thinking on it, I've identified a rule: > > There are two fences: > 1) update counter (UC) - this means that all updates with a smaller UC than > the applied one were applied on a node having this UC. > 2) update in scope of a TX - all updates are applied one by one > sequentially; this means that the fact of an update guarantees that the previous > update (statement) was finished on all TX participants. > > Combining them, we can say the following: > > All updates that were acknowledged at the time when the last update of the tx which > updated the UC was applied are guaranteed to be present on a node having such a > UC. > > We can use this rule to find an iterator start pointer. > > ср, 28 нояб. 2018 г. в
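As a sanity check, the bookkeeping from points 5-7 of the algorithm above can be sketched in Python. All names here are illustrative, not Ignite code, and step 7.2 is read as "the closest checkpoint in the past whose current counter does not exceed the demander's backpointer", per point 3 of the "why it works" list:

```python
class Node:
    """Toy model of one node's update counter and checkpoint bookkeeping."""

    def __init__(self):
        self.update_counter = 0   # current partition update counter
        self.tx_backpointer = {}  # tx id -> counter current when its last completed op STARTED
        self.checkpoints = []     # list of (current, backpointer) pairs

    def small_tx(self):
        # a small committed transaction just bumps the update counter
        self.update_counter += 1

    def long_tx_op(self, tx_id):
        # remember the counter at operation start, publish it to the map only
        # when the operation finishes (here both happen synchronously)
        counter_at_start = self.update_counter
        self.tx_backpointer[tx_id] = counter_at_start

    def checkpoint(self):
        # backpointer = smallest counter of an active tx; current counter if none
        backpointer = min(self.tx_backpointer.values(), default=self.update_counter)
        self.checkpoints.append((self.update_counter, backpointer))


def supplier_start(supplier, demander_backpointer):
    # closest supplier checkpoint in the past with current <= demander's backpointer
    for i in range(len(supplier.checkpoints) - 1, -1, -1):
        current, _ = supplier.checkpoints[i]
        if current <= demander_backpointer:
            return i
    return None
```

Replaying Example 2 (uc1, op1, uc2, op2, op3, uc3, cp1) against this toy model yields the checkpoint pair (3, 2), i.e. [current=uc3, backpointer=uc2], matching the state above.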
Re: Historical rebalance
"If more recent WAL records will contain *ALL* updates of the transaction" -> "More recent WAL records will contain *ALL* updates of the transaction" On Thu, Nov 29, 2018 at 10:15 PM Vladimir Ozerov wrote: > Igor, > > Yes, I tried to draw different configurations, and it really seems to > work, despite of being very hard to proof due to non-inituitive HB edges. > So let me try to spell the algorithm once again to make sure that we are on > the same page here. > > 1) There are two nodes - primary (P) and backup (B) > 2) There are three type of events: small transactions which possibly > increments update counter (ucX), one long active transaction which is split > into multiple operations (opX), and checkpoints (cpX) > 3) Every node always has current update counter. When transaction commits > it may or may not shift this counter further depending on whether there are > holes behind. But we have a strict rule that it always grow. Higher > coutners synchrnoizes with smaller. Possible cases: > uc1uc2uc3 > uc1uc3--- // uc2 missing due to reorder, but is is ok > > 4) Operations within a single transaction is always applied sequentially, > and hence also have HB edge: > op1op2op3 > > 5) When transaction operation happens, we save in memory current update > counter available at this moment. I.e. we have a map from transaction ID to > update counter which was relevant by the time last *completed* operation > *started*. This is very important thing - we remember the counter when > operation starts, but update the map only when it finishes. This is needed > for situation when update counter is bumber in the middle of a long > operation. > uc1op1op2uc2uc3op3 > | || >uc1uc1 uc3 > > state: tx1 -> op3 -> uc3 > > 6) Whenever checkpoint occurs, we save two counters with: "current" and > "backpointer". The latter is the smallest update counter associated with > active transactions. If there are no active transactions, current update > counter is used. 
> > Example 1: no active transactions. > uc1cp1 > ^ | > > > state: cp1 [current=uc1, backpointer=uc1] > > Example 2: one active transaction: > --- > | | > uc1op1----uc2op2op3uc3cp1 >^ | >-- > > state: tx1 -> op3 -> uc2 >cp1 [current=uc3, backpointer=uc2] > > 7) Historical rebalance: > 7.1) Demander finds latest checkpoint, get it's backpointer and sends it > to supplier. > 7.2) Supplier finds earliest checkpoint where [supplier(current) <= > demander(backpointer)] > 7.3) Supplier reads checkpoint backpointer and finds associated WAL > record. This is where we start. > > So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current <= > demanter_uc_backpointer)] <- demander[uc_backpointer <- cp(last)] > > Now the most important - why it works :-) > 1) Transaction opeartions are sequential, so at the time of crash nodes > are *at most one operation ahead *each other > 2) Demander goes to the past and finds update counter which was current at > the time of last TX completed operation > 3) Supplier goes to the closest checkpoint in the past where this update > counter either doesn't exist or just appeared > 4) Transaction cannot be committed on supplier at this checkpoint, as it > would violate UC happens-before rule > 5) Tranasction may have not started yet on supplier at this point. If more > recent WAL records will contain *ALL* updates of the transaction > 6) Transaction may exist on supplier at this checkpoint. Thanks to p.1 we > must skip at most one operation. Jump back through supplier's checkpoint > backpointer is guaranteed to do this. > > Igor, do we have the same understanding here? > > Vladimir. > > On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor > wrote: > >> Ivan, >> >> different transactions may be applied in different order on backup nodes. >> That's why we need an active tx set >> and some sorting by their update times. The idea is to identify a point in >> time which starting from we may lost some updates. 
>> This point: >>1) is the last acknowledged by all backups (including possible further >> demander) update on timeline; >>2) have a specific update counter (aka back-counter) which we going to >> start iteration from. >> >> After additional thinking on, I've ide
Re: Historical rebalance
Guys,

One more idea. We can introduce an additional update counter which is incremented by MVCC transactions right after executing an operation (like it is done for classic transactions). And we can use that counter for searching for the needed WAL records. Can it do the trick?

P.S. Mentally I am trying to separate the facilities providing transactions and durability. And it seems to me that those facilities are in different dimensions.

ср, 28 нояб. 2018 г. в 16:26, Павлухин Иван : > > Sorry, if it was stated that a SINGLE transaction updates are applied > in a same order on all replicas then I have no questions so far. I > thought about reordering updates coming from different transactions. > > I have not got why we can assume that reordering is not possible. What > have I missed? > ср, 28 нояб. 2018 г. в 13:26, Павлухин Иван : > > > > Hi, > > > > Regarding Vladimir's new idea. > > > We assume that transaction can be represented as a set of independent > > > operations, which are applied in the same order on both primary and > > > backup nodes. > > I have not got why we can assume that reordering is not possible. What > > have I missed? > > вт, 27 нояб. 2018 г. в 14:42, Seliverstov Igor : > > > > > > Vladimir, > > > > > > I think I got your point, > > > > > > It should work if we do the next: > > > introduce two structures: active list (txs) and candidate list (updCntr -> > > > txn pairs) > > > > > > Track active txs, mapping them to actual update counter at update time. > > > On each next update put update counter, associated with previous update, > > > into a candidates list possibly overwrite existing value (checking txn) > > > On tx finish remove tx from active list only if appropriate update counter > > > (associated with finished tx) is applied. > > > On update counter update set the minimal update counter from the > > > candidates > > > list as a back-counter, clear the candidate list and remove an associated > > > tx from the active list if present.
> > > Use back-counter instead of actual update counter in demand message. > > > > > > вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor : > > > > > > > Ivan, > > > > > > > > 1) The list is saved on each checkpoint, wholly (all transactions in > > > > active state at checkpoint begin). > > > > We need whole the list to get oldest transaction because after > > > > the previous oldest tx finishes, we need to get the following one. > > > > > > > > 2) I guess there is a description of how persistent storage works and > > > > how > > > > it restores [1] > > > > > > > > Vladimir, > > > > > > > > the whole list of what we going to store on checkpoint (updated): > > > > 1) Partition counter low watermark (LWM) > > > > 2) WAL pointer of earliest active transaction write to partition at the > > > > time the checkpoint have started > > > > 3) List of prepared txs with acquired partition counters (which were > > > > acquired but not applied yet) > > > > > > > > This way we don't need any additional info in demand message. Start > > > > point > > > > can be easily determined using stored WAL "back-pointer". > > > > > > > > [1] > > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess > > > > > > > > > > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov : > > > > > > > >> Igor, > > > >> > > > >> Could you please elaborate - what is the whole set of information we > > > >> are > > > >> going to save at checkpoint time? From what I understand this should > > > >> be: > > > >> 1) List of active transactions with WAL pointers of their first writes > > > >> 2) List of prepared transactions with their update counters > > > >> 3) Partition counter low watermark (LWM) - the smallest partition > > > >> counter > > > >> before which there are no prepared transactions. 
> > > >> > > > >> And the we send to supplier node a message: "Give me all updates > > > >> starting > > > >> from that LWM plus data for that transactions which were active when I > > > >> failed". > > > >> &
Re: Historical rebalance
Sorry, if it was stated that a SINGLE transaction updates are applied in a same order on all replicas then I have no questions so far. I thought about reordering updates coming from different transactions. > I have not got why we can assume that reordering is not possible. What have I missed? ср, 28 нояб. 2018 г. в 13:26, Павлухин Иван : > > Hi, > > Regarding Vladimir's new idea. > > We assume that transaction can be represented as a set of independent > > operations, which are applied in the same order on both primary and backup > > nodes. > I have not got why we can assume that reordering is not possible. What > have I missed? > вт, 27 нояб. 2018 г. в 14:42, Seliverstov Igor : > > > > Vladimir, > > > > I think I got your point, > > > > It should work if we do the next: > > introduce two structures: active list (txs) and candidate list (updCntr -> > > txn pairs) > > > > Track active txs, mapping them to actual update counter at update time. > > On each next update put update counter, associated with previous update, > > into a candidates list possibly overwrite existing value (checking txn) > > On tx finish remove tx from active list only if appropriate update counter > > (associated with finished tx) is applied. > > On update counter update set the minimal update counter from the candidates > > list as a back-counter, clear the candidate list and remove an associated > > tx from the active list if present. > > Use back-counter instead of actual update counter in demand message. > > > > вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor : > > > > > Ivan, > > > > > > 1) The list is saved on each checkpoint, wholly (all transactions in > > > active state at checkpoint begin). > > > We need whole the list to get oldest transaction because after > > > the previous oldest tx finishes, we need to get the following one. 
> > > > > > 2) I guess there is a description of how persistent storage works and how > > > it restores [1] > > > > > > Vladimir, > > > > > > the whole list of what we going to store on checkpoint (updated): > > > 1) Partition counter low watermark (LWM) > > > 2) WAL pointer of earliest active transaction write to partition at the > > > time the checkpoint have started > > > 3) List of prepared txs with acquired partition counters (which were > > > acquired but not applied yet) > > > > > > This way we don't need any additional info in demand message. Start point > > > can be easily determined using stored WAL "back-pointer". > > > > > > [1] > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess > > > > > > > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov : > > > > > >> Igor, > > >> > > >> Could you please elaborate - what is the whole set of information we are > > >> going to save at checkpoint time? From what I understand this should be: > > >> 1) List of active transactions with WAL pointers of their first writes > > >> 2) List of prepared transactions with their update counters > > >> 3) Partition counter low watermark (LWM) - the smallest partition counter > > >> before which there are no prepared transactions. > > >> > > >> And the we send to supplier node a message: "Give me all updates starting > > >> from that LWM plus data for that transactions which were active when I > > >> failed". > > >> > > >> Am I right? > > >> > > >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor > > >> wrote: > > >> > > >> > Hi Igniters, > > >> > > > >> > Currently I’m working on possible approaches how to implement > > >> > historical > > >> > rebalance (delta rebalance using WAL iterator) over MVCC caches. 
> > >> > > > >> > The main difficulty is that MVCC writes changes on tx active phase > > >> > while > > >> > partition update version, aka update counter, is being applied on tx > > >> > finish. This means we cannot start iteration over WAL right from the > > >> > pointer where the update counter updated, but should include updates, > > >> which > > >> > the transaction that updated the counter did. > > >> > > > >> > These updates may be m
Re: Historical rebalance
Hi, Regarding Vladimir's new idea. > We assume that transaction can be represented as a set of independent > operations, which are applied in the same order on both primary and backup > nodes. I have not got why we can assume that reordering is not possible. What have I missed? вт, 27 нояб. 2018 г. в 14:42, Seliverstov Igor : > > Vladimir, > > I think I got your point, > > It should work if we do the next: > introduce two structures: active list (txs) and candidate list (updCntr -> > txn pairs) > > Track active txs, mapping them to actual update counter at update time. > On each next update put update counter, associated with previous update, > into a candidates list possibly overwrite existing value (checking txn) > On tx finish remove tx from active list only if appropriate update counter > (associated with finished tx) is applied. > On update counter update set the minimal update counter from the candidates > list as a back-counter, clear the candidate list and remove an associated > tx from the active list if present. > Use back-counter instead of actual update counter in demand message. > > вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor : > > > Ivan, > > > > 1) The list is saved on each checkpoint, wholly (all transactions in > > active state at checkpoint begin). > > We need whole the list to get oldest transaction because after > > the previous oldest tx finishes, we need to get the following one. > > > > 2) I guess there is a description of how persistent storage works and how > > it restores [1] > > > > Vladimir, > > > > the whole list of what we going to store on checkpoint (updated): > > 1) Partition counter low watermark (LWM) > > 2) WAL pointer of earliest active transaction write to partition at the > > time the checkpoint have started > > 3) List of prepared txs with acquired partition counters (which were > > acquired but not applied yet) > > > > This way we don't need any additional info in demand message. 
Start point > > can be easily determined using stored WAL "back-pointer". > > > > [1] > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess > > > > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov : > > > >> Igor, > >> > >> Could you please elaborate - what is the whole set of information we are > >> going to save at checkpoint time? From what I understand this should be: > >> 1) List of active transactions with WAL pointers of their first writes > >> 2) List of prepared transactions with their update counters > >> 3) Partition counter low watermark (LWM) - the smallest partition counter > >> before which there are no prepared transactions. > >> > >> And the we send to supplier node a message: "Give me all updates starting > >> from that LWM plus data for that transactions which were active when I > >> failed". > >> > >> Am I right? > >> > >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor > >> wrote: > >> > >> > Hi Igniters, > >> > > >> > Currently I’m working on possible approaches how to implement historical > >> > rebalance (delta rebalance using WAL iterator) over MVCC caches. > >> > > >> > The main difficulty is that MVCC writes changes on tx active phase while > >> > partition update version, aka update counter, is being applied on tx > >> > finish. This means we cannot start iteration over WAL right from the > >> > pointer where the update counter updated, but should include updates, > >> which > >> > the transaction that updated the counter did. > >> > > >> > These updates may be much earlier than the point where the update > >> counter > >> > was updated, so we have to be able to identify the point where the first > >> > update happened. 
> >> > > >> > The proposed approach includes: > >> > > >> > 1) preserve list of active txs, sorted by the time of their first update > >> > (using WAL ptr of first WAL record in tx) > >> > > >> > 2) persist this list on each checkpoint (together with TxLog for > >> example) > >> > > >> > 4) send whole active tx list (transactions which were in active state at > >> > the time the node was crushed, empty list in case of graceful node > >> stop) as > >> > a part of partition demand message. > >> > > >> > 4) find a checkpoint where the earliest tx exists in persisted txs and > >> use > >> > saved WAL ptr as a start point or apply current approach in case the > >> active > >> > tx list (sent on previous step) is empty > >> > > >> > 5) start iteration. > >> > > >> > Your thoughts? > >> > > >> > Regards, > >> > Igor > >> > > -- Best regards, Ivan Pavlukhin
Re: Historical rebalance
Vladimir,

I think I got your point.

It should work if we do the following: introduce two structures - an active list (txs) and a candidate list (updCntr -> txn pairs).

Track active txs, mapping them to the actual update counter at update time.
On each next update, put the update counter associated with the previous update into the candidate list, possibly overwriting the existing value (checking the txn).
On tx finish, remove the tx from the active list only if the appropriate update counter (associated with the finished tx) is applied.
On update counter advance, set the minimal update counter from the candidate list as a back-counter, clear the candidate list and remove the associated tx from the active list if present.
Use the back-counter instead of the actual update counter in the demand message.

вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor : > Ivan, > > 1) The list is saved on each checkpoint, wholly (all transactions in > active state at checkpoint begin). > We need whole the list to get oldest transaction because after > the previous oldest tx finishes, we need to get the following one. > > 2) I guess there is a description of how persistent storage works and how > it restores [1] > > Vladimir, > > the whole list of what we going to store on checkpoint (updated): > 1) Partition counter low watermark (LWM) > 2) WAL pointer of earliest active transaction write to partition at the > time the checkpoint have started > 3) List of prepared txs with acquired partition counters (which were > acquired but not applied yet) > > This way we don't need any additional info in demand message. Start point > can be easily determined using stored WAL "back-pointer". > > [1] > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov : > >> Igor, >> >> Could you please elaborate - what is the whole set of information we are >> going to save at checkpoint time?
From what I understand this should be: >> 1) List of active transactions with WAL pointers of their first writes >> 2) List of prepared transactions with their update counters >> 3) Partition counter low watermark (LWM) - the smallest partition counter >> before which there are no prepared transactions. >> >> And the we send to supplier node a message: "Give me all updates starting >> from that LWM plus data for that transactions which were active when I >> failed". >> >> Am I right? >> >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor >> wrote: >> >> > Hi Igniters, >> > >> > Currently I’m working on possible approaches how to implement historical >> > rebalance (delta rebalance using WAL iterator) over MVCC caches. >> > >> > The main difficulty is that MVCC writes changes on tx active phase while >> > partition update version, aka update counter, is being applied on tx >> > finish. This means we cannot start iteration over WAL right from the >> > pointer where the update counter updated, but should include updates, >> which >> > the transaction that updated the counter did. >> > >> > These updates may be much earlier than the point where the update >> counter >> > was updated, so we have to be able to identify the point where the first >> > update happened. >> > >> > The proposed approach includes: >> > >> > 1) preserve list of active txs, sorted by the time of their first update >> > (using WAL ptr of first WAL record in tx) >> > >> > 2) persist this list on each checkpoint (together with TxLog for >> example) >> > >> > 4) send whole active tx list (transactions which were in active state at >> > the time the node was crushed, empty list in case of graceful node >> stop) as >> > a part of partition demand message. >> > >> > 4) find a checkpoint where the earliest tx exists in persisted txs and >> use >> > saved WAL ptr as a start point or apply current approach in case the >> active >> > tx list (sent on previous step) is empty >> > >> > 5) start iteration. 
>> > >> > Your thoughts? >> > >> > Regards, >> > Igor >> >
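The active/candidate list bookkeeping Igor describes above can be sketched as a small toy class (all names are hypothetical; the "remove the tx on finish only once its counter is applied" step is omitted for brevity):

```python
class BackCounterTracker:
    """Toy model of the active/candidate list scheme (illustrative only)."""

    def __init__(self):
        self.active = {}      # tx id -> update counter of its latest update
        self.candidates = {}  # tx id -> counter associated with its previous update
        self.back_counter = 0

    def on_update(self, tx_id, update_counter):
        # move the counter of this tx's previous update into the candidate
        # list, possibly overwriting the existing candidate for the same tx
        if tx_id in self.active:
            self.candidates[tx_id] = self.active[tx_id]
        self.active[tx_id] = update_counter

    def on_counter_advance(self):
        # take the minimal candidate as the new back-counter, then reset;
        # the back-counter is what would go into the demand message
        if self.candidates:
            self.back_counter = min(self.candidates.values())
            self.candidates.clear()
```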
Re: Historical rebalance
Ivan,

1) The list is saved on each checkpoint, wholly (all transactions in active state at checkpoint begin). We need the whole list to get the oldest transaction because after the previous oldest tx finishes, we need to get the following one.

2) I guess there is a description of how the persistent storage works and how it restores [1]

Vladimir,

the whole list of what we are going to store on each checkpoint (updated):
1) Partition counter low watermark (LWM)
2) WAL pointer of the earliest active transaction write to the partition at the time the checkpoint has started
3) List of prepared txs with acquired partition counters (which were acquired but not applied yet)

This way we don't need any additional info in the demand message. The start point can be easily determined using the stored WAL "back-pointer".

[1]
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess

вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov : > Igor, > > Could you please elaborate - what is the whole set of information we are > going to save at checkpoint time? From what I understand this should be: > 1) List of active transactions with WAL pointers of their first writes > 2) List of prepared transactions with their update counters > 3) Partition counter low watermark (LWM) - the smallest partition counter > before which there are no prepared transactions. > > And the we send to supplier node a message: "Give me all updates starting > from that LWM plus data for that transactions which were active when I > failed". > > Am I right? > > On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor > wrote: > > > Hi Igniters, > > > > Currently I’m working on possible approaches how to implement historical > > rebalance (delta rebalance using WAL iterator) over MVCC caches. > > > > The main difficulty is that MVCC writes changes on tx active phase while > > partition update version, aka update counter, is being applied on tx > > finish.
This means we cannot start iteration over WAL right from the > > pointer where the update counter updated, but should include updates, > which > > the transaction that updated the counter did. > > > > These updates may be much earlier than the point where the update counter > > was updated, so we have to be able to identify the point where the first > > update happened. > > > > The proposed approach includes: > > > > 1) preserve list of active txs, sorted by the time of their first update > > (using WAL ptr of first WAL record in tx) > > > > 2) persist this list on each checkpoint (together with TxLog for example) > > > > 4) send whole active tx list (transactions which were in active state at > > the time the node was crushed, empty list in case of graceful node stop) > as > > a part of partition demand message. > > > > 4) find a checkpoint where the earliest tx exists in persisted txs and > use > > saved WAL ptr as a start point or apply current approach in case the > active > > tx list (sent on previous step) is empty > > > > 5) start iteration. > > > > Your thoughts? > > > > Regards, > > Igor >
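The per-checkpoint payload listed in the message above could be sketched roughly like this (the field names are invented for illustration; real Ignite checkpoint records are binary structures):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckpointRecord:
    lwm: int                             # 1) partition counter low watermark
    back_pointer: int                    # 2) WAL ptr of the earliest active tx write
    prepared_txs: List[Tuple[str, int]]  # 3) prepared txs with acquired counters


def historical_start(cp: CheckpointRecord) -> int:
    # the demand message needs no extra info: the start point is read
    # directly from the stored WAL "back-pointer"
    return cp.back_pointer
```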
Re: Historical rebalance
Just thought of this a bit more. If we look for the start of a long-running transaction in the WAL, we may go too far back into the past only to get a few entries. What if we consider a slightly different approach?

We assume that a transaction can be represented as a set of independent operations, which are applied in the same order on both primary and backup nodes. Then we can do the following:
1) When the next operation is finished, we assign the transaction the LWM of the last checkpoint. I.e. we maintain a map [Txn -> last_op_LWM].
2) If the "last_op_LWM" of a transaction has not changed between two subsequent checkpoints, we assign it the special value "UP_TO_DATE".

Now at the time of a checkpoint we get the minimal value among the current partition LWM and the active transaction LWMs, ignoring "UP_TO_DATE" values. The resulting value is the final partition counter which we will request from the supplier node. We save it to the checkpoint record. When the WAL on the demander is unwound from this value, it is guaranteed to contain all the missing data of the demander's active transactions. I.e. instead of tracking the whole active transaction, we track the part of the transaction which is possibly missing on a node.

Will that work?

On Tue, Nov 27, 2018 at 11:19 AM Vladimir Ozerov wrote: > Igor, > > Could you please elaborate - what is the whole set of information we are > going to save at checkpoint time? From what I understand this should be: > 1) List of active transactions with WAL pointers of their first writes > 2) List of prepared transactions with their update counters > 3) Partition counter low watermark (LWM) - the smallest partition counter > before which there are no prepared transactions. > > And the we send to supplier node a message: "Give me all updates starting > from that LWM plus data for that transactions which were active when I > failed". > > Am I right?
> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor wrote:
>> Hi Igniters,
>>
>> Currently I’m working on possible approaches to implementing historical
>> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>>
>> The main difficulty is that MVCC writes changes during the tx active
>> phase, while the partition update version, aka update counter, is applied
>> on tx finish. This means we cannot start iteration over WAL right from
>> the pointer where the update counter was updated, but must also include
>> the updates made by the transaction that updated the counter.
>>
>> These updates may be much earlier than the point where the update counter
>> was updated, so we have to be able to identify the point where the first
>> update happened.
>>
>> The proposed approach includes:
>>
>> 1) preserve a list of active txs, sorted by the time of their first
>> update (using the WAL ptr of the first WAL record in the tx)
>>
>> 2) persist this list on each checkpoint (together with TxLog, for example)
>>
>> 3) send the whole active tx list (transactions which were in active state
>> at the time the node crashed; an empty list in case of a graceful node
>> stop) as a part of the partition demand message
>>
>> 4) find a checkpoint where the earliest tx exists in the persisted txs
>> and use the saved WAL ptr as a start point, or apply the current approach
>> in case the active tx list (sent on the previous step) is empty
>>
>> 5) start iteration.
>>
>> Your thoughts?
>>
>> Regards,
>> Igor
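The per-transaction LWM bookkeeping suggested in the reply above (the map [Txn -> last_op_LWM], the "UP_TO_DATE" sentinel, and the minimum taken at checkpoint time) can be sketched roughly as follows. All names here are hypothetical illustrations, not Ignite's actual internals:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hedged sketch: tracks, per active transaction, the partition LWM of the
 * last checkpoint at which the tx wrote. A tx that stays unchanged across
 * two checkpoints becomes UP_TO_DATE and no longer holds back the
 * rebalance start counter.
 */
class TxLwmTracker {
    /** Sentinel meaning "this tx's data is already fully checkpointed". */
    static final long UP_TO_DATE = -1L;

    /** Map [Txn -> last_op_LWM] from the proposal. */
    private final Map<Long, Long> txLastOpLwm = new HashMap<>();

    /** A tx performed its next operation; tag it with the last checkpoint LWM. */
    void onTxOperation(long txId, long lastCheckpointLwm) {
        txLastOpLwm.put(txId, lastCheckpointLwm);
    }

    /** The tx did not advance between two subsequent checkpoints. */
    void markUnchanged(long txId) {
        txLastOpLwm.put(txId, UP_TO_DATE);
    }

    /**
     * Counter to request from the supplier node: the minimum of the current
     * partition LWM and all active tx LWMs, ignoring UP_TO_DATE values.
     */
    long startCounter(long partitionLwm) {
        long min = partitionLwm;
        for (long lwm : txLastOpLwm.values())
            if (lwm != UP_TO_DATE && lwm < min)
                min = lwm;
        return min;
    }

    public static void main(String[] args) {
        TxLwmTracker t = new TxLwmTracker();
        t.onTxOperation(1L, 40L);  // tx 1 wrote at a checkpoint with LWM 40
        t.onTxOperation(2L, 70L);  // tx 2 wrote at a later checkpoint
        t.markUnchanged(1L);       // tx 1 idle across two checkpoints
        System.out.println(t.startCounter(100L)); // prints 70
    }
}
```

The point of the sketch is that only the possibly-missing tail of each transaction pins the start counter, rather than the transaction's full history.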
Re: Historical rebalance
Igor,

Could you please elaborate - what is the whole set of information we are going to save at checkpoint time? From what I understand this should be:

1) List of active transactions with WAL pointers of their first writes
2) List of prepared transactions with their update counters
3) Partition counter low watermark (LWM) - the smallest partition counter before which there are no prepared transactions.

And then we send the supplier node a message: "Give me all updates starting from that LWM plus data for those transactions which were active when I failed".

Am I right?

On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor wrote:
> Hi Igniters,
>
> Currently I’m working on possible approaches to implementing historical
> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>
> The main difficulty is that MVCC writes changes during the tx active
> phase, while the partition update version, aka update counter, is applied
> on tx finish. This means we cannot start iteration over WAL right from
> the pointer where the update counter was updated, but must also include
> the updates made by the transaction that updated the counter.
>
> These updates may be much earlier than the point where the update counter
> was updated, so we have to be able to identify the point where the first
> update happened.
>
> The proposed approach includes:
>
> 1) preserve a list of active txs, sorted by the time of their first
> update (using the WAL ptr of the first WAL record in the tx)
>
> 2) persist this list on each checkpoint (together with TxLog, for example)
>
> 3) send the whole active tx list (transactions which were in active state
> at the time the node crashed; an empty list in case of a graceful node
> stop) as a part of the partition demand message
>
> 4) find a checkpoint where the earliest tx exists in the persisted txs
> and use the saved WAL ptr as a start point, or apply the current approach
> in case the active tx list (sent on the previous step) is empty
>
> 5) start iteration.
>
> Your thoughts?
>
> Regards,
> Igor
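The three pieces of checkpoint-time state enumerated above could be sketched as a simple structure. The class and field names below are illustrative assumptions only, not Ignite's checkpoint record format:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hedged sketch of what a checkpoint record might carry for historical
 * rebalance: active txs with first-write WAL pointers, prepared txs with
 * update counters, and the partition counter low watermark derived from
 * them. WAL pointers and counters are modeled as plain longs.
 */
class CheckpointTxState {
    /** txId -> WAL pointer of the tx's first write. */
    final Map<Long, Long> activeTxFirstWritePtr = new HashMap<>();

    /** txId -> update counter of a prepared tx. */
    final Map<Long, Long> preparedTxUpdateCntr = new HashMap<>();

    /**
     * LWM: the smallest partition counter before which there are no
     * prepared transactions (bounded above by the current partition counter).
     */
    long lowWatermark(long partitionCntr) {
        long lwm = partitionCntr;
        for (long cntr : preparedTxUpdateCntr.values())
            lwm = Math.min(lwm, cntr);
        return lwm;
    }

    public static void main(String[] args) {
        CheckpointTxState s = new CheckpointTxState();
        s.preparedTxUpdateCntr.put(1L, 42L); // tx 1 prepared at counter 42
        s.preparedTxUpdateCntr.put(2L, 17L); // tx 2 prepared at counter 17
        System.out.println(s.lowWatermark(100L)); // prints 17
    }
}
```

The demand message described above would then carry this LWM plus the active tx list, so the supplier can stream "all updates from the LWM plus the data of the listed transactions".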
Re: Historical rebalance
Igor,

Could you please clarify some points?

> 1) preserve list of active txs, sorted by the time of their first update
> (using WAL ptr of first WAL record in tx)

Is this list maintained per transaction or per checkpoint (or per something else)? Why can't we track only the oldest active transaction instead of the whole active list?

> 4) find a checkpoint where the earliest tx exists in persisted txs and use
> saved WAL ptr as a start point or apply current approach in case the active
> tx list (sent on previous step) is empty

What is the base storage state on the demanding node to which we apply the WAL records? I mean the state before applying WAL records. Do we apply all records simply one by one, or do we filter some of them out?

On Fri, 23 Nov 2018 at 11:22, Seliverstov Igor wrote:
> Hi Igniters,
>
> Currently I’m working on possible approaches to implementing historical
> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>
> The main difficulty is that MVCC writes changes during the tx active
> phase, while the partition update version, aka update counter, is applied
> on tx finish. This means we cannot start iteration over WAL right from
> the pointer where the update counter was updated, but must also include
> the updates made by the transaction that updated the counter.
>
> These updates may be much earlier than the point where the update counter
> was updated, so we have to be able to identify the point where the first
> update happened.
>
> The proposed approach includes:
>
> 1) preserve a list of active txs, sorted by the time of their first
> update (using the WAL ptr of the first WAL record in the tx)
>
> 2) persist this list on each checkpoint (together with TxLog, for example)
>
> 3) send the whole active tx list (transactions which were in active state
> at the time the node crashed; an empty list in case of a graceful node
> stop) as a part of the partition demand message
>
> 4) find a checkpoint where the earliest tx exists in the persisted txs
> and use the saved WAL ptr as a start point, or apply the current approach
> in case the active tx list (sent on the previous step) is empty
>
> 5) start iteration.
>
> Your thoughts?
>
> Regards,
> Igor

--
Best regards,
Ivan Pavlukhin
Historical rebalance
Hi Igniters,

Currently I’m working on possible approaches to implementing historical rebalance (delta rebalance using a WAL iterator) over MVCC caches.

The main difficulty is that MVCC writes changes during the tx active phase, while the partition update version, aka update counter, is applied on tx finish. This means we cannot start iteration over WAL right from the pointer where the update counter was updated, but must also include the updates made by the transaction that updated the counter.

These updates may be much earlier than the point where the update counter was updated, so we have to be able to identify the point where the first update happened.

The proposed approach includes:

1) preserve a list of active txs, sorted by the time of their first update (using the WAL ptr of the first WAL record in the tx)

2) persist this list on each checkpoint (together with TxLog, for example)

3) send the whole active tx list (transactions which were in active state at the time the node crashed; an empty list in case of a graceful node stop) as a part of the partition demand message

4) find a checkpoint where the earliest tx exists in the persisted txs and use the saved WAL ptr as a start point, or apply the current approach in case the active tx list (sent on the previous step) is empty

5) start iteration.

Your thoughts?

Regards,
Igor
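Step 1 of the proposal - an active tx list ordered by the WAL pointer of each transaction's first write - might look roughly like the sketch below. The types are hypothetical (WAL pointers are modeled as plain longs), not Ignite's actual internals:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hedged sketch: active transactions kept ordered by the WAL pointer of
 * their first record, so the earliest first-update position is cheap to
 * read at checkpoint time and can be persisted alongside the checkpoint.
 */
class ActiveTxList {
    /** First-write WAL pointer -> txId; iteration order is oldest first. */
    private final TreeMap<Long, Long> byFirstWritePtr = new TreeMap<>();

    /** txId -> its first-write WAL pointer, for O(1) removal on tx finish. */
    private final Map<Long, Long> ptrByTx = new HashMap<>();

    /** Record a tx's first write; later writes of the same tx are ignored. */
    void onWrite(long txId, long walPtr) {
        if (ptrByTx.putIfAbsent(txId, walPtr) == null)
            byFirstWritePtr.put(walPtr, txId);
    }

    /** Tx committed or rolled back - it no longer pins the start pointer. */
    void onTxFinish(long txId) {
        Long ptr = ptrByTx.remove(txId);
        if (ptr != null)
            byFirstWritePtr.remove(ptr);
    }

    /**
     * WAL pointer to start historical iteration from: the first write of
     * the earliest still-active tx, or the checkpoint pointer if the
     * active list is empty (the "current approach" fallback in step 4).
     */
    long rebalanceStartPtr(long checkpointPtr) {
        return byFirstWritePtr.isEmpty()
            ? checkpointPtr
            : Math.min(checkpointPtr, byFirstWritePtr.firstKey());
    }
}
```

Persisting this list at each checkpoint (step 2) then amounts to serializing `byFirstWritePtr` next to the checkpoint record, and the demand message (step 3) carries the surviving tx ids.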
[jira] [Created] (IGNITE-10117) Node is mistakenly excluded from history suppliers preventing historical rebalance.
Alexei Scherbakov created IGNITE-10117:
--

Summary: Node is mistakenly excluded from history suppliers preventing historical rebalance.
Key: IGNITE-10117
URL: https://issues.apache.org/jira/browse/IGNITE-10117
Project: Ignite
Issue Type: Bug
Reporter: Alexei Scherbakov
Fix For: 2.8

This is because org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#reserveHistoryForExchange is called before org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#beforeExchange, which restores the correct partition state.

{noformat}
public void testHistory() throws Exception {
    IgniteEx crd = startGrid(0);
    startGrid(1);

    crd.cluster().active(true);

    awaitPartitionMapExchange();

    int part = 0;

    List keys = loadDataToPartition(part, DEFAULT_CACHE_NAME, 100, 0, 1);

    forceCheckpoint(); // Prevent IGNITE-10088

    stopAllGrids();

    awaitPartitionMapExchange();

    List keys1 = loadDataToPartition(part, DEFAULT_CACHE_NAME, 100, 100, 1);

    startGrid(0);
    startGrid(1);

    awaitPartitionMapExchange();

    // grid0 will not provide history.
}
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (IGNITE-9827) Assertion error due historical rebalance
ARomantsov created IGNITE-9827:
--

Summary: Assertion error due historical rebalance
Key: IGNITE-9827
URL: https://issues.apache.org/jira/browse/IGNITE-9827
Project: Ignite
Issue Type: Bug
Components: persistence
Affects Versions: 2.7
Reporter: ARomantsov
Fix For: 2.8

I hit the following situation:

1) Start two nodes with '-DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=0'
2) Preload
3) Stop node 2
4) Load
5) Corrupt all WAL archive files on node 1
6) Start node 2

And found an assertion in the coordinator log - is it ok?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (IGNITE-8946) AssertionError can occur during reservation of WAL history for historical rebalance
Ivan Rakov created IGNITE-8946:
--

Summary: AssertionError can occur during reservation of WAL history for historical rebalance
Key: IGNITE-8946
URL: https://issues.apache.org/jira/browse/IGNITE-8946
Project: Ignite
Issue Type: Bug
Reporter: Ivan Rakov

Attempt to release WAL after exchange may fail with AssertionError. Seems like we have a bug and may try to release more WAL segments than we have reserved:

{noformat}
java.lang.AssertionError: null
	at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentReservationStorage.release(SegmentReservationStorage.java:54)
	- locked <0x1c12> (a org.apache.ignite.internal.processors.cache.persistence.wal.SegmentReservationStorage)
	at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:862)
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForExchange(GridCacheDatabaseSharedManager.java:1691)
	- locked <0x1c17> (a org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:1751)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:2858)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:2591)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2283)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:129)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2140)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2128)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2128)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1580)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1000(GridCachePartitionExchangeManager.java:138)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:345)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:325)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2848)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2827)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManag
[GitHub] ignite pull request #3791: IGNITE-8116 Historical rebalance fixes
Github user Jokser closed the pull request at: https://github.com/apache/ignite/pull/3791 ---
[jira] [Created] (IGNITE-8390) WAL historical rebalance is not able to process cache.remove() updates
Pavel Kovalenko created IGNITE-8390:
---

Summary: WAL historical rebalance is not able to process cache.remove() updates
Key: IGNITE-8390
URL: https://issues.apache.org/jira/browse/IGNITE-8390
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko

WAL historical rebalance fails on the supplier when processing an entry remove, with the following assertion:

{noformat}
java.lang.AssertionError: GridCacheEntryInfo [key=KeyCacheObjectImpl [part=-1, val=2, hasValBytes=true], cacheId=94416770, val=null, ttl=0, expireTime=0, ver=GridCacheVersion [topVer=136155335, order=1524675346187, nodeOrder=1], isNew=false, deleted=false]
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplyMessage.addEntry0(GridDhtPartitionSupplyMessage.java:220)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:381)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}

Obviously this assertion is correct only for full rebalance. We should either soften the assertion for the historical rebalance case or disable it. With the assertion disabled, everything works well and rebalance finishes properly.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
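The proposed softening of the assertion could look like the sketch below. The method and its signature are hypothetical illustrations, not the actual Ignite supplier code: the idea is simply that a null value is legal when the entry comes from WAL replay of a cache.remove().

```java
/**
 * Hedged sketch of the fix idea: only full rebalance requires a non-null
 * value, because historical rebalance legitimately streams remove records
 * whose value is null.
 */
class SupplyEntryCheck {
    static boolean entryValid(Object val, boolean historicalRebalance) {
        // Removed entries carry no value; that is expected during WAL iteration.
        return historicalRebalance || val != null;
    }

    public static void main(String[] args) {
        System.out.println(entryValid(null, true));  // prints true
        System.out.println(entryValid(null, false)); // prints false
    }
}
```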
[GitHub] ignite pull request #3791: IGNITE-8116 Historical rebalance fixes
GitHub user Jokser opened a pull request:

    https://github.com/apache/ignite/pull/3791

    IGNITE-8116 Historical rebalance fixes

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gridgain/apache-ignite ignite-8116-8122

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/ignite/pull/3791.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3791

commit 604ee719b304d0b4cf4caabaa6fa16b5a980e04e
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-04T09:33:10Z

    IGNITE-8122 Restore partition state to OWNING if unable to read from page memory.

commit c43c598916bd9bda2f624a214cefcc28b0380a5c
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-05T14:26:53Z

    IGNITE-8122 Restore partitions when persistence is enabled with OWNING default state.

commit a061cdba3a46f5384910fecf89366101f904ccdf
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-05T14:50:20Z

    IGNITE-8122 Move OWN logic to GridDhtLocalPartition constructor.

commit 9259407a462633a68de6dcd7d3ae135c8c7c0b37
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-05T17:15:39Z

    IGNITE-8122 Docs.

commit 20979dc4874cfeca649b29d8827d522f3e57ee67
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-05T17:16:54Z

    Merge branch 'master' into ignite-8122

commit 791ef918335e66d274a10c2b5d21c9da1c212a0b
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-06T12:53:21Z

    IGNITE-8122 Restore partition in OWNING state correctly.

commit 56bdb20513731f49bf3a2b51f672396efc16bfe1
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-09T11:42:59Z

    IGNITE-8122 Restore partition states on before exchange in case of starting cache group.

commit dcea5b47a5a08f165930a2e0235ecf51385f4997
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-09T11:55:53Z

    IGNITE-8122 Fixed test with auto-activation.

commit 915788c7ea25f04ccacfa06e1f19c06c92f7c141
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-09T15:32:59Z

    IGNITE-8116 WIP

commit d224cd5663415dc8636f6ba01bfd5002fdb35a3e
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-10T10:59:20Z

    Merge branch 'ignite-8122' into ignite-8116-8122

commit e988272e57760ee44753314ad8db2adab344a6c0
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-10T11:47:04Z

    IGNITE-8116 WIP

commit bd53648935b24fb2c68a885db81656f2d960ba1e
Author: Pavel Kovalenko <jokserfn@...>
Date: 2018-04-10T18:09:32Z

    IGNITE-8116 WAL rebalance issues.

---