Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-06-21 Thread ткаленко кирилл
Created the second task for this discussion: IGNITE-14952.

17.06.2021, 14:26, "ткаленко кирилл" :
> Created the first task for this discussion: IGNITE-14923.
>
> 13.05.2021, 18:37, "Stanislav Lukyanov" :
>>  What I mean by degradation when archive size < min is that, for example, 
>> historical rebalance is available for a smaller timespan than expected by 
>> the system design.
>>  It may not be an issue of course, especially for a new cluster. If 
>> "degradation" is the wrong word we can call it "non-steady state" :)
>>  In any case, I think we're on the same page.
>>
>>>   On 11 May 2021, at 13:18, Andrey Gura  wrote:
>>>
>>>   Stan
>>>
>>>>   If archive size is less than min or more than max then the system 
>>>> functionality can degrade (e.g. historical rebalance may not work as 
>>>> expected).
>>>
>>>   Why does the condition "archive size is less than min" lead to system
>>>   degradation? Actually, the described case is a normal situation for
>>>   brand new clusters.
>>>
>>>   I'm okay with the proposed minWalArchiveSize property. It looks like a
>>>   relatively understandable property.
>>>
>>>   On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov
>>>    wrote:
>>>>   Discussed this with Kirill verbally.
>>>>
>>>>   Kirill showed me that having the min threshold doesn't quite work.
>>>>   It doesn't work because we no longer know how much WAL we should remove 
>>>> if we reach getMaxWalArchiveSize.
>>>>
>>>>   For example, say we have minWalArchiveTimespan=2 hours and 
>>>> maxWalArchiveSize=2GB.
>>>>   Say, under normal load on stable topology 2 hours of WAL use 1 GB of 
>>>> space.
>>>>   Now, say we're doing historical rebalance and reserve the WAL archive.
>>>>   The WAL archive starts growing and soon it occupies 2 GB.
>>>>   Now what?
>>>>   We're supposed to give up WAL reservations and start aggressively
>>>> removing WAL archive.
>>>>   But it is not clear when we can stop removing the WAL archive - since the
>>>> last 2 hours of WAL are larger than our maxWalArchiveSize,
>>>>   there is no meaningful point the system can use as a "minimum" WAL size.
>>>>
>>>>   I understand the description above is a bit messy but I believe that 
>>>> whoever is interested in this will understand it
>>>>   after drawing this on paper.
>>>>
>>>>   I'm giving up on my latest suggestion about time-based minimum. Let's 
>>>> keep it simple.
>>>>
>>>>   I suggest the minWalArchiveSize and maxWalArchiveSize properties as the 
>>>> solution,
>>>>   with the behavior as initially described by Kirill.
>>>>
>>>>   Stan
>>>>
>>>>>   On 7 May 2021, at 15:09, ткаленко кирилл  wrote:
>>>>>
>>>>>   Hi Stas!
>>>>>
>>>>>   I didn't quite get your last idea.
>>>>>   What will we do if we reach getMaxWalArchiveSize? Shall we not delete 
>>>>> the segment until minWalArchiveTimespan?
>>>>>
>>>>>   06.05.2021, 20:00, "Stanislav Lukyanov" :
>>>>>>   An interesting suggestion I heard today.
>>>>>>
>>>>>>   The minWalArchiveSize property might actually be minWalArchiveTimespan 
>>>>>> - i.e. be a number of seconds instead of a number of bytes!
>>>>>>
>>>>>>   I think this makes perfect sense from the user point of view.
>>>>>>   "I want to have WAL archive for at least N hours but I have a limit of 
>>>>>> M gigabytes to store it".
>>>>>>
>>>>>>   Do we have checkpoint timestamp stored anywhere? (cp start markers?)
>>>>>>   Perhaps we can actually implement this?
>>>>>>
>>>>>>   Thanks,
>>>>>>   Stan
>>>>>>
>>>>>>>   On 6 May 2021, at 14:13, Stanislav Lukyanov  
>>>>>>> wrote:
>>>>>>>
>>>>>>>   +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
>>>>>>>   +1 to add a public property to replace 
>>>>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>>>>
>>>>>>>   I don't like the name getWalArchiveSize - I think it's a bit confusing 
>>>>>>> (is it the current size? the minimal size? the target size?)

Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-06-17 Thread ткаленко кирилл
Created the first task for this discussion: IGNITE-14923.

13.05.2021, 18:37, "Stanislav Lukyanov" :
> What I mean by degradation when archive size < min is that, for example, 
> historical rebalance is available for a smaller timespan than expected by the 
> system design.
> It may not be an issue of course, especially for a new cluster. If 
> "degradation" is the wrong word we can call it "non-steady state" :)
> In any case, I think we're on the same page.
>
>>  On 11 May 2021, at 13:18, Andrey Gura  wrote:
>>
>>  Stan
>>
>>>  If archive size is less than min or more than max then the system 
>>> functionality can degrade (e.g. historical rebalance may not work as 
>>> expected).
>>
>>  Why does the condition "archive size is less than min" lead to system
>>  degradation? Actually, the described case is a normal situation for
>>  brand new clusters.
>>
>>  I'm okay with the proposed minWalArchiveSize property. It looks like a
>>  relatively understandable property.
>>
>>  On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov
>>   wrote:
>>>  Discussed this with Kirill verbally.
>>>
>>>  Kirill showed me that having the min threshold doesn't quite work.
>>>  It doesn't work because we no longer know how much WAL we should remove if 
>>> we reach getMaxWalArchiveSize.
>>>
>>>  For example, say we have minWalArchiveTimespan=2 hours and 
>>> maxWalArchiveSize=2GB.
>>>  Say, under normal load on stable topology 2 hours of WAL use 1 GB of space.
>>>  Now, say we're doing historical rebalance and reserve the WAL archive.
>>>  The WAL archive starts growing and soon it occupies 2 GB.
>>>  Now what?
>>>  We're supposed to give up WAL reservations and start aggressively removing 
>>> WAL archive.
>>>  But it is not clear when we can stop removing the WAL archive - since the 
>>> last 2 hours of WAL are larger than our maxWalArchiveSize,
>>>  there is no meaningful point the system can use as a "minimum" WAL size.
>>>
>>>  I understand the description above is a bit messy but I believe that 
>>> whoever is interested in this will understand it
>>>  after drawing this on paper.
>>>
>>>  I'm giving up on my latest suggestion about time-based minimum. Let's keep 
>>> it simple.
>>>
>>>  I suggest the minWalArchiveSize and maxWalArchiveSize properties as the 
>>> solution,
>>>  with the behavior as initially described by Kirill.
>>>
>>>  Stan
>>>
>>>>  On 7 May 2021, at 15:09, ткаленко кирилл  wrote:
>>>>
>>>>  Hi Stas!
>>>>
>>>>  I didn't quite get your last idea.
>>>>  What will we do if we reach getMaxWalArchiveSize? Shall we not delete the 
>>>> segment until minWalArchiveTimespan?
>>>>
>>>>  06.05.2021, 20:00, "Stanislav Lukyanov" :
>>>>>  An interesting suggestion I heard today.
>>>>>
>>>>>  The minWalArchiveSize property might actually be minWalArchiveTimespan - 
>>>>> i.e. be a number of seconds instead of a number of bytes!
>>>>>
>>>>>  I think this makes perfect sense from the user point of view.
>>>>>  "I want to have WAL archive for at least N hours but I have a limit of M 
>>>>> gigabytes to store it".
>>>>>
>>>>>  Do we have checkpoint timestamp stored anywhere? (cp start markers?)
>>>>>  Perhaps we can actually implement this?
>>>>>
>>>>>  Thanks,
>>>>>  Stan
>>>>>
>>>>>>  On 6 May 2021, at 14:13, Stanislav Lukyanov  
>>>>>> wrote:
>>>>>>
>>>>>>  +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
>>>>>>  +1 to add a public property to replace 
>>>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>>>
>>>>>>  I don't like the name getWalArchiveSize - I think it's a bit confusing 
>>>>>> (is it the current size? the minimal size? the target size?)
>>>>>>  I suggest naming the property getMinWalArchiveSize. I think that this 
>>>>>> is exactly what it is - the minimal size of the archive that we want to 
>>>>>> have.
>>>>>>  The archive size at all times should be between min and max.
>>>>>>  If archive size is less than min or more than max then the system 
>>>>>> functionality can degrade (e.g. historical rebalance may not work as 
>>>>>> expected).
>

Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-13 Thread Stanislav Lukyanov
What I mean by degradation when archive size < min is that, for example, 
historical rebalance is available for a smaller timespan than expected by the 
system design.
It may not be an issue of course, especially for a new cluster. If 
"degradation" is the wrong word we can call it "non-steady state" :) 
In any case, I think we're on the same page.


> On 11 May 2021, at 13:18, Andrey Gura  wrote:
> 
> Stan
> 
>> If archive size is less than min or more than max then the system 
>> functionality can degrade (e.g. historical rebalance may not work as 
>> expected).
> 
> Why does the condition "archive size is less than min" lead to system
> degradation? Actually, the described case is a normal situation for
> brand new clusters.
> 
> I'm okay with the proposed minWalArchiveSize property. It looks like a
> relatively understandable property.
> 
> On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov
>  wrote:
>> 
>> Discussed this with Kirill verbally.
>> 
>> Kirill showed me that having the min threshold doesn't quite work.
>> It doesn't work because we no longer know how much WAL we should remove if 
>> we reach getMaxWalArchiveSize.
>> 
>> For example, say we have minWalArchiveTimespan=2 hours and 
>> maxWalArchiveSize=2GB.
>> Say, under normal load on stable topology 2 hours of WAL use 1 GB of space.
>> Now, say we're doing historical rebalance and reserve the WAL archive.
>> The WAL archive starts growing and soon it occupies 2 GB.
>> Now what?
>> We're supposed to give up WAL reservations and start aggressively removing 
>> WAL archive.
>> But it is not clear when we can stop removing the WAL archive - since the 
>> last 2 hours of WAL are larger than our maxWalArchiveSize,
>> there is no meaningful point the system can use as a "minimum" WAL size.
>> 
>> I understand the description above is a bit messy but I believe that whoever 
>> is interested in this will understand it
>> after drawing this on paper.
>> 
>> 
>> I'm giving up on my latest suggestion about time-based minimum. Let's keep 
>> it simple.
>> 
>> I suggest the minWalArchiveSize and maxWalArchiveSize properties as the 
>> solution,
>> with the behavior as initially described by Kirill.
>> 
>> Stan
>> 
>> 
>>> On 7 May 2021, at 15:09, ткаленко кирилл  wrote:
>>> 
>>> Hi Stas!
>>> 
>>> I didn't quite get your last idea.
>>> What will we do if we reach getMaxWalArchiveSize? Shall we not delete the 
>>> segment until minWalArchiveTimespan?
>>> 
>>> 06.05.2021, 20:00, "Stanislav Lukyanov" :
>>>> An interesting suggestion I heard today.
>>>> 
>>>> The minWalArchiveSize property might actually be minWalArchiveTimespan - 
>>>> i.e. be a number of seconds instead of a number of bytes!
>>>> 
>>>> I think this makes perfect sense from the user point of view.
>>>> "I want to have WAL archive for at least N hours but I have a limit of M 
>>>> gigabytes to store it".
>>>> 
>>>> Do we have checkpoint timestamp stored anywhere? (cp start markers?)
>>>> Perhaps we can actually implement this?
>>>> 
>>>> Thanks,
>>>> Stan
>>>> 
>>>>> On 6 May 2021, at 14:13, Stanislav Lukyanov  
>>>>> wrote:
>>>>> 
>>>>> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
>>>>> +1 to add a public property to replace 
>>>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>> 
>>>>> I don't like the name getWalArchiveSize - I think it's a bit confusing 
>>>>> (is it the current size? the minimal size? the target size?)
>>>>> I suggest naming the property getMinWalArchiveSize. I think that this is 
>>>>> exactly what it is - the minimal size of the archive that we want to have.
>>>>> The archive size at all times should be between min and max.
>>>>> If archive size is less than min or more than max then the system 
>>>>> functionality can degrade (e.g. historical rebalance may not work as 
>>>>> expected).
>>>>> I think these rules are intuitively understood from the "min" and "max" 
>>>>> names.
>>>>> 
>>>>> Ilya's suggestion about throttling is great although I'd do this in a 
>>>>> different ticket.
>>>>> 
>>>>> Thanks,
>>>>> Stan
>

Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-11 Thread Andrey Gura
Stan

> If archive size is less than min or more than max then the system 
> functionality can degrade (e.g. historical rebalance may not work as 
> expected).

Why does the condition "archive size is less than min" lead to system
degradation? Actually, the described case is a normal situation for
brand new clusters.

I'm okay with the proposed minWalArchiveSize property. It looks like a
relatively understandable property.

On Sun, May 9, 2021 at 7:12 PM Stanislav Lukyanov
 wrote:
>
> Discussed this with Kirill verbally.
>
> Kirill showed me that having the min threshold doesn't quite work.
> It doesn't work because we no longer know how much WAL we should remove if we 
> reach getMaxWalArchiveSize.
>
> For example, say we have minWalArchiveTimespan=2 hours and 
> maxWalArchiveSize=2GB.
> Say, under normal load on stable topology 2 hours of WAL use 1 GB of space.
> Now, say we're doing historical rebalance and reserve the WAL archive.
> The WAL archive starts growing and soon it occupies 2 GB.
> Now what?
> We're supposed to give up WAL reservations and start aggressively removing WAL 
> archive.
> But it is not clear when we can stop removing the WAL archive - since the last 2 
> hours of WAL are larger than our maxWalArchiveSize,
> there is no meaningful point the system can use as a "minimum" WAL size.
>
> I understand the description above is a bit messy but I believe that whoever 
> is interested in this will understand it
> after drawing this on paper.
>
>
> I'm giving up on my latest suggestion about time-based minimum. Let's keep it 
> simple.
>
> I suggest the minWalArchiveSize and maxWalArchiveSize properties as the 
> solution,
> with the behavior as initially described by Kirill.
>
> Stan
>
>
> > On 7 May 2021, at 15:09, ткаленко кирилл  wrote:
> >
> > Hi Stas!
> >
> > I didn't quite get your last idea.
> > What will we do if we reach getMaxWalArchiveSize? Shall we not delete the 
> > segment until minWalArchiveTimespan?
> >
> > 06.05.2021, 20:00, "Stanislav Lukyanov" :
> >> An interesting suggestion I heard today.
> >>
> >> The minWalArchiveSize property might actually be minWalArchiveTimespan - 
> >> i.e. be a number of seconds instead of a number of bytes!
> >>
> >> I think this makes perfect sense from the user point of view.
> >> "I want to have WAL archive for at least N hours but I have a limit of M 
> >> gigabytes to store it".
> >>
> >> Do we have checkpoint timestamp stored anywhere? (cp start markers?)
> >> Perhaps we can actually implement this?
> >>
> >> Thanks,
> >> Stan
> >>
> >>>  On 6 May 2021, at 14:13, Stanislav Lukyanov  
> >>> wrote:
> >>>
> >>>  +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
> >>>  +1 to add a public property to replace 
> >>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
> >>>
> >>>  I don't like the name getWalArchiveSize - I think it's a bit confusing 
> >>> (is it the current size? the minimal size? the target size?)
> >>>  I suggest naming the property getMinWalArchiveSize. I think that this 
> >>> is exactly what it is - the minimal size of the archive that we want to 
> >>> have.
> >>>  The archive size at all times should be between min and max.
> >>>  If archive size is less than min or more than max then the system 
> >>> functionality can degrade (e.g. historical rebalance may not work as 
> >>> expected).
> >>>  I think these rules are intuitively understood from the "min" and "max" 
> >>> names.
> >>>
> >>>  Ilya's suggestion about throttling is great although I'd do this in a 
> >>> different ticket.
> >>>
> >>>  Thanks,
> >>>  Stan
> >>>
> >>>>  On 5 May 2021, at 19:25, Maxim Muzafarov  wrote:
> >>>>
> >>>>  Hello, Kirill
> >>>>
> >>>>  +1 for this change, however, there are too many configuration settings
> >>>>  that exist for the user to configure an Ignite cluster. It is better to
> >>>>  keep the options that we already have and fix the behaviour of the
> >>>>  rebalance process as you suggested.
> >>>>
> >>>>  On Tue, 4 May 2021 at 19:01, ткаленко кирилл  
> >>>> wrote:
> >>>>>  Hi Ilya!
> >>>>>
> >>>>>  Then we can greatly reduce the user load on the cluster until the 
> >>>>> rebalance is over, which can be critical for the user.

Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-09 Thread Stanislav Lukyanov
Discussed this with Kirill verbally.

Kirill showed me that having the min threshold doesn't quite work.
It doesn't work because we no longer know how much WAL we should remove if we 
reach getMaxWalArchiveSize.

For example, say we have minWalArchiveTimespan=2 hours and 
maxWalArchiveSize=2GB.
Say, under normal load on stable topology 2 hours of WAL use 1 GB of space.
Now, say we're doing historical rebalance and reserve the WAL archive.
The WAL archive starts growing and soon it occupies 2 GB.
Now what?
We're supposed to give up WAL reservations and start aggressively removing WAL 
archive.
But it is not clear when we can stop removing the WAL archive - since the last 2 
hours of WAL are larger than our maxWalArchiveSize,
there is no meaningful point the system can use as a "minimum" WAL size.
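The dead end above can be reduced to toy arithmetic. The sketch below is not Ignite code; it just encodes the rule "cleanup may only remove segments older than minWalArchiveTimespan", with the illustrative sizes from this example:

```java
// Toy model of a time-based minimum: the smallest size the archive can shrink
// to is the amount of WAL written within the last minWalArchiveTimespan. Once
// that window alone exceeds maxWalArchiveSize, no target below the cap exists.
class TimeBasedMinDemo {
    /** Size the archive can shrink to, or -1 if there is no target below the cap. */
    static long cleanupTarget(long lastWindowBytes, long maxArchiveBytes) {
        return lastWindowBytes < maxArchiveBytes ? lastWindowBytes : -1;
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        // Steady state: the last 2 hours of WAL occupy 1 GB, the cap is 2 GB.
        System.out.println(cleanupTarget(GB, 2 * GB));         // prints 1073741824
        // Under rebalance load the same 2 hours occupy 2.5 GB - no valid target.
        System.out.println(cleanupTarget(5 * GB / 2, 2 * GB)); // prints -1
    }
}
```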

I understand the description above is a bit messy but I believe that whoever is 
interested in this will understand it
after drawing this on paper.


I'm giving up on my latest suggestion about time-based minimum. Let's keep it 
simple.

I suggest the minWalArchiveSize and maxWalArchiveSize properties as the 
solution,
with the behavior as initially described by Kirill.

Stan


> On 7 May 2021, at 15:09, ткаленко кирилл  wrote:
> 
> Hi Stas!
> 
> I didn't quite get your last idea. 
> What will we do if we reach getMaxWalArchiveSize? Shall we not delete the 
> segment until minWalArchiveTimespan?
> 
> 06.05.2021, 20:00, "Stanislav Lukyanov" :
>> An interesting suggestion I heard today.
>> 
>> The minWalArchiveSize property might actually be minWalArchiveTimespan - 
>> i.e. be a number of seconds instead of a number of bytes!
>> 
>> I think this makes perfect sense from the user point of view.
>> "I want to have WAL archive for at least N hours but I have a limit of M 
>> gigabytes to store it".
>> 
>> Do we have checkpoint timestamp stored anywhere? (cp start markers?)
>> Perhaps we can actually implement this?
>> 
>> Thanks,
>> Stan
>> 
>>>  On 6 May 2021, at 14:13, Stanislav Lukyanov  wrote:
>>> 
>>>  +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
>>>  +1 to add a public property to replace 
>>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>> 
>>>  I don't like the name getWalArchiveSize - I think it's a bit confusing (is 
>>> it the current size? the minimal size? the target size?)
>>>  I suggest naming the property getMinWalArchiveSize. I think that this is 
>>> exactly what it is - the minimal size of the archive that we want to have.
>>>  The archive size at all times should be between min and max.
>>>  If archive size is less than min or more than max then the system 
>>> functionality can degrade (e.g. historical rebalance may not work as 
>>> expected).
>>>  I think these rules are intuitively understood from the "min" and "max" 
>>> names.
>>> 
>>>  Ilya's suggestion about throttling is great although I'd do this in a 
>>> different ticket.
>>> 
>>>  Thanks,
>>>  Stan
>>> 
>>>>  On 5 May 2021, at 19:25, Maxim Muzafarov  wrote:
>>>> 
>>>>  Hello, Kirill
>>>> 
>>>>  +1 for this change, however, there are too many configuration settings
>>>>  that exist for the user to configure an Ignite cluster. It is better to
>>>>  keep the options that we already have and fix the behaviour of the
>>>>  rebalance process as you suggested.
>>>> 
>>>>  On Tue, 4 May 2021 at 19:01, ткаленко кирилл  wrote:
>>>>>  Hi Ilya!
>>>>> 
>>>>>  Then we can greatly reduce the user load on the cluster until the 
>>>>> rebalance is over, which can be critical for the user.
>>>>> 
>>>>>  04.05.2021, 18:43, "Ilya Kasnacheev" :
>>>>>>  Hello!
>>>>>> 
>>>>>>  Maybe we can have a mechanic here similar (or equal) to checkpoint based
>>>>>>  write throttling?
>>>>>> 
>>>>>>  So we will be throttling for both checkpoint page buffer and WAL limit.
>>>>>> 
>>>>>>  Regards,
>>>>>>  --
>>>>>>  Ilya Kasnacheev
>>>>>> 
>>>>>>  Tue, 4 May 2021 at 11:29, ткаленко кирилл :
>>>>>> 
>>>>>>>  Hello everybody!
>>>>>>> 
>>>>>>>  At the moment, if there are partitions for which historical
>>>>>>>  rebalance will be used, we reserve segments in the WAL archive (we do
>>>>>>>  not allow cleaning it) until the rebalance for all cache groups is over.

Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-07 Thread ткаленко кирилл
Hi Stas!

I didn't quite get your last idea. 
What will we do if we reach getMaxWalArchiveSize? Shall we not delete the 
segment until minWalArchiveTimespan?

06.05.2021, 20:00, "Stanislav Lukyanov" :
> An interesting suggestion I heard today.
>
> The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. 
> be a number of seconds instead of a number of bytes!
>
> I think this makes perfect sense from the user point of view.
> "I want to have WAL archive for at least N hours but I have a limit of M 
> gigabytes to store it".
>
> Do we have checkpoint timestamp stored anywhere? (cp start markers?)
> Perhaps we can actually implement this?
>
> Thanks,
> Stan
>
>>  On 6 May 2021, at 14:13, Stanislav Lukyanov  wrote:
>>
>>  +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
>>  +1 to add a public property to replace 
>> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>
>>  I don't like the name getWalArchiveSize - I think it's a bit confusing (is 
>> it the current size? the minimal size? the target size?)
>>  I suggest naming the property getMinWalArchiveSize. I think that this is 
>> exactly what it is - the minimal size of the archive that we want to have.
>>  The archive size at all times should be between min and max.
>>  If archive size is less than min or more than max then the system 
>> functionality can degrade (e.g. historical rebalance may not work as 
>> expected).
>>  I think these rules are intuitively understood from the "min" and "max" 
>> names.
>>
>>  Ilya's suggestion about throttling is great although I'd do this in a 
>> different ticket.
>>
>>  Thanks,
>>  Stan
>>
>>>  On 5 May 2021, at 19:25, Maxim Muzafarov  wrote:
>>>
>>>  Hello, Kirill
>>>
>>>  +1 for this change, however, there are too many configuration settings
>>>  that exist for the user to configure an Ignite cluster. It is better to
>>>  keep the options that we already have and fix the behaviour of the
>>>  rebalance process as you suggested.
>>>
>>>  On Tue, 4 May 2021 at 19:01, ткаленко кирилл  wrote:
>>>>  Hi Ilya!
>>>>
>>>>  Then we can greatly reduce the user load on the cluster until the 
>>>> rebalance is over, which can be critical for the user.
>>>>
>>>>  04.05.2021, 18:43, "Ilya Kasnacheev" :
>>>>>  Hello!
>>>>>
>>>>>  Maybe we can have a mechanic here similar (or equal) to checkpoint based
>>>>>  write throttling?
>>>>>
>>>>>  So we will be throttling for both checkpoint page buffer and WAL limit.
>>>>>
>>>>>  Regards,
>>>>>  --
>>>>>  Ilya Kasnacheev
>>>>>
>>>>>  Tue, 4 May 2021 at 11:29, ткаленко кирилл :
>>>>>
>>>>>>  Hello everybody!
>>>>>>
>>>>>>  At the moment, if there are partitions for which historical
>>>>>>  rebalance will be used, we reserve segments in the WAL
>>>>>>  archive (we do not allow cleaning the WAL archive) until the rebalance 
>>>>>> for
>>>>>>  all cache groups is over.
>>>>>>
>>>>>>  If a cluster is under load during the rebalance, WAL archive size may
>>>>>>  significantly exceed limits set in
>>>>>>  DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>>>>>  complete. This may lead to user issues and nodes may crash with the "No
>>>>>>  space left on device" error.
>>>>>>
>>>>>>  We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>>>  (0.5 by default), which sets the threshold (as a fraction of
>>>>>>  getMaxWalArchiveSize) down to which the WAL archive will be cleared,
>>>>>>  i.e. the size of the WAL archive that will always remain on the node.
>>>>>>  I propose to replace this system property with
>>>>>>  DataStorageConfiguration#getWalArchiveSize in bytes, defaulting to
>>>>>>  (getMaxWalArchiveSize * 0.5) as it is now.
>>>>>>
>>>>>>  Main proposal:
>>>>>>  When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>>>>>  existing WAL segment reservations and do not grant new ones until we reach
>>>>>>  DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
>>>>>>  segment for historical rebalance, we will automatically switch to full
>>>>>>  rebalance.
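A minimal sketch of the behaviour proposed above (assumptions, not Ignite internals): reaching the maximum cancels the WAL reservation, and the oldest segments are then removed until the archive drops to the lower threshold, which defaults to half the maximum just like today's 0.5 percentage:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the proposed trigger: on reaching maxWalArchiveSize the WAL
// reservation is cancelled (historical rebalance falls back to full rebalance
// if its segments are gone) and the oldest segments are deleted until the
// archive shrinks to the lower threshold (getWalArchiveSize).
class WalArchiveSketch {
    final long maxWalArchiveSize;
    final long walArchiveSize;                       // lower threshold, bytes
    final Deque<Long> segments = new ArrayDeque<>(); // oldest first, sizes in bytes
    long totalBytes;
    boolean reserved;                                // reserved for historical rebalance

    WalArchiveSketch(long maxSize) {
        maxWalArchiveSize = maxSize;
        walArchiveSize = maxSize / 2;                // default: maxWalArchiveSize * 0.5
    }

    /** Called when a segment is rolled over into the archive. */
    void addSegment(long bytes) {
        segments.addLast(bytes);
        totalBytes += bytes;
        if (totalBytes >= maxWalArchiveSize) {
            reserved = false;                        // cancel reservation, allow cleanup
            while (totalBytes > walArchiveSize && !segments.isEmpty())
                totalBytes -= segments.pollFirst();
        }
    }
}
```

For example, with maxWalArchiveSize = 1000 bytes, the fourth 300-byte segment trips the cap: the reservation is dropped and the archive is trimmed from 1200 down to 300 bytes, below the 500-byte threshold.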


Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-06 Thread Stanislav Lukyanov
An interesting suggestion I heard today.

The minWalArchiveSize property might actually be minWalArchiveTimespan - i.e. 
be a number of seconds instead of a number of bytes!

I think this makes perfect sense from the user point of view.
"I want to have WAL archive for at least N hours but I have a limit of M 
gigabytes to store it".

Do we have checkpoint timestamp stored anywhere? (cp start markers?)
Perhaps we can actually implement this?

Thanks,
Stan


> On 6 May 2021, at 14:13, Stanislav Lukyanov  wrote:
> 
> +1 to cancel WAL reservation on reaching getMaxWalArchiveSize
> +1 to add a public property to replace 
> IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
> 
> I don't like the name getWalArchiveSize - I think it's a bit confusing (is it 
> the current size? the minimal size? the target size?)
> I suggest naming the property getMinWalArchiveSize. I think that this is 
> exactly what it is - the minimal size of the archive that we want to have.
> The archive size at all times should be between min and max.
> If archive size is less than min or more than max then the system 
> functionality can degrade (e.g. historical rebalance may not work as 
> expected).
> I think these rules are intuitively understood from the "min" and "max" names.
> 
> Ilya's suggestion about throttling is great although I'd do this in a 
> different ticket.
> 
> Thanks,
> Stan
> 
>> On 5 May 2021, at 19:25, Maxim Muzafarov  wrote:
>> 
>> Hello, Kirill
>> 
>> +1 for this change, however, there are too many configuration settings
>> that exist for the user to configure an Ignite cluster. It is better to
>> keep the options that we already have and fix the behaviour of the
>> rebalance process as you suggested.
>> 
>> On Tue, 4 May 2021 at 19:01, ткаленко кирилл  wrote:
>>> 
>>> Hi Ilya!
>>> 
>>> Then we can greatly reduce the user load on the cluster until the rebalance 
>>> is over, which can be critical for the user.
>>> 
>>> 04.05.2021, 18:43, "Ilya Kasnacheev" :
>>>> Hello!
>>>> 
>>>> Maybe we can have a mechanic here similar (or equal) to checkpoint based
>>>> write throttling?
>>>> 
>>>> So we will be throttling for both checkpoint page buffer and WAL limit.
>>>> 
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>> 
>>>> Tue, 4 May 2021 at 11:29, ткаленко кирилл :
>>>> 
>>>>> Hello everybody!
>>>>> 
>>>>> At the moment, if there are partitions for which historical
>>>>> rebalance will be used, we reserve segments in the WAL
>>>>> archive (we do not allow cleaning the WAL archive) until the rebalance for
>>>>> all cache groups is over.
>>>>> 
>>>>> If a cluster is under load during the rebalance, WAL archive size may
>>>>> significantly exceed limits set in
>>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>>>> complete. This may lead to user issues and nodes may crash with the "No
>>>>> space left on device" error.
>>>>> 
>>>>> We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>>> (0.5 by default), which sets the threshold (as a fraction of
>>>>> getMaxWalArchiveSize) down to which the WAL archive will be cleared, i.e.
>>>>> the size of the WAL archive that will always remain on the node. I propose
>>>>> to replace this system property with
>>>>> DataStorageConfiguration#getWalArchiveSize in bytes, defaulting to
>>>>> (getMaxWalArchiveSize * 0.5) as it is now.
>>>>> 
>>>>> Main proposal:
>>>>> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>>>> existing WAL segment reservations and do not grant new ones until we reach
>>>>> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
>>>>> segment for historical rebalance, we will automatically switch to full
>>>>> rebalance.
> 



Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-06 Thread Stanislav Lukyanov
+1 to cancel WAL reservation on reaching getMaxWalArchiveSize
+1 to add a public property to replace 
IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE

I don't like the name getWalArchiveSize - I think it's a bit confusing (is it 
the current size? the minimal size? the target size?)
I suggest naming the property getMinWalArchiveSize. I think that this is 
exactly what it is - the minimal size of the archive that we want to have.
The archive size at all times should be between min and max.
If archive size is less than min or more than max then the system functionality 
can degrade (e.g. historical rebalance may not work as expected).
I think these rules are intuitively understood from the "min" and "max" names.

Ilya's suggestion about throttling is great although I'd do this in a different 
ticket.

Thanks,
Stan

> On 5 May 2021, at 19:25, Maxim Muzafarov  wrote:
> 
> Hello, Kirill
> 
> +1 for this change, however, there are too many configuration settings
> that exist for the user to configure an Ignite cluster. It is better to
> keep the options that we already have and fix the behaviour of the
> rebalance process as you suggested.
> 
> On Tue, 4 May 2021 at 19:01, ткаленко кирилл  wrote:
>> 
>> Hi Ilya!
>> 
>> Then we can greatly reduce the user load on the cluster until the rebalance 
>> is over, which can be critical for the user.
>> 
>> 04.05.2021, 18:43, "Ilya Kasnacheev" :
>>> Hello!
>>> 
>>> Maybe we can have a mechanic here similar (or equal) to checkpoint based
>>> write throttling?
>>> 
>>> So we will be throttling for both checkpoint page buffer and WAL limit.
>>> 
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>> 
>>> Tue, 4 May 2021 at 11:29, ткаленко кирилл :
>>> 
>>>> Hello everybody!
>>>> 
>>>> At the moment, if there are partitions for which historical
>>>> rebalance will be used, we reserve segments in the WAL
>>>> archive (we do not allow cleaning the WAL archive) until the rebalance for
>>>> all cache groups is over.
>>>> 
>>>> If a cluster is under load during the rebalance, WAL archive size may
>>>> significantly exceed limits set in
>>>> DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>>> complete. This may lead to user issues and nodes may crash with the "No
>>>> space left on device" error.
>>>> 
>>>> We have a system property, IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE
>>>> (0.5 by default), which sets the threshold (as a fraction of
>>>> getMaxWalArchiveSize) down to which the WAL archive will be cleared, i.e.
>>>> the size of the WAL archive that will always remain on the node. I propose
>>>> to replace this system property with
>>>> DataStorageConfiguration#getWalArchiveSize in bytes, defaulting to
>>>> (getMaxWalArchiveSize * 0.5) as it is now.
>>>> 
>>>> Main proposal:
>>>> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>>> existing WAL segment reservations and do not grant new ones until we reach
>>>> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
>>>> segment for historical rebalance, we will automatically switch to full
>>>> rebalance.
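Under the min/max naming Stan suggests, the Java configuration could look like the fragment below. setMaxWalArchiveSize is an existing DataStorageConfiguration setter; setMinWalArchiveSize is the property being proposed in this thread, so its exact name and signature are an assumption here:

```java
// Config fragment (hypothetical): setMinWalArchiveSize is the proposed
// property, not an API that existed at the time of this thread.
DataStorageConfiguration dsCfg = new DataStorageConfiguration()
    .setMaxWalArchiveSize(2L * 1024 * 1024 * 1024)  // hard cap: 2 GB
    .setMinWalArchiveSize(1L * 1024 * 1024 * 1024); // clean down to 1 GB

IgniteConfiguration cfg = new IgniteConfiguration()
    .setDataStorageConfiguration(dsCfg);
```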



Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-05 Thread Maxim Muzafarov
Hello, Kirill

+1 for this change. However, there are already too many configuration
settings for the user to tune when configuring an Ignite cluster. It is
better to keep the options that we already have and fix the behaviour of the
rebalance process as you suggested.

On Tue, 4 May 2021 at 19:01, ткаленко кирилл  wrote:
>
> Hi Ilya!
>
> That could greatly reduce the user's load on the cluster until the rebalance
> is over, which can be critical for the user.
>
> 04.05.2021, 18:43, "Ilya Kasnacheev" :
> > Hello!
> >
> > Maybe we can have a mechanic here similar (or equal) to checkpoint based
> > write throttling?
> >
> > So we will be throttling for both checkpoint page buffer and WAL limit.
> >
> > Regards,
> > --
> > Ilya Kasnacheev
> >
> > вт, 4 мая 2021 г. в 11:29, ткаленко кирилл :
> >
> >>  Hello everybody!
> >>
> >>  At the moment, if there are partitions for the rebalance for which the
> >>  historical rebalance will be used, then we reserve segments in the WAL
> >>  archive (we do not allow cleaning the WAL archive) until the rebalance for
> >>  all cache groups is over.
> >>
> >>  If a cluster is under load during the rebalance, WAL archive size may
> >>  significantly exceed limits set in
> >>  DataStorageConfiguration#getMaxWalArchiveSize until the process is
> >>  complete. This may lead to user issues and nodes may crash with the "No
> >>  space left on device" error.
> >>
> >>  We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by
> >>  default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize)
> >>  from which and up to which the WAL archive will be cleared, i.e. sets the
> >>  size of the WAL archive that will always be on the node. I propose to
> >>  replace this system property with the
> >>   DataStorageConfiguration#getWalArchiveSize in bytes, the default is
> >>  (getMaxWalArchiveSize * 0.5) as it is now.
> >>
> >>  Main proposal:
> >>  When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
> >>  and do not give the reservation of the WAL segments until we reach
> >>  DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
> >>  segment for historical rebalance, we will automatically switch to full
> >>  rebalance.


Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-04 Thread ткаленко кирилл
Hi Ilya!

That could greatly reduce the user's load on the cluster until the rebalance is
over, which can be critical for the user.

04.05.2021, 18:43, "Ilya Kasnacheev" :
> Hello!
>
> Maybe we can have a mechanic here similar (or equal) to checkpoint based
> write throttling?
>
> So we will be throttling for both checkpoint page buffer and WAL limit.
>
> Regards,
> --
> Ilya Kasnacheev
>
> вт, 4 мая 2021 г. в 11:29, ткаленко кирилл :
>
>>  Hello everybody!
>>
>>  At the moment, if there are partitions for the rebalance for which the
>>  historical rebalance will be used, then we reserve segments in the WAL
>>  archive (we do not allow cleaning the WAL archive) until the rebalance for
>>  all cache groups is over.
>>
>>  If a cluster is under load during the rebalance, WAL archive size may
>>  significantly exceed limits set in
>>  DataStorageConfiguration#getMaxWalArchiveSize until the process is
>>  complete. This may lead to user issues and nodes may crash with the "No
>>  space left on device" error.
>>
>>  We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by
>>  default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize)
>>  from which and up to which the WAL archive will be cleared, i.e. sets the
>>  size of the WAL archive that will always be on the node. I propose to
>>  replace this system property with the
>>   DataStorageConfiguration#getWalArchiveSize in bytes, the default is
>>  (getMaxWalArchiveSize * 0.5) as it is now.
>>
>>  Main proposal:
>>  When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
>>  and do not give the reservation of the WAL segments until we reach
>>  DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
>>  segment for historical rebalance, we will automatically switch to full
>>  rebalance.


Re: Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-04 Thread Ilya Kasnacheev
Hello!

Maybe we can have a mechanism here similar (or equal) to checkpoint-based
write throttling?

So we will be throttling for both checkpoint page buffer and WAL limit.
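For context, checkpoint-based throttling parks writer threads progressively as the checkpoint buffer fills. A toy sketch of that idea (the formula and names below are illustrative assumptions, not Ignite's actual PagesWriteThrottle logic), which could equally be keyed off the WAL archive fill ratio:

```java
// Toy exponential write-throttling model (illustrative, not Ignite's implementation).
public class ThrottleSketch {
    /** Park time grows exponentially as the fill ratio approaches the limit. */
    static long parkNanos(double fillRatio, long baseNanos, double startRatio) {
        if (fillRatio < startRatio)
            return 0; // no throttling while comfortably below the limit

        // Exponent counts how far past the start threshold we are, in 5% steps.
        int steps = (int)((fillRatio - startRatio) / 0.05);

        return baseNanos * (1L << Math.min(steps, 20)); // cap to avoid overflow
    }

    public static void main(String[] args) {
        System.out.println(parkNanos(0.50, 1000, 0.66)); // below threshold: no parking
        System.out.println(parkNanos(0.90, 1000, 0.66)); // 4 steps past threshold: 16x base
    }
}
```

The same curve could be driven by `archiveSize / maxWalArchiveSize` to slow writers down before the hard limit is hit.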

Regards,
-- 
Ilya Kasnacheev


вт, 4 мая 2021 г. в 11:29, ткаленко кирилл :

> Hello everybody!
>
> At the moment, if there are partitions for the rebalance for which the
> historical rebalance will be used, then we reserve segments in the WAL
> archive (we do not allow cleaning the WAL archive) until the rebalance for
> all cache groups is over.
>
> If a cluster is under load during the rebalance, WAL archive size may
> significantly exceed limits set in
> DataStorageConfiguration#getMaxWalArchiveSize until the process is
> complete. This may lead to user issues and nodes may crash with the "No
> space left on device" error.
>
> We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE by
> default 0.5, which sets the threshold (multiplied by getMaxWalArchiveSize)
> from which and up to which the WAL archive will be cleared, i.e. sets the
> size of the WAL archive that will always be on the node. I propose to
> replace this system property with the
>  DataStorageConfiguration#getWalArchiveSize in bytes, the default is
> (getMaxWalArchiveSize * 0.5) as it is now.
>
> Main proposal:
> When the DataStorageConfiguration#getMaxWalArchiveSize is reached, cancel
> and do not give the reservation of the WAL segments until we reach
> DataStorageConfiguration#getWalArchiveSize. In this case, if there is no
> segment for historical rebalance, we will automatically switch to full
> rebalance.
>


Exceeding the DataStorageConfiguration#getMaxWalArchiveSize due to historical rebalance

2021-05-04 Thread ткаленко кирилл
Hello everybody!

At the moment, if there are partitions for which historical rebalance will be
used, we reserve segments in the WAL archive (i.e. we do not allow cleaning the
WAL archive) until the rebalance for all cache groups is over.

If a cluster is under load during the rebalance, the WAL archive size may
significantly exceed the limit set in
DataStorageConfiguration#getMaxWalArchiveSize until the process is complete.
This may lead to user issues, and nodes may crash with the "No space left on
device" error.

We have a system property IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE (0.5 by
default) that sets the threshold (multiplied by getMaxWalArchiveSize) down to
which the WAL archive will be cleared, i.e. the size of the WAL archive that
will always remain on the node. I propose to replace this system property with
DataStorageConfiguration#getWalArchiveSize in bytes, with the default being
(getMaxWalArchiveSize * 0.5) as it is now.

Main proposal:
When the DataStorageConfiguration#getMaxWalArchiveSize limit is reached, cancel
and refuse reservations of WAL segments until we get back down to
DataStorageConfiguration#getWalArchiveSize. In this case, if a segment needed
for historical rebalance is no longer available, we will automatically switch
to full rebalance.
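To make the thresholds concrete, here is a small self-contained sketch of the cleanup arithmetic (plain Java, not Ignite code; segment sizes and the `defaultWalArchiveSize` helper are illustrative assumptions): once the archive exceeds the max size, the oldest segments are removed until the archive shrinks back to the target size.

```java
// Toy model of the WAL archive cleanup policy described above (not Ignite code).
public class WalArchiveModel {
    /** Target size to clear down to; mirrors the proposed getWalArchiveSize default. */
    static long defaultWalArchiveSize(long maxWalArchiveSize) {
        return (long)(maxWalArchiveSize * 0.5); // IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE = 0.5
    }

    /** Removes the oldest segments until the total size <= target; returns the number removed. */
    static int clear(java.util.ArrayDeque<Long> segments, long target) {
        long total = segments.stream().mapToLong(Long::longValue).sum();
        int removed = 0;

        while (total > target && !segments.isEmpty()) {
            total -= segments.pollFirst(); // oldest segment first
            removed++;
        }

        return removed;
    }

    public static void main(String[] args) {
        long max = 2048;                          // pretend 2 GB, expressed in MB
        long target = defaultWalArchiveSize(max); // 1024

        java.util.ArrayDeque<Long> segs = new java.util.ArrayDeque<>();
        for (int i = 0; i < 33; i++)
            segs.add(64L); // 33 * 64 = 2112 MB, i.e. over the max

        int removed = clear(segs, target);
        System.out.println(removed + " segments removed, " + segs.size() + " left");
    }
}
```

Under the proposal, hitting `max` would additionally cancel WAL reservations so that `clear` is never blocked by an in-progress historical rebalance.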


[jira] [Created] (IGNITE-14524) Historical rebalance doesn't work if cache has configured rebalanceDelay

2021-04-12 Thread Dmitry Lazurkin (Jira)
Dmitry Lazurkin created IGNITE-14524:


 Summary: Historical rebalance doesn't work if cache has configured 
rebalanceDelay
 Key: IGNITE-14524
 URL: https://issues.apache.org/jira/browse/IGNITE-14524
 Project: Ignite
  Issue Type: Bug
Affects Versions: 2.10
Reporter: Dmitry Lazurkin


I have a big cache with rebalanceMode = ASYNC and rebalanceDelay =
10_000 ms. Persistence is enabled, maxWalArchiveSize = 10GB. I passed
-DIGNITE_PREFER_WAL_REBALANCE=true and
-DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=1 to Ignite, so the node should use
historical rebalance if there is enough WAL. But it doesn't. After
investigation I found that GridDhtPreloader#generateAssignments always
gets called with exchFut = null, and this method can't set histPartitions
without exchFut. I think the problem is in
GridCachePartitionExchangeManager
(https://github.com/apache/ignite/blob/bc24f6baf3e9b4f98cf98cc5df67fb5deb5ceb6c/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L3486):
it doesn't call generateAssignments without forcePreload when
rebalanceDelay is configured.

Historical rebalance works after removing rebalanceDelay.
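The reported setup roughly corresponds to a configuration like the following sketch (cache name and value type are illustrative; the setters shown are real Ignite APIs, and the JVM flags are the ones quoted above):

```java
// Sketch of the reported setup. JVM flags:
//   -DIGNITE_PREFER_WAL_REBALANCE=true -DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=1
CacheConfiguration<Long, byte[]> ccfg = new CacheConfiguration<>("bigCache");
ccfg.setRebalanceMode(CacheRebalanceMode.ASYNC);
ccfg.setRebalanceDelay(10_000); // removing this line makes historical rebalance work

DataStorageConfiguration dsCfg = new DataStorageConfiguration()
    .setMaxWalArchiveSize(10L * 1024 * 1024 * 1024); // 10 GB
dsCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
```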



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-14138) Historical rebalance kills cluster

2021-02-08 Thread Vladislav Pyatkov (Jira)
Vladislav Pyatkov created IGNITE-14138:
--

 Summary: Historical rebalance kills cluster
 Key: IGNITE-14138
 URL: https://issues.apache.org/jira/browse/IGNITE-14138
 Project: Ignite
  Issue Type: Bug
Reporter: Vladislav Pyatkov


{noformat}
[2021-01-12T05:11:02,142][ERROR][rebalance-#508%---%][] Critical system error 
detected. Will be handled accordingly to configured handler 
[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
o.a.i.IgniteCheckedException: Failed to continue supplying [grp=SQL_USAGES_EPE, 
demander=48254935-7aa9-4ab5-b398-fdaec334fab7, topVer=AffinityTopologyVersion 
[topVer=3, minorTopVer=1
org.apache.ignite.IgniteCheckedException: Failed to continue supplying 
[grp=SQL_1, demander=48254935-7aa9-4ab5-b398-fdaec334fab7, 
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1]]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:571)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:398)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:489)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:474)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$800(GridCacheIoManager.java:109)
 [ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1707)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4300(GridIoManager.java:157)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:3011)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1662)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager.access$4900(GridIoManager.java:157)
 [ignite-core.jar]
at 
org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1629)
 [ignite-core.jar]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.apache.ignite.IgniteCheckedException: Could not find start 
pointer for partition [part=4, partCntrSince=1115]
at 
org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointHistory.searchEarliestWalPointer(CheckpointHistory.java:557)
 ~[ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.historicalIterator(GridCacheOffheapManager.java:1121)
 ~[ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.rebalanceIterator(IgniteCacheOffheapManagerImpl.java:1195)
 ~[ignite-core.jar]
at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:322)
 ~[ignite-core.jar]
... 16 more
{noformat}
I believe that it should throw IgniteHistoricalIteratorException instead of
IgniteCheckedException, so it can be properly handled and the rebalance can
fall back to full rebalance instead of killing nodes.
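The suggested fix can be illustrated with a toy model (IgniteHistoricalIteratorException is the real class name from the report; everything else below is a simplified assumption, not Ignite internals): a dedicated, recoverable exception type lets the supplier fall back to full rebalance rather than trip the critical failure handler.

```java
// Toy model of the proposed handling (names simplified; not Ignite internals).
public class RebalanceFallback {
    static class HistoricalIteratorException extends RuntimeException {
        HistoricalIteratorException(String msg) { super(msg); }
    }

    /** Simulates building a historical iterator; fails when WAL history is gone. */
    static String rebalance(boolean walHistoryAvailable) {
        try {
            if (!walHistoryAvailable)
                throw new HistoricalIteratorException("Could not find start pointer for partition");

            return "HISTORICAL";
        }
        catch (HistoricalIteratorException e) {
            // Recoverable: switch to full rebalance instead of invoking the failure handler.
            return "FULL";
        }
    }

    public static void main(String[] args) {
        System.out.println(rebalance(true));  // WAL history present
        System.out.println(rebalance(false)); // WAL history gone: fall back
    }
}
```

With a generic IgniteCheckedException, the same failure instead reaches StopNodeOrHaltFailureHandler, as seen in the stack trace above.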



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Choosing historical rebalance heuristics

2020-07-17 Thread Alexei Scherbakov
Hi.

The proposed approach to improve rebalancing efficiency looks a little bit
naive to me.

A mature solution to a problem should include a computation of a "cost" for
each rebalancing type.

The cost formula should take into consideration various aspects, like
number of pages in the memory cache, page replacement, a disk read speed, a
number of indexes, etc.

For a short term solution the proposed heuristic looks acceptable.

My suggestions for an implementation:

1. Implement the cost framework right now.
The rebalance cost is measured as a sum of all costs for each rebalancing
group. This sum should be minimized.
Rebalancing behavior should be dependent only on cost formula outcome,
without the need of modifying other code.

2. Avoid unnecessary WAL scanning.
Ideally we should scan only segments containing required updates.
We can build for each partition a tracking data structure and adjust it on
each checkpoint and WAL archive removal, then use it for efficient scanning
during WAL iteration.

3. Avoid any calculations during the PME sync phase. The coordinator should
only collect available WAL history, switch lagging OWNING partitions to
MOVING, and send all this to the nodes.
After PME nodes should calculate a cost and choose the most efficient
rebalancing for local data.

4. We should log a reason for choosing a specific rebalancing type due to a
cost.

I'm planning to implement tombstones in the near future. This should remove
the requirement for clearing a partition before full rebalancing in most
cases, improving its cost.

This can be further enhanced later using some kind of merkle trees, file
based rebalancing or something else.
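A minimal sketch of such a cost framework (the interface, the cost formulas, and the numbers are illustrative assumptions, not a concrete design): each rebalance type reports an estimated cost per group, and the cheapest one is chosen.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative cost framework: pick the rebalance type with the lowest estimated cost.
public class RebalanceCostSketch {
    interface CostEstimator {
        String type();

        /** Estimated cost for one group, e.g. units of data to read/transfer. */
        long cost(long partRows, long walUpdates);
    }

    // Full rebalance cost ~ rows to transfer; historical ~ WAL updates to replay.
    // A real formula would also weigh page-cache state, disk speed, index count, etc.
    static final CostEstimator FULL = new CostEstimator() {
        public String type() { return "FULL"; }
        public long cost(long partRows, long walUpdates) { return partRows; }
    };

    static final CostEstimator HISTORICAL = new CostEstimator() {
        public String type() { return "HISTORICAL"; }
        public long cost(long partRows, long walUpdates) { return walUpdates; }
    };

    static String choose(long partRows, long walUpdates) {
        return List.of(FULL, HISTORICAL).stream()
            .min(Comparator.comparingLong(e -> e.cost(partRows, walUpdates)))
            .get().type();
    }

    public static void main(String[] args) {
        System.out.println(choose(1_000_000, 10_000)); // few updates: historical wins
        System.out.println(choose(1_000, 500_000));    // small partition: full wins
    }
}
```

The point of the framework is the last suggestion above: the chosen type, and the costs that led to it, can be logged in one place.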

чт, 16 июл. 2020 г. в 18:33, Ivan Rakov :

> >
> >  I think we can modify the heuristic so
> > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> > reduce it to 500)
> > 2) Select only that partition for historical rebalance where difference
> > between counters less that partition size.
>
> Agreed, let's go this way.
>
> On Thu, Jul 16, 2020 at 11:03 AM Vladislav Pyatkov 
> wrote:
>
> > I completely forget about another promise to favor of using historical
> > rebalance where it is possible. When cluster decided to use a full
> balance,
> > demander nodes should clear not empty partitions.
> > This can to consume a long time, in some cases that may be compared with
> a
> > time of rebalance.
> > It also accepts a side of heuristics above.
> >
> > On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov  >
> > wrote:
> >
> > > Ivan,
> > >
> > > I agree with a combined approach: threshold for small partitions and
> > count
> > > of update for partition that outgrew it.
> > > This helps to avoid partitions that update not frequently.
> > >
> > > Reading of a big WAL piece (more than 100Gb) it can happen, when a
> client
> > > configured it intentionally.
> > > There are no doubts we can to read it, otherwise WAL space was not
> > > configured that too large.
> > >
> > > I don't see a connection optimization of iterator and issue in atomic
> > > protocol.
> > > Reordering in WAL, that happened in checkpoint where counter was not
> > > changing, is an extremely rare case and the issue will not solve for
> > > generic case, this should be fixed in bound of protocol.
> > >
> > > I think we can modify the heuristic so
> > > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD
> -
> > > reduce it to 500)
> > > 2) Select only that partition for historical rebalance where difference
> > > between counters less that partition size.
> > >
> > > Also implement mentioned optimization for historical iterator, that may
> > > reduce a time on reading large WAL interval.
> > >
> > > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov 
> > wrote:
> > >
> > >> Hi Vladislav,
> > >>
> > >> Thanks for raising this topic.
> > >> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is
> > 500_000)
> > >> is controversial. Assuming that the default number of partitions is
> > 1024,
> > >> cache should contain a really huge amount of data in order to make WAL
> > >> delta rebalancing possible. In fact, it's currently disabled for most
> > >> production cases, which makes rebalancing of persistent caches
> > >> unreasonably
> > >> long.
> > >>
> > >> I think, your approach [1] makes much more sense than the current
> > >> heuristic, let's move forward with the proposed solution.

Re: Choosing historical rebalance heuristics

2020-07-16 Thread Ivan Rakov
>
>  I think we can modify the heuristic so
> 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> reduce it to 500)
> 2) Select only that partition for historical rebalance where difference
> between counters less that partition size.

Agreed, let's go this way.

On Thu, Jul 16, 2020 at 11:03 AM Vladislav Pyatkov 
wrote:

> I completely forget about another promise to favor of using historical
> rebalance where it is possible. When cluster decided to use a full balance,
> demander nodes should clear not empty partitions.
> This can to consume a long time, in some cases that may be compared with a
> time of rebalance.
> It also accepts a side of heuristics above.
>
> On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov 
> wrote:
>
> > Ivan,
> >
> > I agree with a combined approach: threshold for small partitions and
> count
> > of update for partition that outgrew it.
> > This helps to avoid partitions that update not frequently.
> >
> > Reading of a big WAL piece (more than 100Gb) it can happen, when a client
> > configured it intentionally.
> > There are no doubts we can to read it, otherwise WAL space was not
> > configured that too large.
> >
> > I don't see a connection optimization of iterator and issue in atomic
> > protocol.
> > Reordering in WAL, that happened in checkpoint where counter was not
> > changing, is an extremely rare case and the issue will not solve for
> > generic case, this should be fixed in bound of protocol.
> >
> > I think we can modify the heuristic so
> > 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> > reduce it to 500)
> > 2) Select only that partition for historical rebalance where difference
> > between counters less that partition size.
> >
> > Also implement mentioned optimization for historical iterator, that may
> > reduce a time on reading large WAL interval.
> >
> > On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov 
> wrote:
> >
> >> Hi Vladislav,
> >>
> >> Thanks for raising this topic.
> >> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is
> 500_000)
> >> is controversial. Assuming that the default number of partitions is
> 1024,
> >> cache should contain a really huge amount of data in order to make WAL
> >> delta rebalancing possible. In fact, it's currently disabled for most
> >> production cases, which makes rebalancing of persistent caches
> >> unreasonably
> >> long.
> >>
> >> I think, your approach [1] makes much more sense than the current
> >> heuristic, let's move forward with the proposed solution.
> >>
> >> Though, there are some other corner cases, e.g. this one:
> >> - Configured size of WAL archive is big (>100 GB)
> >> - Cache has small partitions (e.g. 1000 entries)
> >> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
> >> - There is another cache with very frequent updates which allocate >99%
> of
> >> WAL
> >> In such scenario we may need to iterate over >100 GB of WAL in order to
> >> fetch <1% of needed updates. Even though the amount of network traffic
> is
> >> still optimized, it would be more effective to transfer partitions with
> >> ~1000 entries fully instead of reading >100 GB of WAL.
> >>
> >> I want to highlight that your heuristic definitely makes the situation
> >> better, but due to possible corner cases we should keep the fallback
> lever
> >> to restrict or limit historical rebalance as before. Probably, it would
> be
> >> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
> >> default value (1000, 500 or even 0) and apply your heuristic only for
> >> partitions with bigger size.
> >>
> >> Regarding case [2]: it looks like an improvement that can mitigate some
> >> corner cases (including the one that I have described). I'm ok with it
> as
> >> long as it takes data updates reordering on backup nodes into account.
> We
> >> don't track skipped updates for atomic caches. As a result, detection of
> >> the absence of updates between two checkpoint markers with the same
> >> partition counter can be false positive.
> >>
> >> --
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov  >
> >> wrote:
> >>
> >> > Hi guys,
> >> >
> >> > I want to implement a more honest heuristic for historical re

Re: Choosing historical rebalance heuristics

2020-07-16 Thread Vladislav Pyatkov
I completely forgot about another argument in favor of using historical
rebalance where it is possible. When the cluster decides to use a full
rebalance, demander nodes must clear non-empty partitions.
This can take a long time, in some cases comparable to the time of the
rebalance itself.
This also speaks in favor of the heuristics above.

On Thu, Jul 16, 2020 at 12:09 AM Vladislav Pyatkov 
wrote:

> Ivan,
>
> I agree with a combined approach: threshold for small partitions and count
> of update for partition that outgrew it.
> This helps to avoid partitions that update not frequently.
>
> Reading of a big WAL piece (more than 100Gb) it can happen, when a client
> configured it intentionally.
> There are no doubts we can to read it, otherwise WAL space was not
> configured that too large.
>
> I don't see a connection optimization of iterator and issue in atomic
> protocol.
> Reordering in WAL, that happened in checkpoint where counter was not
> changing, is an extremely rare case and the issue will not solve for
> generic case, this should be fixed in bound of protocol.
>
> I think we can modify the heuristic so
> 1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
> reduce it to 500)
> 2) Select only that partition for historical rebalance where difference
> between counters less that partition size.
>
> Also implement mentioned optimization for historical iterator, that may
> reduce a time on reading large WAL interval.
>
> On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov  wrote:
>
>> Hi Vladislav,
>>
>> Thanks for raising this topic.
>> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000)
>> is controversial. Assuming that the default number of partitions is 1024,
>> cache should contain a really huge amount of data in order to make WAL
>> delta rebalancing possible. In fact, it's currently disabled for most
>> production cases, which makes rebalancing of persistent caches
>> unreasonably
>> long.
>>
>> I think, your approach [1] makes much more sense than the current
>> heuristic, let's move forward with the proposed solution.
>>
>> Though, there are some other corner cases, e.g. this one:
>> - Configured size of WAL archive is big (>100 GB)
>> - Cache has small partitions (e.g. 1000 entries)
>> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
>> - There is another cache with very frequent updates which allocate >99% of
>> WAL
>> In such scenario we may need to iterate over >100 GB of WAL in order to
>> fetch <1% of needed updates. Even though the amount of network traffic is
>> still optimized, it would be more effective to transfer partitions with
>> ~1000 entries fully instead of reading >100 GB of WAL.
>>
>> I want to highlight that your heuristic definitely makes the situation
>> better, but due to possible corner cases we should keep the fallback lever
>> to restrict or limit historical rebalance as before. Probably, it would be
>> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
>> default value (1000, 500 or even 0) and apply your heuristic only for
>> partitions with bigger size.
>>
>> Regarding case [2]: it looks like an improvement that can mitigate some
>> corner cases (including the one that I have described). I'm ok with it as
>> long as it takes data updates reordering on backup nodes into account. We
>> don't track skipped updates for atomic caches. As a result, detection of
>> the absence of updates between two checkpoint markers with the same
>> partition counter can be false positive.
>>
>> --
>> Best Regards,
>> Ivan Rakov
>>
>> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov 
>> wrote:
>>
>> > Hi guys,
>> >
>> > I want to implement a more honest heuristic for historical rebalance.
>> > Before, a cluster makes a choice between the historical rebalance or
>> not it
>> > only from a partition size. This threshold more known by a name of
>> property
>> > IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
>> > It might prevent a historical rebalance when a partition is too small,
>> but
>> > not if WAL contains more updates than a size of partition, historical
>> > rebalance still can be chosen.
>> > There is a ticket where need to implement more fair heuristic[1].
>> >
>> > My idea for implementation is need to estimate a size of data which
>> will be
>> > transferred owe network. In other word if need to rebalance a part of
>> WAL
>> > that contains N updates, for recover a partition on another node,

Re: Choosing historical rebalance heuristics

2020-07-15 Thread Vladislav Pyatkov
Ivan,

I agree with a combined approach: a threshold for small partitions and an
update count for partitions that outgrow it.
This helps to avoid partitions that are updated infrequently.

Reading a big piece of WAL (more than 100 GB) can happen when a client has
intentionally configured it that way.
There is no doubt that we can read it; otherwise the WAL space would not have
been configured to be that large.

I don't see a connection between the iterator optimization and the issue in
the atomic protocol.
Reordering in WAL that happened at a checkpoint where the counter was not
changing is an extremely rare case, and the issue will not be solved for the
generic case; it should be fixed within the bounds of the protocol.

I think we can modify the heuristic so:
1) Exclude partitions by threshold (IGNITE_PDS_WAL_REBALANCE_THRESHOLD -
reduce it to 500)
2) Select for historical rebalance only those partitions where the difference
between counters is less than the partition size.

Also implement the mentioned optimization for the historical iterator, which
may reduce the time spent reading a large WAL interval.
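The two conditions combine into a simple predicate (a sketch; the method name is illustrative, and the constant 500 follows the proposal above):

```java
// Sketch of the combined heuristic for choosing historical rebalance.
public class HistoricalHeuristic {
    static final long WAL_REBALANCE_THRESHOLD = 500; // proposed reduced default

    /**
     * @param partSize current number of rows in the partition on the supplier.
     * @param cntrDiff update-counter difference between supplier and demander.
     */
    static boolean useHistorical(long partSize, long cntrDiff) {
        return partSize >= WAL_REBALANCE_THRESHOLD // 1) skip tiny partitions
            && cntrDiff < partSize;                // 2) fewer WAL updates than rows
    }

    public static void main(String[] args) {
        System.out.println(useHistorical(100, 10));        // tiny partition: full
        System.out.println(useHistorical(10_000, 200));    // few updates: historical
        System.out.println(useHistorical(10_000, 50_000)); // more updates than rows: full
    }
}
```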

On Wed, Jul 15, 2020 at 3:15 PM Ivan Rakov  wrote:

> Hi Vladislav,
>
> Thanks for raising this topic.
> Currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000)
> is controversial. Assuming that the default number of partitions is 1024,
> cache should contain a really huge amount of data in order to make WAL
> delta rebalancing possible. In fact, it's currently disabled for most
> production cases, which makes rebalancing of persistent caches unreasonably
> long.
>
> I think, your approach [1] makes much more sense than the current
> heuristic, let's move forward with the proposed solution.
>
> Though, there are some other corner cases, e.g. this one:
> - Configured size of WAL archive is big (>100 GB)
> - Cache has small partitions (e.g. 1000 entries)
> - Infrequent updates (e.g. ~100 in the whole WAL history of any node)
> - There is another cache with very frequent updates which allocate >99% of
> WAL
> In such scenario we may need to iterate over >100 GB of WAL in order to
> fetch <1% of needed updates. Even though the amount of network traffic is
> still optimized, it would be more effective to transfer partitions with
> ~1000 entries fully instead of reading >100 GB of WAL.
>
> I want to highlight that your heuristic definitely makes the situation
> better, but due to possible corner cases we should keep the fallback lever
> to restrict or limit historical rebalance as before. Probably, it would be
> handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
> default value (1000, 500 or even 0) and apply your heuristic only for
> partitions with bigger size.
>
> Regarding case [2]: it looks like an improvement that can mitigate some
> corner cases (including the one that I have described). I'm ok with it as
> long as it takes data updates reordering on backup nodes into account. We
> don't track skipped updates for atomic caches. As a result, detection of
> the absence of updates between two checkpoint markers with the same
> partition counter can be false positive.
>
> --
> Best Regards,
> Ivan Rakov
>
> On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov 
> wrote:
>
> > Hi guys,
> >
> > I want to implement a more honest heuristic for historical rebalance.
> > Before, a cluster makes a choice between the historical rebalance or not
> it
> > only from a partition size. This threshold more known by a name of
> property
> > IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
> > It might prevent a historical rebalance when a partition is too small,
> but
> > not if WAL contains more updates than a size of partition, historical
> > rebalance still can be chosen.
> > There is a ticket where need to implement more fair heuristic[1].
> >
> > My idea for implementation is need to estimate a size of data which will
> be
> > transferred owe network. In other word if need to rebalance a part of WAL
> > that contains N updates, for recover a partition on another node, which
> > have to contain M rows at all, need chooses a historical rebalance on the
> > case where N < M (WAL history should be presented as well).
> >
> > This approach is easy implemented, because a coordinator node has the
> size
> > of partitions and counters' interval. But in this case cluster still can
> > find not many updates in too long WAL history. I assume a possibility to
> > work around it, if rebalance historical iterator will not handle
> > checkpoints where not contains updates of particular cache. Checkpoints
> can
> > skip if counters for the cache (maybe even a specific partitions) was not
> > changed between it and next one.
> >
> > Ticket for improvement rebalance historical iterator[2]
> >
> > I want to hear a view of community on the thought above.
> > Maybe anyone has another opinion?
> >
> > [1]: https://issues.apache.org/jira/browse/IGNITE-13253
> > [2]: https://issues.apache.org/jira/browse/IGNITE-13254
> >
> > --
> > Vladislav Pyatkov
> >
>


-- 
Vladislav Pyatkov


Re: Choosing historical rebalance heuristics

2020-07-15 Thread Ivan Rakov
Hi Vladislav,

Thanks for raising this topic.
The currently present IGNITE_PDS_WAL_REBALANCE_THRESHOLD (default is 500_000)
is controversial. Assuming that the default number of partitions is 1024, a
cache should contain a really huge amount of data in order to make WAL
delta rebalancing possible. In fact, it's currently disabled for most
production cases, which makes rebalancing of persistent caches unreasonably
long.

I think your approach [1] makes much more sense than the current
heuristic; let's move forward with the proposed solution.

Though, there are some other corner cases, e.g. this one:
- Configured size of WAL archive is big (>100 GB)
- Cache has small partitions (e.g. 1000 entries)
- Infrequent updates (e.g. ~100 in the whole WAL history of any node)
- There is another cache with very frequent updates which allocate >99% of
WAL
In such a scenario we may need to iterate over >100 GB of WAL in order to
fetch <1% of needed updates. Even though the amount of network traffic is
still optimized, it would be more effective to transfer partitions with
~1000 entries fully instead of reading >100 GB of WAL.

I want to highlight that your heuristic definitely makes the situation
better, but due to possible corner cases we should keep the fallback lever
to restrict or limit historical rebalance as before. Probably, it would be
handy to keep IGNITE_PDS_WAL_REBALANCE_THRESHOLD property with a low
default value (1000, 500 or even 0) and apply your heuristic only for
partitions with bigger size.
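Combining the low threshold with the N < M estimate could look roughly like this. This is a sketch only; the class, method, and constant names are illustrative, not Ignite's actual implementation:

```java
/** Illustrative combination of a low size threshold with the proposed N < M heuristic. */
public final class CombinedHeuristic {
    /** Fallback lever analogous to IGNITE_PDS_WAL_REBALANCE_THRESHOLD, with a low default. */
    static final long WAL_REBALANCE_THRESHOLD = 500;

    public static boolean useHistorical(long partRows, long walUpdates, boolean histReserved) {
        // Very small partitions are always rebalanced fully: scanning a large
        // WAL archive for a handful of updates is not worth it.
        if (partRows < WAL_REBALANCE_THRESHOLD)
            return false;

        // For bigger partitions, apply the proposed N < M estimate.
        return histReserved && walUpdates < partRows;
    }
}
```

This keeps the fallback lever: setting the threshold high effectively restores the old behaviour, while a low value lets the heuristic decide for everything but tiny partitions.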

Regarding case [2]: it looks like an improvement that can mitigate some
corner cases (including the one that I have described). I'm ok with it as
long as it takes data update reordering on backup nodes into account. We
don't track skipped updates for atomic caches; as a result, detection of the
absence of updates between two checkpoint markers with the same partition
counter can produce false positives.

--
Best Regards,
Ivan Rakov

On Tue, Jul 14, 2020 at 3:03 PM Vladislav Pyatkov 
wrote:

> Hi guys,
>
> I want to implement a more honest heuristic for historical rebalance.
> Currently, the cluster chooses between historical and full rebalance based
> only on the partition size. This threshold is better known by the name of
> the property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
> It can prevent a historical rebalance when a partition is too small, but if
> the WAL contains more updates than the partition has rows, historical
> rebalance can still be chosen.
> There is a ticket that asks for a fairer heuristic [1].
>
> My idea for the implementation is to estimate the amount of data that will
> be transferred over the network. In other words, if rebalancing a part of
> the WAL would require replaying N updates to recover a partition that
> contains M rows in total, we should choose historical rebalance only when
> N < M (and the required WAL history must be available as well).
>
> This approach is easy to implement, because the coordinator node already
> knows the partition sizes and the counter intervals. But the cluster may
> still find only a few updates scattered across a very long WAL history. I
> assume we can work around that if the historical rebalance iterator skips
> checkpoints that contain no updates for the particular cache. A checkpoint
> can be skipped if the counters for the cache (maybe even for specific
> partitions) did not change between it and the next one.
>
> There is also a ticket for improving the historical rebalance iterator [2].
>
> I would like to hear the community's view on the thoughts above.
> Maybe someone has another opinion?
>
> [1]: https://issues.apache.org/jira/browse/IGNITE-13253
> [2]: https://issues.apache.org/jira/browse/IGNITE-13254
>
> --
> Vladislav Pyatkov
>


Choosing historical rebalance heuristics

2020-07-14 Thread Vladislav Pyatkov
Hi guys,

I want to implement a more honest heuristic for historical rebalance.
Currently, the cluster chooses between historical and full rebalance based
only on the partition size. This threshold is better known by the name of the
property IGNITE_PDS_WAL_REBALANCE_THRESHOLD.
It can prevent a historical rebalance when a partition is too small, but if
the WAL contains more updates than the partition has rows, historical
rebalance can still be chosen.
There is a ticket that asks for a fairer heuristic [1].

My idea for the implementation is to estimate the amount of data that will be
transferred over the network. In other words, if rebalancing a part of the
WAL would require replaying N updates to recover a partition that contains M
rows in total, we should choose historical rebalance only when N < M (and the
required WAL history must be available as well).

This approach is easy to implement, because the coordinator node already
knows the partition sizes and the counter intervals. But the cluster may
still find only a few updates scattered across a very long WAL history. I
assume we can work around that if the historical rebalance iterator skips
checkpoints that contain no updates for the particular cache. A checkpoint
can be skipped if the counters for the cache (maybe even for specific
partitions) did not change between it and the next one.

There is also a ticket for improving the historical rebalance iterator [2].

I would like to hear the community's view on the thoughts above.
Maybe someone has another opinion?

[1]: https://issues.apache.org/jira/browse/IGNITE-13253
[2]: https://issues.apache.org/jira/browse/IGNITE-13254
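The decision rule described above can be written down compactly. A minimal sketch, assuming hypothetical names (this is not Ignite's actual code; N would come from the partition counter interval and M from the partition size known to the coordinator):

```java
/** Illustrative sketch of the proposed N < M heuristic; all names are hypothetical. */
public final class HistoricalRebalanceHeuristic {
    /**
     * @param walUpdates N: number of updates that would be replayed from the WAL
     *                   (upper counter bound minus lower counter bound).
     * @param partRows M: total number of rows in the partition on the supplier.
     * @param histReserved Whether the required WAL history is reserved on the supplier.
     * @return {@code true} if historical (WAL) rebalance should be chosen over full rebalance.
     */
    public static boolean useHistorical(long walUpdates, long partRows, boolean histReserved) {
        // Replaying the WAL delta is only worthwhile when it transfers fewer
        // records over the network than sending the whole partition would.
        return histReserved && walUpdates < partRows;
    }
}
```

With this rule, a partition that received 100 updates but holds 1,000,000 rows is rebalanced historically, while a partition whose WAL delta exceeds its own size is transferred in full.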

-- 
Vladislav Pyatkov


[jira] [Created] (IGNITE-13254) Historical rebalance iterator may skip a checkpoint if it contains no updates

2020-07-14 Thread Vladislav Pyatkov (Jira)
Vladislav Pyatkov created IGNITE-13254:
--

 Summary: Historical rebalance iterator may skip a checkpoint if it 
contains no updates
 Key: IGNITE-13254
 URL: https://issues.apache.org/jira/browse/IGNITE-13254
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladislav Pyatkov






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (IGNITE-13253) Advanced heuristics for historical rebalance

2020-07-14 Thread Vladislav Pyatkov (Jira)
Vladislav Pyatkov created IGNITE-13253:
--

 Summary: Advanced heuristics for historical rebalance
 Key: IGNITE-13253
 URL: https://issues.apache.org/jira/browse/IGNITE-13253
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladislav Pyatkov


Currently, the cluster decides which partitions will not be rebalanced by 
history based only on their size. This threshold can be set through the system 
property IGNITE_PDS_WAL_REBALANCE_THRESHOLD. But deciding which partitions will 
be rebalanced via the WAL only by their size is not fair: the WAL can contain 
far more records than the partition holds (many updates of the same key), so 
historical rebalance may transfer more data over the network than a full 
rebalance would.
We need to implement a heuristic that estimates the size of the data to be 
transferred.





[jira] [Created] (IGNITE-13168) Retrigger historical rebalance if it was cancelled in case WAL history is still available

2020-06-19 Thread Vladislav Pyatkov (Jira)
Vladislav Pyatkov created IGNITE-13168:
--

 Summary: Retrigger historical rebalance if it was cancelled in 
case WAL history is still available
 Key: IGNITE-13168
 URL: https://issues.apache.org/jira/browse/IGNITE-13168
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladislav Pyatkov


If historical rebalance is cancelled, full rebalance will be unconditionally 
triggered on the PME that caused the cancellation (only outdated OWNING 
partitions can be rebalanced by history in the current implementation).
We have to allow MOVING partitions to be historically rebalanced as well.





[jira] [Created] (IGNITE-12935) Disadvantages in log of historical rebalance

2020-04-24 Thread Vladislav Pyatkov (Jira)
Vladislav Pyatkov created IGNITE-12935:
--

 Summary: Disadvantages in log of historical rebalance
 Key: IGNITE-12935
 URL: https://issues.apache.org/jira/browse/IGNITE-12935
 Project: Ignite
  Issue Type: Improvement
Reporter: Vladislav Pyatkov


# Mention in the log only partitions for which there are no nodes that suit as 
a historical supplier.
 For these partitions, print the minimal counter (starting from which we should 
perform historical rebalancing) with the corresponding node and the maximum 
reserved counter (starting from which the cluster can perform historical 
rebalancing) with the corresponding node.
This will let us know:
 ## Whether history was reserved at all
 ## How much reserved history we lack to perform a historical rebalancing
 ## I see resulting output like this:
Historical rebalancing wasn't scheduled for some partitions:
 History wasn't reserved for: [list of partitions and groups]
 History was reserved, but minimum present counter is less than maximum 
reserved: [[grp=GRP, part=ID, minCntr=cntr, minNodeId=ID, maxReserved=cntr, 
maxReservedNodeId=ID], ...]
 ## We can also aggregate the previous message by (minNodeId) to easily find 
the exact node (or nodes) that caused the full rebalance.
 # Log results of reserveHistoryForExchange(). They can be compactly 
represented as mappings: (grpId -> checkpoint (id, timestamp)). For every 
group, also log a message about why the previous checkpoint wasn't successfully 
reserved.
There can be three reasons:
 ## Previous checkpoint simply isn't present in the history (the oldest is 
reserved)
 ## WAL reservation failure (call below returned false)

{code:java}
chpEntry = entry(cpTs);

boolean reserved = cctx.wal().reserve(chpEntry.checkpointMark());

// If checkpoint WAL history can't be reserved, stop searching.
if (!reserved)
    break;
{code}

 ## Checkpoint was marked as inapplicable for historical rebalancing

{code:java}
for (Integer grpId : new HashSet<>(groupsAndPartitions.keySet()))
    if (!isCheckpointApplicableForGroup(grpId, chpEntry))
        groupsAndPartitions.remove(grpId);
{code}





[jira] [Created] (IGNITE-12495) Make historical rebalance working at the per-partition level

2019-12-25 Thread Maxim Muzafarov (Jira)
Maxim Muzafarov created IGNITE-12495:


 Summary: Make historical rebalance working at the per-partition level
 Key: IGNITE-12495
 URL: https://issues.apache.org/jira/browse/IGNITE-12495
 Project: Ignite
  Issue Type: Sub-task
Reporter: Maxim Muzafarov


Currently, historical rebalance runs after all cache group partition files have 
been preloaded and initialized.
To make file rebalance faster, historical rebalance can be started at the 
per-partition level right after each cache group partition file has been loaded 
and initialized.





[jira] [Created] (IGNITE-12429) Rework bytes-based WAL archive size management logic to make historical rebalance more predictable

2019-12-09 Thread Ivan Rakov (Jira)
Ivan Rakov created IGNITE-12429:
---

 Summary: Rework bytes-based WAL archive size management logic to 
make historical rebalance more predictable
 Key: IGNITE-12429
 URL: https://issues.apache.org/jira/browse/IGNITE-12429
 Project: Ignite
  Issue Type: Improvement
Reporter: Ivan Rakov


Since 2.7, DataStorageConfiguration allows specifying the size of the WAL 
archive in bytes (see DataStorageConfiguration#maxWalArchiveSize), which is 
much more transparent to the user.
Unfortunately, the new logic may be unpredictable when it comes to historical 
rebalance. The WAL archive is truncated when one of the following conditions 
occurs:
1. The total number of checkpoints in the WAL archive is bigger than 
DataStorageConfiguration#walHistSize
2. The total size of the WAL archive is bigger than 
DataStorageConfiguration#maxWalArchiveSize
Independently, the in-memory checkpoint history contains only a fixed number of 
the last checkpoints (can be changed with 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE, 100 by default).
All these particular qualities make it hard for the user to control the usage 
of historical rebalance. Imagine the case when the user has a light load (the 
WAL gets rotated very slowly) and the default checkpoint frequency. After 
100 * 3 = 300 minutes, no updates in the WAL can be received via historical 
rebalance anymore, even if:
1. The user has configured a large DataStorageConfiguration#maxWalArchiveSize
2. The user has configured a large DataStorageConfiguration#walHistSize
At the same time, setting a large IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE 
will help (only combined with the previous two points), but Ignite node heap 
usage may increase dramatically.
I propose to change the WAL history management logic in the following way:
1. *Don't* cut the WAL archive when the number of checkpoints exceeds 
DataStorageConfiguration#walHistSize. WAL history should be managed based only 
on DataStorageConfiguration#maxWalArchiveSize.
2. The checkpoint history should contain a fixed number of entries but should 
cover the whole stored WAL archive (not only its most recent part with the 
IGNITE_PDS_MAX_CHECKPOINT_MEMORY_HISTORY_SIZE last checkpoints). This can be 
achieved by making the checkpoint history sparse: some intermediate checkpoints 
*may be absent from the history*, while the fixed number of retained 
checkpoints is positioned either uniformly (trying to keep a fixed number of 
bytes between two neighbouring checkpoints) or exponentially (trying to keep a 
fixed ratio between the size of the WAL from checkpoint(N-1) to the current 
write pointer and the size of the WAL from checkpoint(N) to the current write 
pointer).
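The uniform variant of point 2 can be sketched as follows. This is illustrative code only, not the actual Ignite implementation; a real version would operate on checkpoint entries rather than raw offsets:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of thinning the checkpoint history to a fixed, uniformly spaced subset. */
public final class SparseCheckpointHistory {
    /**
     * Thins an ascending list of checkpoint positions down to at most
     * {@code maxEntries} entries spread evenly over the whole archive,
     * always keeping the oldest and the newest checkpoint.
     */
    public static List<Long> thinUniform(List<Long> cpOffsets, int maxEntries) {
        int n = cpOffsets.size();

        if (n == 0 || maxEntries <= 0)
            return Collections.emptyList();

        if (maxEntries == 1)
            return Collections.singletonList(cpOffsets.get(n - 1));

        if (n <= maxEntries)
            return new ArrayList<>(cpOffsets);

        List<Long> kept = new ArrayList<>(maxEntries);

        for (int i = 0; i < maxEntries; i++) {
            // Evenly spaced indices: i = 0 keeps the oldest checkpoint,
            // i = maxEntries - 1 keeps the newest one.
            int idx = (int)Math.round((double)i * (n - 1) / (maxEntries - 1));

            kept.add(cpOffsets.get(idx));
        }

        return kept;
    }
}
```

The exponential variant would instead pick indices so that the WAL distance from each kept checkpoint to the current write pointer shrinks by a fixed ratio.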





[jira] [Created] (IGNITE-12117) Historical rebalance should not be processed in striped way

2019-08-28 Thread Anton Vinogradov (Jira)
Anton Vinogradov created IGNITE-12117:
-

 Summary: Historical rebalance should not be processed in striped 
way
 Key: IGNITE-12117
 URL: https://issues.apache.org/jira/browse/IGNITE-12117
 Project: Ignite
  Issue Type: Task
Reporter: Anton Vinogradov


We have to investigate whether it is possible to process historical rebalance 
like regular rebalance (using an unstriped pool) and, if possible, refactor it 
accordingly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (IGNITE-11607) Historical rebalance is not possible from partition which was recently rebalanced itself

2019-03-22 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-11607:
--

 Summary: Historical rebalance is not possible from partition which 
was recently rebalanced itself
 Key: IGNITE-11607
 URL: https://issues.apache.org/jira/browse/IGNITE-11607
 Project: Ignite
  Issue Type: Improvement
Reporter: Alexei Scherbakov
 Fix For: 2.8






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Historical rebalance

2018-12-03 Thread Vladimir Ozerov
Roman,

What is the advantage of your algorithm compared to the previous one? The
previous algorithm does almost the same, but without updating two separate
counters, and looks simpler to me. Only one update is sufficient, at
transaction commit. When a transaction starts we just read the currently
active update counter (LWM), which is enough for us to know where to start
from. Moreover, we do not need to track any kind of WAL pointers or write
additional WAL records.

Please note that we are trying to solve a more difficult problem: how to
rebalance as little WAL as possible in the case of long-running transactions.
On Mon, Dec 3, 2018 at 2:29 PM Roman Kondakov 
wrote:

> Vladimir,
>
> the difference between the per-transaction and the update-counter
> approaches is that at the time of the first update we don't know its
> actual update counter - we just count deltas during the enlist phase. The
> actual update counter is assigned at transaction commit. But in the
> per-transaction approach the actual HWM is known for each transaction
> from the very beginning, and this value is the same on the primary and
> the backups. Having this number, it is very easy to find where a
> transaction begins on any node.
>
>
> --
> Kind Regards
> Roman Kondakov
>
> On 03.12.2018 13:46, Vladimir Ozerov wrote:
> > Roman,
> >
> > We already track updates on per-transaction basis. The only difference is
> > that instead of doing a single "increment(1)" for transaction we do
> > "increment(X)" where X is number of updates in the given transaction.
> >
> > On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov  >
> > wrote:
> >
> >> Igor, Vladimir, Ivan,
> >>
> >> perhaps, we are focused too much on update counters. This feature was
> >> designed for the continuous queries and it may not be suited well for
> >> the historical rebalance. What if we would track updates on
> >> per-transaction basis instead of per-update basis? Let's consider two
> >> counters: low-water mark (LWM) and high-water mark (HWM) which should be
> >> added to each partition. They have the following properties:
> >>
> >> * HWM is a plain atomic counter. When Tx makes its first write on
> >> primary node it does incrementAndGet for this counter and remembers
> >> obtained value within its context. This counter can be considered as tx
> >> id within current partition - transactions should maintain per-partition
> >> map of their HWM ids. WAL pointer to the first record should be remembered
> >> in this map. Also this id should be recorded to WAL data records.
> >>
> >> When Tx sends updates to backups it sends Tx HWM too. When backup
> >> receives this message from the primary node it takes HWM and does
> >> setIfGreater on the local HWM counter.
> >>
> >> * LWM is a plain atomic counter. When Tx terminates (either with
> >> commit or rollback) it updates its local LWM in the same manner as
> >> update counters do it using holes tracking. For example, if partition's
> >> LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update
> >> partition LWM until tx with id = 11 is committed. When id = 11 is
> >> committed, LWM is set to 12. If we have LWM == N, this means that all
> >> transactions with id <= N have been terminated for the current partition
> >> and all data is already recorded in the local partition.
> >>
> >> Brief summary for both counters: HWM - means that partition has already
> >> seen at least one update of transactions with id <= HWM. LWM means that
> >> partition has all updates made by transactions with id <= LWM.
> >>
> >> LWM is always <= HWM.
> >>
> >> On checkpoint we should store only these two counters in checkpoint
> >> record. As optimization we can also store list of pending LWMs - ids
> >> which haven't been merged to LWM because of the holes in sequence.
> >>
> >> Historical rebalance:
> >>
> >> 1. Demander knows its LWM - all updates before it have been applied.
> >> Demander sends LWM to supplier.
> >>
> >> 2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM
> >> (demander)
> >>
> >> 3. Supplier starts moving forward on WAL until it finds first data
> >> record with HWM id = LWM (demander). From this point WAL can be
> >> rebalanced to demander.
> >>
> >> In this approach updates and checkpoints on primary and backup can be
> >> reordered in any way, but we can always find a proper point to read WAL
> >> from.

Re: Historical rebalance

2018-12-03 Thread Roman Kondakov

Vladimir,

the difference between the per-transaction and the update-counter approaches 
is that at the time of the first update we don't know its actual update 
counter - we just count deltas during the enlist phase. The actual update 
counter is assigned at transaction commit. But in the per-transaction approach 
the actual HWM is known for each transaction from the very beginning, and this 
value is the same on the primary and the backups. Having this number, it is 
very easy to find where a transaction begins on any node.



--
Kind Regards
Roman Kondakov

On 03.12.2018 13:46, Vladimir Ozerov wrote:

Roman,

We already track updates on per-transaction basis. The only difference is
that instead of doing a single "increment(1)" for transaction we do
"increment(X)" where X is number of updates in the given transaction.

On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov 
wrote:


Igor, Vladimir, Ivan,

perhaps, we are focused too much on update counters. This feature was
designed for the continuous queries and it may not be suited well for
the historical rebalance. What if we would track updates on
per-transaction basis instead of per-update basis? Let's consider two
counters: low-water mark (LWM) and high-water mark (HWM) which should be
added to each partition. They have the following properties:

* HWM is a plain atomic counter. When Tx makes its first write on
primary node it does incrementAndGet for this counter and remembers
obtained value within its context. This counter can be considered as tx
id within current partition - transactions should maintain per-partition
map of their HWM ids. WAL pointer to the first record should be remembered
in this map. Also this id should be recorded to WAL data records.

When Tx sends updates to backups it sends Tx HWM too. When backup
receives this message from the primary node it takes HWM and does
setIfGreater on the local HWM counter.

* LWM is a plain atomic counter. When Tx terminates (either with
commit or rollback) it updates its local LWM in the same manner as
update counters do it using holes tracking. For example, if partition's
LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update
partition LWM until tx with id = 11 is committed. When id = 11 is
committed, LWM is set to 12. If we have LWM == N, this means that all
transactions with id <= N have been terminated for the current partition
and all data is already recorded in the local partition.

Brief summary for both counters: HWM - means that partition has already
seen at least one update of transactions with id <= HWM. LWM means that
partition has all updates made by transactions with id <= LWM.

LWM is always <= HWM.

On checkpoint we should store only these two counters in checkpoint
record. As optimization we can also store list of pending LWMs - ids
which haven't been merged to LWM because of the holes in sequence.

Historical rebalance:

1. Demander knows its LWM - all updates before it have been applied.
Demander sends LWM to supplier.

2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM
(demander)

3. Supplier starts moving forward on WAL until it finds first data
record with HWM id = LWM (demander). From this point WAL can be
rebalanced to demander.

In this approach updates and checkpoints on primary and backup can be
reordered in any way, but we can always find a proper point to read WAL
from.

Let's consider a couple of examples. In this examples transaction
updates marked as w1(a) - transaction 1 updates key=a, c1 - transaction
1 is committed, cp(1, 0) - checkpoint with HWM=1 and  LWM=0. (HWM,LWM) -
current counters after operation. (HWM,LWM[hole1, hole2]) - counters
with holes in LWM.


1. Simple case with no reordering:

PRIMARY
-w1(a)---cp(1,0)---w2(b)----w1(c)--c1----c2----cp(2,2)
(HWM,LWM) (1,0)    (2,0)    (2,0)  (2,1) (2,2)
BACKUP
--w1(a)--w2(b)----w1(c)---cp(2,0)--c1----c2----cp(2,2)
(HWM,LWM) (1,0)   (2,0)   (2,0)    (2,1) (2,2)


In this case if backup failed before c1 it will receive all updates from
the beginning (HWM=0).
If it fails between c1 and c2, it will receive WAL from primary's cp(1,0),
because tx with id=1 is fully processed on backup: HWM(supplier cp(1,0))=1
== LWM(demander)=1
if backup fails after c2, it will receive nothing because it has all
updates HWM(supplier)=2 == LWM(demander)=2



2. Case with reordering

PRIMARY
-w1(a)---cp(1,0)---w2(b)--cp(2,0)--w1(c)--c1-c2---cp(2,2)
(HWM,LWM) (1,0)            (2,0)    (2,0)  (2,1) (2,2)
BACKUP
-w2(b)---w1(a)----cp(2,0)---w1(c)--c2--------c1-----cp(2,2)
(HWM,LWM) (2,0)    (2,0)     (2,0)  (2,0[2])  (2,2)

Re: Historical rebalance

2018-12-03 Thread Vladimir Ozerov
Roman,

We already track updates on per-transaction basis. The only difference is
that instead of doing a single "increment(1)" for transaction we do
"increment(X)" where X is number of updates in the given transaction.

On Mon, Dec 3, 2018 at 1:16 PM Roman Kondakov 
wrote:

> Igor, Vladimir, Ivan,
>
> perhaps, we are focused too much on update counters. This feature was
> designed for the continuous queries and it may not be suited well for
> the historical rebalance. What if we would track updates on
> per-transaction basis instead of per-update basis? Let's consider two
> counters: low-water mark (LWM) and high-water mark (HWM) which should be
> added to each partition. They have the following properties:
>
> * HWM is a plain atomic counter. When Tx makes its first write on
> primary node it does incrementAndGet for this counter and remembers
> obtained value within its context. This counter can be considered as tx
> id within current partition - transactions should maintain per-partition
> map of their HWM ids. WAL pointer to the first record should be remembered
> in this map. Also this id should be recorded to WAL data records.
>
> When Tx sends updates to backups it sends Tx HWM too. When backup
> receives this message from the primary node it takes HWM and does
> setIfGreater on the local HWM counter.
>
> * LWM is a plain atomic counter. When Tx terminates (either with
> commit or rollback) it updates its local LWM in the same manner as
> update counters do it using holes tracking. For example, if partition's
> LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update
> partition LWM until tx with id = 11 is committed. When id = 11 is
> committed, LWM is set to 12. If we have LWM == N, this means that all
> transactions with id <= N have been terminated for the current partition
> and all data is already recorded in the local partition.
>
> Brief summary for both counters: HWM - means that partition has already
> seen at least one update of transactions with id <= HWM. LWM means that
> partition has all updates made by transactions with id <= LWM.
>
> LWM is always <= HWM.
>
> On checkpoint we should store only these two counters in checkpoint
> record. As optimization we can also store list of pending LWMs - ids
> which haven't been merged to LWM because of the holes in sequence.
>
> Historical rebalance:
>
> 1. Demander knows its LWM - all updates before it have been applied.
> Demander sends LWM to supplier.
>
> 2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM
> (demander)
>
> 3. Supplier starts moving forward on WAL until it finds first data
> record with HWM id = LWM (demander). From this point WAL can be
> rebalanced to demander.
>
> In this approach updates and checkpoints on primary and backup can be
> reordered in any way, but we can always find a proper point to read WAL
> from.
>
> Let's consider a couple of examples. In this examples transaction
> updates marked as w1(a) - transaction 1 updates key=a, c1 - transaction
> 1 is committed, cp(1, 0) - checkpoint with HWM=1 and  LWM=0. (HWM,LWM) -
> current counters after operation. (HWM,LWM[hole1, hole2]) - counters
> with holes in LWM.
>
>
> 1. Simple case with no reordering:
>
> PRIMARY
> -w1(a)---cp(1,0)---w2(b)----w1(c)--c1----c2----cp(2,2)
> (HWM,LWM) (1,0)    (2,0)    (2,0)  (2,1) (2,2)
> BACKUP
> --w1(a)--w2(b)----w1(c)---cp(2,0)--c1----c2----cp(2,2)
> (HWM,LWM) (1,0)   (2,0)   (2,0)    (2,1) (2,2)
>
>
> In this case if backup failed before c1 it will receive all updates from
> the beginning (HWM=0).
> If it fails between c1 and c2, it will receive WAL from primary's cp(1,0),
> because tx with id=1 is fully processed on backup: HWM(supplier cp(1,0))=1
> == LWM(demander)=1
> if backup fails after c2, it will receive nothing because it has all
> updates HWM(supplier)=2 == LWM(demander)=2
>
>
>
> 2. Case with reordering
>
> PRIMARY
> -w1(a)---cp(1,0)---w2(b)--cp(2,0)--w1(c)--c1-c2---cp(2,2)
> (HWM,LWM) (1,0)            (2,0)    (2,0)  (2,1) (2,2)
> BACKUP
> -w2(b)---w1(a)----cp(2,0)---w1(c)--c2--------c1-----cp(2,2)
> (HWM,LWM) (2,0)    (2,0)     (2,0)  (2,0[2])  (2,2)

Re: Historical rebalance

2018-12-03 Thread Roman Kondakov

Igor, Vladimir, Ivan,

perhaps, we are focused too much on update counters. This feature was 
designed for the continuous queries and it may not be suited well for 
the historical rebalance. What if we would track updates on 
per-transaction basis instead of per-update basis? Let's consider two 
counters: low-water mark (LWM) and high-water mark (HWM) which should be 
added to each partition. They have the following properties:


* HWM is a plain atomic counter. When Tx makes its first write on 
primary node it does incrementAndGet for this counter and remembers 
obtained value within its context. This counter can be considered as tx 
id within current partition - transactions should maintain per-partition 
map of their HWM ids. WAL pointer to the first record should be remembered 
in this map. Also this id should be recorded to WAL data records.


When Tx sends updates to backups it sends Tx HWM too. When backup 
receives this message from the primary node it takes HWM and does 
setIfGreater on the local HWM counter.


* LWM is a plain atomic counter. When Tx terminates (either with 
commit or rollback) it updates its local LWM in the same manner as 
update counters do it using holes tracking. For example, if partition's 
LWM = 10 now, and tx with id (HWM id) = 12 commits, we do not update 
partition LWM until tx with id = 11 is committed. When id = 11 is 
committed, LWM is set to 12. If we have LWM == N, this means that all 
transactions with id <= N have been terminated for the current partition 
and all data is already recorded in the local partition.


Brief summary for both counters: HWM - means that partition has already 
seen at least one update of transactions with id <= HWM. LWM means that 
partition has all updates made by transactions with id <= LWM.


LWM is always <= HWM.

On checkpoint we should store only these two counters in checkpoint 
record. As optimization we can also store list of pending LWMs - ids 
which haven't been merged to LWM because of the holes in sequence.
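The two counters with hole tracking can be sketched like this. This is an illustration of the idea only, with hypothetical names; it is not Ignite's actual update-counter code:

```java
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the proposed per-partition HWM/LWM counters with hole tracking. */
public final class PartitionTxCounters {
    private final AtomicLong hwm = new AtomicLong();
    private long lwm;
    /** Ids of terminated transactions that are still ahead of LWM (the "holes"). */
    private final TreeSet<Long> pending = new TreeSet<>();

    /** First write of a tx on the primary: assign a per-partition tx id. */
    public long assignTxId() {
        return hwm.incrementAndGet();
    }

    /** Backup received the primary's HWM with an update: setIfGreater. */
    public void onHwmReceived(long rmtHwm) {
        hwm.accumulateAndGet(rmtHwm, Math::max);
    }

    /** Tx with the given id terminated (commit or rollback): advance LWM over holes. */
    public synchronized void onTxDone(long txId) {
        pending.add(txId);

        // Merge the contiguous run of terminated ids into LWM.
        while (pending.remove(lwm + 1))
            lwm++;
    }

    public synchronized long lwm() { return lwm; }

    public long hwm() { return hwm.get(); }
}
```

In the example from the text, with LWM = 0, committing tx 2 first leaves LWM at 0; only when tx 1 also terminates does LWM jump to 2.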


Historical rebalance:

1. Demander knows its LWM - all updates before it have been applied. 
Demander sends LWM to supplier.


2. Supplier finds the earliest checkpoint where HWM(supplier) <= LWM 
(demander)


3. Supplier starts moving forward on WAL until it finds first data 
record with HWM id = LWM (demander). From this point WAL can be 
rebalanced to demander.


In this approach updates and checkpoints on primary and backup can be 
reordered in any way, but we can always find a proper point to read WAL 
from.


Let's consider a couple of examples. In this examples transaction 
updates marked as w1(a) - transaction 1 updates key=a, c1 - transaction 
1 is committed, cp(1, 0) - checkpoint with HWM=1 and  LWM=0. (HWM,LWM) - 
current counters after operation. (HWM,LWM[hole1, hole2]) - counters 
with holes in LWM.



1. Simple case with no reordering:

PRIMARY 
-w1(a)---cp(1,0)---w2(b)----w1(c)--c1----c2----cp(2,2)
(HWM,LWM) (1,0)    (2,0)    (2,0)  (2,1) (2,2)
BACKUP
--w1(a)--w2(b)----w1(c)---cp(2,0)--c1----c2----cp(2,2)
(HWM,LWM) (1,0)   (2,0)   (2,0)    (2,1) (2,2)

  
In this case, if the backup failed before c1, it will receive all updates from 
the beginning (HWM=0).
If it fails between c1 and c2, it will receive the WAL from the primary's 
cp(1,0), because tx with id=1 is fully processed on the backup: 
HWM(supplier cp(1,0))=1 == LWM(demander)=1.
If the backup fails after c2, it will receive nothing, because it has all 
updates: HWM(supplier)=2 == LWM(demander)=2.



2. Case with reordering

PRIMARY 
-w1(a)---cp(1,0)---w2(b)--cp(2,0)--w1(c)--c1----c2----cp(2,2)
(HWM,LWM) (1,0)            (2,0)    (2,0)  (2,1) (2,2)
BACKUP
-w2(b)---w1(a)----cp(2,0)---w1(c)--c2--------c1-----cp(2,2)
(HWM,LWM) (2,0)    (2,0)     (2,0)  (2,0[2])  (2,2)


Note that here we have a hole on the backup when tx2 committed earlier than 
tx1, so LWM wasn't changed at that moment.

In the last case, if the backup failed before c1, the entire WAL will be 
supplied, because LWM=0 until this moment.
If the backup fails after c1, there is nothing to rebalance, because 
HWM(supplier)=2 == LWM(demander)=2.


What do you think?


--
Kind Regards
Roman Kondakov

On 30.11.2018 2:01, Seliverstov Igor wrote:

Vladimir,

Look at my example:

One active transaction (Tx1, which does opX ops) is running while another tx 
(Tx2, which does opX' ops) finishes with uc4:

Re: Historical rebalance

2018-11-29 Thread Seliverstov Igor
Vladimir,

Look at my example:

One active transaction (Tx1, which does opX ops) is running while another tx
(Tx2, which does opX' ops) finishes with uc4:

Node1:     uc1--op1--op2---uc2--op1'--uc3--uc4---op3X
Node2 Tx1: uc1--op1--uc2---op2--uc3---op3--uc4---cp1
Node2 Tx2: uc1--uc2--uc3---op1'--uc4--cp1

state on Node2: tx1 -> op3 -> uc2
                cp1 [current=uc4, backpointer=uc2]

Here op2 was acknowledged by op3, and op3 was applied before op1' (linearized
by the WAL).

All nodes having uc4 must have op1', because uc4 cannot be obtained earlier
than the prepare stage, while the prepare stage happens after all updates, so
*op1' happens before uc4* regardless of whether Tx2 was committed or rolled
back.

This means *op2 happens before uc4* (uc4 cannot be earlier than op2 on any
node, because on Node2 op2 was already finished (acknowledged by op3) when
op1' happened).

That was my idea, which is easy to prove.

You used a different approach, but yes, it has to work.

Thu, 29 Nov 2018 at 22:19, Vladimir Ozerov :

> "If more recent WAL records will contain *ALL* updates of the transaction"
> -> "More recent WAL records will contain *ALL* updates of the transaction"
>
> On Thu, Nov 29, 2018 at 10:15 PM Vladimir Ozerov 
> wrote:
>
> > Igor,
> >
> > Yes, I tried to draw different configurations, and it really seems to
> > work, despite of being very hard to proof due to non-inituitive HB edges.
> > So let me try to spell the algorithm once again to make sure that we are
> on
> > the same page here.
> >
> > 1) There are two nodes - primary (P) and backup (B)
> > 2) There are three type of events: small transactions which possibly
> > increments update counter (ucX), one long active transaction which is
> split
> > into multiple operations (opX), and checkpoints (cpX)
> > 3) Every node always has current update counter. When transaction commits
> > it may or may not shift this counter further depending on whether there
> are
> > holes behind. But we have a strict rule that it always grow. Higher
> > coutners synchrnoizes with smaller. Possible cases:
> > uc1--uc2--uc3
> > uc1--uc3--- // uc2 missing due to reorder, but it is ok
> >
> > 4) Operations within a single transaction is always applied sequentially,
> > and hence also have HB edge:
> > op1--op2--op3
> >
> > 5) When transaction operation happens, we save in memory current update
> > counter available at this moment. I.e. we have a map from transaction ID
> to
> > update counter which was relevant by the time last *completed* operation
> > *started*. This is very important thing - we remember the counter when
> > operation starts, but update the map only when it finishes. This is
> needed
> > for situation when update counter is bumber in the middle of a long
> > operation.
> > uc1--op1--op2--uc2--uc3--op3
> >       |    |              |
> >      uc1  uc1            uc3
> >
> > state: tx1 -> op3 -> uc3
> >
> > 6) Whenever checkpoint occurs, we save two counters with: "current" and
> > "backpointer". The latter is the smallest update counter associated with
> > active transactions. If there are no active transactions, current update
> > counter is used.
> >
> > Example 1: no active transactions.
> > uc1----cp1
> >
> > state: cp1 [current=uc1, backpointer=uc1]
> >
> > Example 2: one active transaction:
> > uc1--op1--uc2--op2--op3--uc3--cp1
> >
> > state: tx1 -> op3 -> uc2
> >cp1 [current=uc3, backpointer=uc2]
> >
> > 7) Historical rebalance:
> > 7.1) Demander finds latest checkpoint, get it's backpointer and sends it
> > to supplier.
> > 7.2) Supplier finds earliest checkpoint where [supplier(current) <=
> > demander(backpointer)]
> > 7.3) Supplier reads checkpoint backpointer and finds associated WAL
> > record. This is where

Re: Historical rebalance

2018-11-29 Thread Vladimir Ozerov
Igor,

Yes, I tried to draw different configurations, and it really seems to work,
despite being very hard to prove due to non-intuitive HB edges. So let me try
to spell out the algorithm once again to make sure that we are on the same
page here.

1) There are two nodes - primary (P) and backup (B)
2) There are three types of events: small transactions which possibly
increment the update counter (ucX), one long active transaction which is split
into multiple operations (opX), and checkpoints (cpX)
3) Every node always has a current update counter. When a transaction commits,
it may or may not shift this counter further, depending on whether there are
holes behind it. But we have a strict rule that it always grows. Higher
counters synchronize with smaller ones. Possible cases:
uc1--uc2--uc3
uc1--uc3--- // uc2 missing due to reorder, but it is ok

4) Operations within a single transaction are always applied sequentially,
and hence also have an HB edge:
op1--op2--op3

5) When a transaction operation happens, we save in memory the current update
counter available at that moment. I.e. we have a map from transaction ID to
the update counter which was relevant by the time the last *completed*
operation *started*. This is a very important point - we remember the counter
when an operation starts, but update the map only when it finishes. This is
needed for the situation when the update counter is bumped in the middle of a
long operation.
uc1--op1--op2--uc2--uc3--op3
      |    |              |
     uc1  uc1            uc3

state: tx1 -> op3 -> uc3
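Point 5 can be sketched as a toy tracker (plain Python, names of my own, not Ignite code): the counter is snapshotted when an operation starts, but published to the map only when the operation finishes, so a counter bump in the middle of a long operation is not picked up.

```python
class TxCounterTracker:
    """Map from tx id to the update counter seen when the last
    COMPLETED operation STARTED."""
    def __init__(self):
        self.current_uc = 0
        self.tx_map = {}                    # tx id -> counter snapshot

    def on_uc(self, uc):
        self.current_uc = max(self.current_uc, uc)  # the counter only grows

    def begin_op(self, tx):
        return self.current_uc              # snapshot taken at operation start

    def end_op(self, tx, snapshot):
        self.tx_map[tx] = snapshot          # published only on operation finish

t = TxCounterTracker()
t.on_uc(1)                                  # uc1
snap = t.begin_op("tx1"); t.end_op("tx1", snap)   # op1 (sees uc1)
snap = t.begin_op("tx1"); t.end_op("tx1", snap)   # op2 (still sees uc1)
t.on_uc(2); t.on_uc(3)                      # uc2, uc3
snap = t.begin_op("tx1")                    # op3 starts, sees uc3
t.on_uc(4)                                  # counter bumped in the middle of op3
t.end_op("tx1", snap)                       # map records uc3, not uc4
print(t.tx_map)                             # {'tx1': 3}
```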

6) Whenever checkpoint occurs, we save two counters with: "current" and
"backpointer". The latter is the smallest update counter associated with
active transactions. If there are no active transactions, current update
counter is used.

Example 1: no active transactions.
uc1----cp1

state: cp1 [current=uc1, backpointer=uc1]

Example 2: one active transaction:
uc1--op1--uc2--op2--op3--uc3--cp1

state: tx1 -> op3 -> uc2
   cp1 [current=uc3, backpointer=uc2]
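Point 6 can be sketched in a few lines (plain Python, names of my own, not Ignite code): the backpointer is the smallest counter associated with an active transaction, or the current counter when there are none.

```python
def checkpoint_record(current_uc, active_tx_counters):
    """active_tx_counters: tx id -> counter from the tracker map (p.5)."""
    backpointer = min(active_tx_counters.values(), default=current_uc)
    return {"current": current_uc, "backpointer": backpointer}

# Example 1: no active transactions.
print(checkpoint_record(1, {}))           # {'current': 1, 'backpointer': 1}
# Example 2: one active tx whose last completed op started at uc2.
print(checkpoint_record(3, {"tx1": 2}))   # {'current': 3, 'backpointer': 2}
```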

7) Historical rebalance:
7.1) The demander finds the latest checkpoint, gets its backpointer and sends
it to the supplier.
7.2) The supplier finds the earliest checkpoint where [supplier(current) <=
demander(backpointer)]
7.3) The supplier reads the checkpoint backpointer and finds the associated
WAL record. This is where we start.

So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current <=
demander_uc_backpointer)] <- demander[uc_backpointer <- cp(last)]
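Steps 7.2-7.3 can be sketched as follows (plain Python, not Ignite code; this reads "the closest checkpoint in the past" from the reasoning below as: the last checkpoint whose current counter is still <= the demander's backpointer):

```python
def supplier_start_pointer(checkpoints, demander_backpointer):
    """checkpoints: [{'current': uc, 'backpointer': uc}, ...] in WAL order.
    Returns the counter to start WAL iteration from, or None for a full
    rebalance."""
    chosen = None
    for cp in checkpoints:
        if cp["current"] <= demander_backpointer:
            chosen = cp      # the demander's counter hasn't appeared or just appeared
        else:
            break
    # iteration starts from the chosen checkpoint's OWN backpointer (7.3)
    return None if chosen is None else chosen["backpointer"]

cps = [{"current": 1, "backpointer": 1}, {"current": 3, "backpointer": 2}]
print(supplier_start_pointer(cps, 2))   # 1: iterate WAL starting from uc1
print(supplier_start_pointer(cps, 0))   # None: no safe checkpoint, full rebalance
```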

Now the most important part - why it works :-)
1) Transaction operations are sequential, so at the time of a crash nodes are
*at most one operation ahead* of each other
2) The demander goes to the past and finds the update counter which was
current at the time of the last completed TX operation
3) The supplier goes to the closest checkpoint in the past where this update
counter either doesn't exist or has just appeared
4) A transaction cannot be committed on the supplier at this checkpoint, as it
would violate the UC happens-before rule
5) A transaction may have not started yet on the supplier at this point. More
recent WAL records will contain *ALL* updates of the transaction
6) A transaction may exist on the supplier at this checkpoint. Thanks to p.1
we must skip at most one operation. Jumping back through the supplier's
checkpoint backpointer is guaranteed to do this.

Igor, do we have the same understanding here?

Vladimir.

On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor 
wrote:

> Ivan,
>
> different transactions may be applied in different order on backup nodes.
> That's why we need an active tx set
> and some sorting by their update times. The idea is to identify a point in
> time which starting from we may lost some updates.
> This point:
>1) is the last acknowledged by all backups (including possible further
> demander) update on timeline;
>2) have a specific update counter (aka back-counter) which we going to
> start iteration from.
>
> After additional thinking on, I've identified a rule:
>
> There is two fences:
>   1) update counter (UC) - this means that all updates, with less UC than
> applied one, was applied on a node, having this UC.
>   2) update in scope of TX - all updates are applied one by one
> sequentially, this means that the fact of update guaranties the previous
> update (statement) was finished on all TX participants.
>
> Combining them, we can say the next:
>
> All updates, that was acknowledged at the time the last update of tx, which
> updated UC, applied, are guaranteed to be presented on a node having such
> UC
>
> We can use this rule to find an iterator start pointer.
>
> Wed, 28 Nov 2018 at

Re: Historical rebalance

2018-11-29 Thread Vladimir Ozerov
"If more recent WAL records will contain *ALL* updates of the transaction"
-> "More recent WAL records will contain *ALL* updates of the transaction"

On Thu, Nov 29, 2018 at 10:15 PM Vladimir Ozerov 
wrote:

> Igor,
>
> Yes, I tried to draw different configurations, and it really seems to
> work, despite of being very hard to proof due to non-inituitive HB edges.
> So let me try to spell the algorithm once again to make sure that we are on
> the same page here.
>
> 1) There are two nodes - primary (P) and backup (B)
> 2) There are three type of events: small transactions which possibly
> increments update counter (ucX), one long active transaction which is split
> into multiple operations (opX), and checkpoints (cpX)
> 3) Every node always has current update counter. When transaction commits
> it may or may not shift this counter further depending on whether there are
> holes behind. But we have a strict rule that it always grow. Higher
> coutners synchrnoizes with smaller. Possible cases:
> uc1--uc2--uc3
> uc1--uc3--- // uc2 missing due to reorder, but it is ok
>
> 4) Operations within a single transaction is always applied sequentially,
> and hence also have HB edge:
> op1--op2--op3
>
> 5) When transaction operation happens, we save in memory current update
> counter available at this moment. I.e. we have a map from transaction ID to
> update counter which was relevant by the time last *completed* operation
> *started*. This is very important thing - we remember the counter when
> operation starts, but update the map only when it finishes. This is needed
> for situation when update counter is bumber in the middle of a long
> operation.
> uc1--op1--op2--uc2--uc3--op3
>       |    |              |
>      uc1  uc1            uc3
>
> state: tx1 -> op3 -> uc3
>
> 6) Whenever checkpoint occurs, we save two counters with: "current" and
> "backpointer". The latter is the smallest update counter associated with
> active transactions. If there are no active transactions, current update
> counter is used.
>
> Example 1: no active transactions.
> uc1----cp1
>
> state: cp1 [current=uc1, backpointer=uc1]
>
> Example 2: one active transaction:
> uc1--op1--uc2--op2--op3--uc3--cp1
>
> state: tx1 -> op3 -> uc2
>cp1 [current=uc3, backpointer=uc2]
>
> 7) Historical rebalance:
> 7.1) Demander finds latest checkpoint, get it's backpointer and sends it
> to supplier.
> 7.2) Supplier finds earliest checkpoint where [supplier(current) <=
> demander(backpointer)]
> 7.3) Supplier reads checkpoint backpointer and finds associated WAL
> record. This is where we start.
>
> So in terms of WAL we have: supplier[uc_backpointer <- cp(uc_current <=
> demanter_uc_backpointer)] <- demander[uc_backpointer <- cp(last)]
>
> Now the most important - why it works :-)
> 1) Transaction opeartions are sequential, so at the time of crash nodes
> are *at most one operation ahead *each other
> 2) Demander goes to the past and finds update counter which was current at
> the time of last TX completed operation
> 3) Supplier goes to the closest checkpoint in the past where this update
> counter either doesn't exist or just appeared
> 4) Transaction cannot be committed on supplier at this checkpoint, as it
> would violate UC happens-before rule
> 5) Tranasction may have not started yet on supplier at this point. If more
> recent WAL records will contain *ALL* updates of the transaction
> 6) Transaction may exist on supplier at this checkpoint. Thanks to p.1 we
> must skip at most one operation. Jump back through supplier's checkpoint
> backpointer is guaranteed to do this.
>
> Igor, do we have the same understanding here?
>
> Vladimir.
>
> On Thu, Nov 29, 2018 at 2:47 PM Seliverstov Igor 
> wrote:
>
>> Ivan,
>>
>> different transactions may be applied in different order on backup nodes.
>> That's why we need an active tx set
>> and some sorting by their update times. The idea is to identify a point in
>> time which starting from we may lost some updates.
>> This point:
>>1) is the last acknowledged by all backups (including possible further
>> demander) update on timeline;
>>2) have a specific update counter (aka back-counter) which we going to
>> start iteration from.
>>
>> After additional thinking on it, I've identified a rule:

Re: Historical rebalance

2018-11-29 Thread Seliverstov Igor
need any additional info in demand message.
> Start point
> > > > > can be easily determined using stored WAL "back-pointer".
> > > > >
> > > > > [1]
> > > > >
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> > > > >
> > > > >
> > > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov <
> voze...@gridgain.com>:
> > > > >
> > > > >> Igor,
> > > > >>
> > > > >> Could you please elaborate - what is the whole set of information
> we are
> > > > >> going to save at checkpoint time? From what I understand this
> should be:
> > > > >> 1) List of active transactions with WAL pointers of their first
> writes
> > > > >> 2) List of prepared transactions with their update counters
> > > > >> 3) Partition counter low watermark (LWM) - the smallest partition
> counter
> > > > >> before which there are no prepared transactions.
> > > > >>
> > > > >> And the we send to supplier node a message: "Give me all updates
> starting
> > > > >> from that LWM plus data for that transactions which were active
> when I
> > > > >> failed".
> > > > >>
> > > > >> Am I right?
> > > > >>
> > > > >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor <
> gvvinbl...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Igniters,
> > > > >> >
> > > > >> > Currently I’m working on possible approaches how to implement
> historical
> > > > >> > rebalance (delta rebalance using WAL iterator) over MVCC caches.
> > > > >> >
> > > > >> > The main difficulty is that MVCC writes changes on tx active
> phase while
> > > > >> > partition update version, aka update counter, is being applied
> on tx
> > > > >> > finish. This means we cannot start iteration over WAL right
> from the
> > > > >> > pointer where the update counter updated, but should include
> updates,
> > > > >> which
> > > > >> > the transaction that updated the counter did.
> > > > >> >
> > > > >> > These updates may be much earlier than the point where the
> update
> > > > >> counter
> > > > >> > was updated, so we have to be able to identify the point where
> the first
> > > > >> > update happened.
> > > > >> >
> > > > >> > The proposed approach includes:
> > > > >> >
> > > > >> > 1) preserve list of active txs, sorted by the time of their
> first update
> > > > >> > (using WAL ptr of first WAL record in tx)
> > > > >> >
> > > > >> > 2) persist this list on each checkpoint (together with TxLog for
> > > > >> example)
> > > > >> >
> > > > >> > 4) send whole active tx list (transactions which were in active
> state at
> > > > >> > the time the node was crushed, empty list in case of graceful
> node
> > > > >> stop) as
> > > > >> > a part of partition demand message.
> > > > >> >
> > > > >> > 4) find a checkpoint where the earliest tx exists in persisted
> txs and
> > > > >> use
> > > > >> > saved WAL ptr as a start point or apply current approach in
> case the
> > > > >> active
> > > > >> > tx list (sent on previous step) is empty
> > > > >> >
> > > > >> > 5) start iteration.
> > > > >> >
> > > > >> > Your thoughts?
> > > > >> >
> > > > >> > Regards,
> > > > >> > Igor
> > > > >>
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
>
>
>
> --
> Best regards,
> Ivan Pavlukhin
>


Re: Historical rebalance

2018-11-28 Thread Павлухин Иван
Guys,

Another idea. We can introduce an additional update counter which is
incremented by MVCC transactions right after executing an operation (like
it is done for classic transactions). And we can use that counter for
searching for the needed WAL records. Can it do the trick?

P.S. Mentally I am trying to separate facilities providing
transactions and durability. And it seems to me that those facilities
are in different dimensions.
Wed, 28 Nov 2018 at 16:26, Павлухин Иван :
>
> Sorry, if it was stated that a SINGLE transaction updates are applied
> in a same order on all replicas then I have no questions so far. I
> thought about reordering updates coming from different transactions.
> > I have not got why we can assume that reordering is not possible. What
> have I missed?
> ср, 28 нояб. 2018 г. в 13:26, Павлухин Иван :
> >
> > Hi,
> >
> > Regarding Vladimir's new idea.
> > > We assume that transaction can be represented as a set of independent 
> > > operations, which are applied in the same order on both primary and 
> > > backup nodes.
> > I have not got why we can assume that reordering is not possible. What
> > have I missed?
> > вт, 27 нояб. 2018 г. в 14:42, Seliverstov Igor :
> > >
> > > Vladimir,
> > >
> > > I think I got your point,
> > >
> > > It should work if we do the next:
> > > introduce two structures: active list (txs) and candidate list (updCntr ->
> > > txn pairs)
> > >
> > > Track active txs, mapping them to actual update counter at update time.
> > > On each next update put update counter, associated with previous update,
> > > into a candidates list possibly overwrite existing value (checking txn)
> > > On tx finish remove tx from active list only if appropriate update counter
> > > (associated with finished tx) is applied.
> > > On update counter update set the minimal update counter from the 
> > > candidates
> > > list as a back-counter, clear the candidate list and remove an associated
> > > tx from the active list if present.
> > > Use back-counter instead of actual update counter in demand message.
> > >
> > > вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor :
> > >
> > > > Ivan,
> > > >
> > > > 1) The list is saved on each checkpoint, wholly (all transactions in
> > > > active state at checkpoint begin).
> > > > We need whole the list to get oldest transaction because after
> > > > the previous oldest tx finishes, we need to get the following one.
> > > >
> > > > 2) I guess there is a description of how persistent storage works and 
> > > > how
> > > > it restores [1]
> > > >
> > > > Vladimir,
> > > >
> > > > the whole list of what we going to store on checkpoint (updated):
> > > > 1) Partition counter low watermark (LWM)
> > > > 2) WAL pointer of earliest active transaction write to partition at the
> > > > time the checkpoint have started
> > > > 3) List of prepared txs with acquired partition counters (which were
> > > > acquired but not applied yet)
> > > >
> > > > This way we don't need any additional info in demand message. Start 
> > > > point
> > > > can be easily determined using stored WAL "back-pointer".
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> > > >
> > > >
> > > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov :
> > > >
> > > >> Igor,
> > > >>
> > > >> Could you please elaborate - what is the whole set of information we 
> > > >> are
> > > >> going to save at checkpoint time? From what I understand this should 
> > > >> be:
> > > >> 1) List of active transactions with WAL pointers of their first writes
> > > >> 2) List of prepared transactions with their update counters
> > > >> 3) Partition counter low watermark (LWM) - the smallest partition 
> > > >> counter
> > > >> before which there are no prepared transactions.
> > > >>
> > > >> And the we send to supplier node a message: "Give me all updates 
> > > >> starting
> > > >> from that LWM plus data for that transactions which were active when I
> > > >> failed".
> > > >>
&

Re: Historical rebalance

2018-11-28 Thread Павлухин Иван
Sorry, if it was stated that a SINGLE transaction's updates are applied
in the same order on all replicas, then I have no questions so far. I
thought about reordering of updates coming from different transactions.
> I have not got why we can assume that reordering is not possible. What
have I missed?
Wed, 28 Nov 2018 at 13:26, Павлухин Иван :
>
> Hi,
>
> Regarding Vladimir's new idea.
> > We assume that transaction can be represented as a set of independent 
> > operations, which are applied in the same order on both primary and backup 
> > nodes.
> I have not got why we can assume that reordering is not possible. What
> have I missed?
> вт, 27 нояб. 2018 г. в 14:42, Seliverstov Igor :
> >
> > Vladimir,
> >
> > I think I got your point,
> >
> > It should work if we do the next:
> > introduce two structures: active list (txs) and candidate list (updCntr ->
> > txn pairs)
> >
> > Track active txs, mapping them to actual update counter at update time.
> > On each next update put update counter, associated with previous update,
> > into a candidates list possibly overwrite existing value (checking txn)
> > On tx finish remove tx from active list only if appropriate update counter
> > (associated with finished tx) is applied.
> > On update counter update set the minimal update counter from the candidates
> > list as a back-counter, clear the candidate list and remove an associated
> > tx from the active list if present.
> > Use back-counter instead of actual update counter in demand message.
> >
> > вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor :
> >
> > > Ivan,
> > >
> > > 1) The list is saved on each checkpoint, wholly (all transactions in
> > > active state at checkpoint begin).
> > > We need whole the list to get oldest transaction because after
> > > the previous oldest tx finishes, we need to get the following one.
> > >
> > > 2) I guess there is a description of how persistent storage works and how
> > > it restores [1]
> > >
> > > Vladimir,
> > >
> > > the whole list of what we going to store on checkpoint (updated):
> > > 1) Partition counter low watermark (LWM)
> > > 2) WAL pointer of earliest active transaction write to partition at the
> > > time the checkpoint have started
> > > 3) List of prepared txs with acquired partition counters (which were
> > > acquired but not applied yet)
> > >
> > > This way we don't need any additional info in demand message. Start point
> > > can be easily determined using stored WAL "back-pointer".
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> > >
> > >
> > > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov :
> > >
> > >> Igor,
> > >>
> > >> Could you please elaborate - what is the whole set of information we are
> > >> going to save at checkpoint time? From what I understand this should be:
> > >> 1) List of active transactions with WAL pointers of their first writes
> > >> 2) List of prepared transactions with their update counters
> > >> 3) Partition counter low watermark (LWM) - the smallest partition counter
> > >> before which there are no prepared transactions.
> > >>
> > >> And the we send to supplier node a message: "Give me all updates starting
> > >> from that LWM plus data for that transactions which were active when I
> > >> failed".
> > >>
> > >> Am I right?
> > >>
> > >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
> > >> wrote:
> > >>
> > >> > Hi Igniters,
> > >> >
> > >> > Currently I’m working on possible approaches how to implement 
> > >> > historical
> > >> > rebalance (delta rebalance using WAL iterator) over MVCC caches.
> > >> >
> > >> > The main difficulty is that MVCC writes changes on tx active phase 
> > >> > while
> > >> > partition update version, aka update counter, is being applied on tx
> > >> > finish. This means we cannot start iteration over WAL right from the
> > >> > pointer where the update counter updated, but should include updates,
> > >> which
> > >> > the transaction that updated the counter did.
> > >> >
> > >> > These updates may be m

Re: Historical rebalance

2018-11-28 Thread Павлухин Иван
Hi,

Regarding Vladimir's new idea.
> We assume that transaction can be represented as a set of independent 
> operations, which are applied in the same order on both primary and backup 
> nodes.
I have not understood why we can assume that reordering is not possible. What
have I missed?
Tue, 27 Nov 2018 at 14:42, Seliverstov Igor :
>
> Vladimir,
>
> I think I got your point,
>
> It should work if we do the next:
> introduce two structures: active list (txs) and candidate list (updCntr ->
> txn pairs)
>
> Track active txs, mapping them to actual update counter at update time.
> On each next update put update counter, associated with previous update,
> into a candidates list possibly overwrite existing value (checking txn)
> On tx finish remove tx from active list only if appropriate update counter
> (associated with finished tx) is applied.
> On update counter update set the minimal update counter from the candidates
> list as a back-counter, clear the candidate list and remove an associated
> tx from the active list if present.
> Use back-counter instead of actual update counter in demand message.
>
> вт, 27 нояб. 2018 г. в 12:56, Seliverstov Igor :
>
> > Ivan,
> >
> > 1) The list is saved on each checkpoint, wholly (all transactions in
> > active state at checkpoint begin).
> > We need whole the list to get oldest transaction because after
> > the previous oldest tx finishes, we need to get the following one.
> >
> > 2) I guess there is a description of how persistent storage works and how
> > it restores [1]
> >
> > Vladimir,
> >
> > the whole list of what we going to store on checkpoint (updated):
> > 1) Partition counter low watermark (LWM)
> > 2) WAL pointer of earliest active transaction write to partition at the
> > time the checkpoint have started
> > 3) List of prepared txs with acquired partition counters (which were
> > acquired but not applied yet)
> >
> > This way we don't need any additional info in demand message. Start point
> > can be easily determined using stored WAL "back-pointer".
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
> >
> >
> > вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov :
> >
> >> Igor,
> >>
> >> Could you please elaborate - what is the whole set of information we are
> >> going to save at checkpoint time? From what I understand this should be:
> >> 1) List of active transactions with WAL pointers of their first writes
> >> 2) List of prepared transactions with their update counters
> >> 3) Partition counter low watermark (LWM) - the smallest partition counter
> >> before which there are no prepared transactions.
> >>
> >> And the we send to supplier node a message: "Give me all updates starting
> >> from that LWM plus data for that transactions which were active when I
> >> failed".
> >>
> >> Am I right?
> >>
> >> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
> >> wrote:
> >>
> >> > Hi Igniters,
> >> >
> >> > Currently I’m working on possible approaches how to implement historical
> >> > rebalance (delta rebalance using WAL iterator) over MVCC caches.
> >> >
> >> > The main difficulty is that MVCC writes changes on tx active phase while
> >> > partition update version, aka update counter, is being applied on tx
> >> > finish. This means we cannot start iteration over WAL right from the
> >> > pointer where the update counter updated, but should include updates,
> >> which
> >> > the transaction that updated the counter did.
> >> >
> >> > These updates may be much earlier than the point where the update
> >> counter
> >> > was updated, so we have to be able to identify the point where the first
> >> > update happened.
> >> >
> >> > The proposed approach includes:
> >> >
> >> > 1) preserve list of active txs, sorted by the time of their first update
> >> > (using WAL ptr of first WAL record in tx)
> >> >
> >> > 2) persist this list on each checkpoint (together with TxLog for
> >> example)
> >> >
> >> > 4) send whole active tx list (transactions which were in active state at
> >> > the time the node was crushed, empty list in case of graceful node
> >> stop) as
> >> > a part of partition demand message.
> >> >
> >> > 4) find a checkpoint where the earliest tx exists in persisted txs and
> >> use
> >> > saved WAL ptr as a start point or apply current approach in case the
> >> active
> >> > tx list (sent on previous step) is empty
> >> >
> >> > 5) start iteration.
> >> >
> >> > Your thoughts?
> >> >
> >> > Regards,
> >> > Igor
> >>
> >



-- 
Best regards,
Ivan Pavlukhin


Re: Historical rebalance

2018-11-27 Thread Seliverstov Igor
Vladimir,

I think I got your point,

It should work if we do the following:
introduce two structures: an active list (txs) and a candidate list (updCntr ->
txn pairs).

Track active txs, mapping them to the actual update counter at update time.
On each next update, put the update counter associated with the previous
update into the candidate list, possibly overwriting the existing value
(checking the txn).
On tx finish, remove the tx from the active list only if the appropriate
update counter (associated with the finished tx) is applied.
On update counter update, set the minimal update counter from the candidate
list as a back-counter, clear the candidate list and remove the associated
tx from the active list if present.
Use the back-counter instead of the actual update counter in the demand
message.
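A loose sketch of the two structures described above, as I read the proposal (plain Python; names and details are mine, not Ignite's):

```python
class BackCounterTracker:
    def __init__(self):
        self.current_uc = 0
        self.active = {}       # tx -> counter seen at its latest update
        self.candidates = {}   # counter of the tx's previous update -> tx
        self.back_counter = 0

    def on_tx_update(self, tx):
        if tx in self.active:
            # the counter associated with the tx's PREVIOUS update becomes
            # a candidate, possibly overwriting an existing entry
            self.candidates[self.active[tx]] = tx
        self.active[tx] = self.current_uc

    def on_counter_update(self, new_uc):
        self.current_uc = new_uc
        if self.candidates:
            # the minimal candidate becomes the back-counter; the candidate
            # list is cleared and the associated tx leaves the active list
            self.back_counter = min(self.candidates)
            self.active.pop(self.candidates[self.back_counter], None)
            self.candidates.clear()

t = BackCounterTracker()
t.on_counter_update(1)
t.on_tx_update("tx1")    # tx1's update, associated with counter 1
t.on_counter_update(2)
t.on_tx_update("tx1")    # previous counter (1) becomes a candidate
t.on_counter_update(3)   # back-counter = 1, candidate list cleared
print(t.back_counter)    # 1 - sent instead of the actual counter in the demand message
```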

Tue, 27 Nov 2018 at 12:56, Seliverstov Igor :

> Ivan,
>
> 1) The list is saved on each checkpoint, wholly (all transactions in
> active state at checkpoint begin).
> We need whole the list to get oldest transaction because after
> the previous oldest tx finishes, we need to get the following one.
>
> 2) I guess there is a description of how persistent storage works and how
> it restores [1]
>
> Vladimir,
>
> the whole list of what we going to store on checkpoint (updated):
> 1) Partition counter low watermark (LWM)
> 2) WAL pointer of earliest active transaction write to partition at the
> time the checkpoint have started
> 3) List of prepared txs with acquired partition counters (which were
> acquired but not applied yet)
>
> This way we don't need any additional info in demand message. Start point
> can be easily determined using stored WAL "back-pointer".
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
>
>
> вт, 27 нояб. 2018 г. в 11:19, Vladimir Ozerov :
>
>> Igor,
>>
>> Could you please elaborate - what is the whole set of information we are
>> going to save at checkpoint time? From what I understand this should be:
>> 1) List of active transactions with WAL pointers of their first writes
>> 2) List of prepared transactions with their update counters
>> 3) Partition counter low watermark (LWM) - the smallest partition counter
>> before which there are no prepared transactions.
>>
>> And the we send to supplier node a message: "Give me all updates starting
>> from that LWM plus data for that transactions which were active when I
>> failed".
>>
>> Am I right?
>>
>> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
>> wrote:
>>
>> > Hi Igniters,
>> >
>> > Currently I’m working on possible approaches to implementing historical
>> > rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>> >
>> > The main difficulty is that MVCC writes changes during the tx active phase
>> > while the partition update version, aka update counter, is applied on tx
>> > finish. This means we cannot start iterating over the WAL right from the
>> > pointer where the update counter was updated, but must include the updates
>> > made by the transaction that updated the counter.
>> >
>> > These updates may be much earlier than the point where the update counter
>> > was updated, so we have to be able to identify the point where the first
>> > update happened.
>> >
>> > The proposed approach includes:
>> >
>> > 1) preserve a list of active txs, sorted by the time of their first update
>> > (using the WAL ptr of the first WAL record in the tx)
>> >
>> > 2) persist this list on each checkpoint (together with the TxLog, for
>> > example)
>> >
>> > 3) send the whole active tx list (transactions which were in the active
>> > state at the time the node crashed, or an empty list in case of a graceful
>> > node stop) as a part of the partition demand message.
>> >
>> > 4) find a checkpoint where the earliest tx exists in the persisted txs and
>> > use the saved WAL ptr as a start point, or apply the current approach in
>> > case the active tx list (sent on the previous step) is empty
>> >
>> > 5) start iteration.
>> >
>> > Your thoughts?
>> >
>> > Regards,
>> > Igor
>>
>


Re: Historical rebalance

2018-11-27 Thread Seliverstov Igor
Ivan,

1) The list is saved on each checkpoint, wholly (all transactions in the active
state at checkpoint begin).
We need the whole list to get the oldest transaction, because after
the previous oldest tx finishes, we need to get the following one.

2) I guess there is a description of how the persistent storage works and how
it recovers [1]

Vladimir,

the whole list of what we are going to store on checkpoint (updated):
1) Partition counter low watermark (LWM)
2) WAL pointer of the earliest active transaction's write to the partition at
the time the checkpoint started
3) List of prepared txs with acquired partition counters (which were
acquired but not applied yet)

This way we don't need any additional info in demand message. Start point
can be easily determined using stored WAL "back-pointer".

[1]
https://cwiki.apache.org/confluence/display/IGNITE/Ignite+Persistent+Store+-+under+the+hood#IgnitePersistentStore-underthehood-LocalRecoveryProcess
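For illustration only, the three items above could be modeled like this (class, field, and method names are hypothetical, not Ignite's actual checkpoint internals; WAL positions are simplified to longs):

```java
import java.util.List;

// Hypothetical sketch of the per-partition metadata persisted at checkpoint,
// following the three-item list above. Names are illustrative only.
class CheckpointRebalanceInfo {
    /** 1) Partition counter low watermark (LWM). */
    final long partitionLwm;

    /** 2) WAL "back-pointer": position of the earliest active tx's first write
     *  to this partition when the checkpoint started (-1 if no tx was active). */
    final long earliestActiveTxWalPtr;

    /** 3) Prepared txs with partition counters acquired but not applied yet. */
    final List<Long> preparedTxCounters;

    CheckpointRebalanceInfo(long partitionLwm, long earliestActiveTxWalPtr,
        List<Long> preparedTxCounters) {
        this.partitionLwm = partitionLwm;
        this.earliestActiveTxWalPtr = earliestActiveTxWalPtr;
        this.preparedTxCounters = preparedTxCounters;
    }

    /** WAL position from which historical iteration must start: the
     *  back-pointer if any tx was active, otherwise the LWM position itself. */
    long iterationStart(long lwmWalPtr) {
        return earliestActiveTxWalPtr >= 0
            ? Math.min(earliestActiveTxWalPtr, lwmWalPtr)
            : lwmWalPtr;
    }
}
```

With this shape the demand message indeed needs no extra info: the start point falls out of the stored back-pointer alone.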


Tue, 27 Nov 2018 at 11:19, Vladimir Ozerov :

> Igor,
>
> Could you please elaborate - what is the whole set of information we are
> going to save at checkpoint time? From what I understand this should be:
> 1) List of active transactions with WAL pointers of their first writes
> 2) List of prepared transactions with their update counters
> 3) Partition counter low watermark (LWM) - the smallest partition counter
> before which there are no prepared transactions.
>
> And then we send the supplier node a message: "Give me all updates starting
> from that LWM plus data for those transactions which were active when I
> failed".
>
> Am I right?
>
> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
> wrote:
>
> > Hi Igniters,
> >
> > Currently I’m working on possible approaches to implementing historical
> > rebalance (delta rebalance using a WAL iterator) over MVCC caches.
> >
> > The main difficulty is that MVCC writes changes during the tx active phase
> > while the partition update version, aka update counter, is applied on tx
> > finish. This means we cannot start iterating over the WAL right from the
> > pointer where the update counter was updated, but must include the updates
> > made by the transaction that updated the counter.
> >
> > These updates may be much earlier than the point where the update counter
> > was updated, so we have to be able to identify the point where the first
> > update happened.
> >
> > The proposed approach includes:
> >
> > 1) preserve a list of active txs, sorted by the time of their first update
> > (using the WAL ptr of the first WAL record in the tx)
> >
> > 2) persist this list on each checkpoint (together with the TxLog, for
> > example)
> >
> > 3) send the whole active tx list (transactions which were in the active
> > state at the time the node crashed, or an empty list in case of a graceful
> > node stop) as a part of the partition demand message.
> >
> > 4) find a checkpoint where the earliest tx exists in the persisted txs and
> > use the saved WAL ptr as a start point, or apply the current approach in
> > case the active tx list (sent on the previous step) is empty
> >
> > 5) start iteration.
> >
> > Your thoughts?
> >
> > Regards,
> > Igor
>


Re: Historical rebalance

2018-11-27 Thread Vladimir Ozerov
Just thought of this a bit more. If we look for the start of a long-running
transaction in the WAL, we may go back too far into the past only to get a few
entries.

What if we consider a slightly different approach? We assume that a transaction
can be represented as a set of independent operations, which are applied in
the same order on both primary and backup nodes. Then we can do the
following:
1) When the next operation finishes, we assign the transaction the LWM of the
last checkpoint, i.e. we maintain a map [Txn -> last_op_LWM].
2) If the "last_op_LWM" of a transaction has not changed between two subsequent
checkpoints, we assign it the special value "UP_TO_DATE".

Now at checkpoint time we take the minimal value among the current partition
LWM and the active transactions' LWMs, ignoring "UP_TO_DATE" values. The
resulting value is the final partition counter which we will request from the
supplier node. We save it to the checkpoint record. When the WAL on the
demander is unwound from this value, it is guaranteed to contain all missing
data of the demander's active transactions.

I.e. instead of tracking the whole active transaction, we track the part of
the transaction which is possibly missing on a node.

Will that work?
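Under the assumptions above, the per-checkpoint computation could be sketched as follows (a hypothetical illustration, not Ignite code; "UP_TO_DATE" is modeled as a sentinel value):

```java
import java.util.Map;

// Illustrative sketch of picking the rebalance start counter at checkpoint
// time, per the scheme above. Class name and the sentinel are hypothetical.
class RebalanceStartCounter {
    /** Sentinel: the tx made no new writes between two subsequent checkpoints. */
    static final long UP_TO_DATE = -1L;

    /**
     * Minimum of the current partition LWM and all active transactions'
     * last-operation LWMs, ignoring UP_TO_DATE entries. The result is the
     * counter that would be requested from the supplier node.
     */
    static long compute(long partitionLwm, Map<String, Long> txLastOpLwm) {
        long min = partitionLwm;
        for (long lwm : txLastOpLwm.values()) {
            if (lwm != UP_TO_DATE)
                min = Math.min(min, lwm);
        }
        return min;
    }
}
```

Note how a long-running but quiescent transaction (all its operations already replicated, hence UP_TO_DATE) no longer drags the start point far into the past, which is exactly the improvement over tracking whole transactions.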


On Tue, Nov 27, 2018 at 11:19 AM Vladimir Ozerov 
wrote:

> Igor,
>
> Could you please elaborate - what is the whole set of information we are
> going to save at checkpoint time? From what I understand this should be:
> 1) List of active transactions with WAL pointers of their first writes
> 2) List of prepared transactions with their update counters
> 3) Partition counter low watermark (LWM) - the smallest partition counter
> before which there are no prepared transactions.
>
> And then we send the supplier node a message: "Give me all updates starting
> from that LWM plus data for those transactions which were active when I
> failed".
>
> Am I right?
>
> On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
> wrote:
>
>> Hi Igniters,
>>
>> Currently I’m working on possible approaches to implementing historical
>> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>>
>> The main difficulty is that MVCC writes changes during the tx active phase
>> while the partition update version, aka update counter, is applied on tx
>> finish. This means we cannot start iterating over the WAL right from the
>> pointer where the update counter was updated, but must include the updates
>> made by the transaction that updated the counter.
>>
>> These updates may be much earlier than the point where the update counter
>> was updated, so we have to be able to identify the point where the first
>> update happened.
>>
>> The proposed approach includes:
>>
>> 1) preserve a list of active txs, sorted by the time of their first update
>> (using the WAL ptr of the first WAL record in the tx)
>>
>> 2) persist this list on each checkpoint (together with the TxLog, for
>> example)
>>
>> 3) send the whole active tx list (transactions which were in the active
>> state at the time the node crashed, or an empty list in case of a graceful
>> node stop) as a part of the partition demand message.
>>
>> 4) find a checkpoint where the earliest tx exists in the persisted txs and
>> use the saved WAL ptr as a start point, or apply the current approach in
>> case the active tx list (sent on the previous step) is empty
>>
>> 5) start iteration.
>>
>> Your thoughts?
>>
>> Regards,
>> Igor
>
>


Re: Historical rebalance

2018-11-27 Thread Vladimir Ozerov
Igor,

Could you please elaborate - what is the whole set of information we are
going to save at checkpoint time? From what I understand this should be:
1) List of active transactions with WAL pointers of their first writes
2) List of prepared transactions with their update counters
3) Partition counter low watermark (LWM) - the smallest partition counter
before which there are no prepared transactions.

And then we send the supplier node a message: "Give me all updates starting
from that LWM plus data for those transactions which were active when I
failed".

Am I right?
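The demand message described above would then carry just two extra pieces of information. A hypothetical sketch (not Ignite's actual GridDhtPartitionDemandMessage; all names are illustrative):

```java
import java.util.Set;

// Hypothetical sketch of the extra payload a partition demand message would
// carry under this scheme: the LWM to replay from, plus the transactions that
// were active on the demander when it failed.
class HistoricalDemand {
    final long fromCounter;        // partition counter LWM to replay from
    final Set<String> activeTxIds; // txs whose data must also be included

    HistoricalDemand(long fromCounter, Set<String> activeTxIds) {
        this.fromCounter = fromCounter;
        this.activeTxIds = activeTxIds;
    }

    /** A supplier-side WAL update qualifies if it is at or past the LWM, or
     *  belongs to one of the demander's formerly active transactions. */
    boolean matches(long recordCounter, String recordTxId) {
        return recordCounter >= fromCounter || activeTxIds.contains(recordTxId);
    }
}
```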

On Fri, Nov 23, 2018 at 11:22 AM Seliverstov Igor 
wrote:

> Hi Igniters,
>
> Currently I’m working on possible approaches to implementing historical
> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>
> The main difficulty is that MVCC writes changes during the tx active phase
> while the partition update version, aka update counter, is applied on tx
> finish. This means we cannot start iterating over the WAL right from the
> pointer where the update counter was updated, but must include the updates
> made by the transaction that updated the counter.
>
> These updates may be much earlier than the point where the update counter
> was updated, so we have to be able to identify the point where the first
> update happened.
>
> The proposed approach includes:
>
> 1) preserve a list of active txs, sorted by the time of their first update
> (using the WAL ptr of the first WAL record in the tx)
>
> 2) persist this list on each checkpoint (together with the TxLog, for
> example)
>
> 3) send the whole active tx list (transactions which were in the active
> state at the time the node crashed, or an empty list in case of a graceful
> node stop) as a part of the partition demand message.
>
> 4) find a checkpoint where the earliest tx exists in the persisted txs and
> use the saved WAL ptr as a start point, or apply the current approach in
> case the active tx list (sent on the previous step) is empty
>
> 5) start iteration.
>
> Your thoughts?
>
> Regards,
> Igor


Re: Historical rebalance

2018-11-26 Thread Павлухин Иван
Igor,

Could you please clarify some points?

> 1) preserve a list of active txs, sorted by the time of their first update
> (using the WAL ptr of the first WAL record in the tx)

Is this list maintained per transaction or per checkpoint (or per
something else)? Why can't we track only the oldest active transaction
instead of the whole active list?

> 4) find a checkpoint where the earliest tx exists in the persisted txs and
> use the saved WAL ptr as a start point, or apply the current approach in
> case the active tx list (sent on the previous step) is empty

What is the base storage state on a demanding node to which we are
applying WAL records? I mean the state before applying WAL records.
Do we apply all records simply one by one, or do we filter out some of them?
Fri, 23 Nov 2018 at 11:22, Seliverstov Igor :
>
> Hi Igniters,
>
> Currently I’m working on possible approaches to implementing historical
> rebalance (delta rebalance using a WAL iterator) over MVCC caches.
>
> The main difficulty is that MVCC writes changes during the tx active phase
> while the partition update version, aka update counter, is applied on tx
> finish. This means we cannot start iterating over the WAL right from the
> pointer where the update counter was updated, but must include the updates
> made by the transaction that updated the counter.
>
> These updates may be much earlier than the point where the update counter
> was updated, so we have to be able to identify the point where the first
> update happened.
>
> The proposed approach includes:
>
> 1) preserve a list of active txs, sorted by the time of their first update
> (using the WAL ptr of the first WAL record in the tx)
>
> 2) persist this list on each checkpoint (together with the TxLog, for
> example)
>
> 3) send the whole active tx list (transactions which were in the active
> state at the time the node crashed, or an empty list in case of a graceful
> node stop) as a part of the partition demand message.
>
> 4) find a checkpoint where the earliest tx exists in the persisted txs and
> use the saved WAL ptr as a start point, or apply the current approach in
> case the active tx list (sent on the previous step) is empty
>
> 5) start iteration.
>
> Your thoughts?
>
> Regards,
> Igor



-- 
Best regards,
Ivan Pavlukhin


Historical rebalance

2018-11-23 Thread Seliverstov Igor
Hi Igniters,

Currently I’m working on possible approaches to implementing historical
rebalance (delta rebalance using a WAL iterator) over MVCC caches.

The main difficulty is that MVCC writes changes during the tx active phase
while the partition update version, aka update counter, is applied on tx
finish. This means we cannot start iterating over the WAL right from the
pointer where the update counter was updated, but must include the updates
made by the transaction that updated the counter.

These updates may be much earlier than the point where the update counter was
updated, so we have to be able to identify the point where the first update
happened.

The proposed approach includes:

1) preserve a list of active txs, sorted by the time of their first update
(using the WAL ptr of the first WAL record in the tx)

2) persist this list on each checkpoint (together with the TxLog, for example)

3) send the whole active tx list (transactions which were in the active state
at the time the node crashed, or an empty list in case of a graceful node
stop) as a part of the partition demand message.

4) find a checkpoint where the earliest tx exists in the persisted txs and use
the saved WAL ptr as a start point, or apply the current approach in case the
active tx list (sent on the previous step) is empty

5) start iteration.

Your thoughts?

Regards,
Igor
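Step 4 of the proposal (locating the checkpoint that covers the earliest demanded transaction and taking its saved WAL pointer) could be sketched as follows. This is an illustration under assumed types, not Ignite internals; tx ids and WAL positions are simplified to strings and longs:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of step 4): walk persisted checkpoints from newest to
// oldest, find the first one whose active-tx list contains the earliest tx
// demanded by the rejoining node, and use that tx's saved first-record WAL
// pointer as the historical-rebalance start point. All names illustrative.
class StartPointLookup {
    /**
     * @param checkpoints per-checkpoint persisted map of active tx id ->
     *                    WAL pointer of the tx's first record, newest first.
     * @param earliestDemandedTx earliest tx active on the demander when it
     *                    crashed, or null for a graceful stop (empty list).
     * @return start WAL pointer, or -1 meaning "apply the current approach"
     *         (graceful stop, or no checkpoint covers the tx).
     */
    static long findStart(List<Map<String, Long>> checkpoints, String earliestDemandedTx) {
        if (earliestDemandedTx == null)
            return -1; // empty active list: fall back to the counter-based approach
        for (Map<String, Long> activeTxs : checkpoints) {
            Long ptr = activeTxs.get(earliestDemandedTx);
            if (ptr != null)
                return ptr; // start WAL iteration from the tx's first record
        }
        return -1; // history too short: fall back (e.g. full rebalance)
    }
}
```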

[jira] [Created] (IGNITE-10117) Node is mistakenly excluded from history suppliers preventing historical rebalance.

2018-11-01 Thread Alexei Scherbakov (JIRA)
Alexei Scherbakov created IGNITE-10117:
--

 Summary: Node is mistakenly excluded from history suppliers 
preventing historical rebalance.
 Key: IGNITE-10117
 URL: https://issues.apache.org/jira/browse/IGNITE-10117
 Project: Ignite
  Issue Type: Bug
Reporter: Alexei Scherbakov
 Fix For: 2.8


This is because 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#reserveHistoryForExchange
 is called before 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager#beforeExchange,
 which restores correct partition state.

{noformat}
public void testHistory() throws Exception {
    IgniteEx crd = startGrid(0);
    startGrid(1);

    crd.cluster().active(true);

    awaitPartitionMapExchange();

    int part = 0;

    List keys = loadDataToPartition(part, DEFAULT_CACHE_NAME, 100, 0, 1);

    forceCheckpoint(); // Prevent IGNITE-10088

    stopAllGrids();

    awaitPartitionMapExchange();

    List keys1 = loadDataToPartition(part, DEFAULT_CACHE_NAME, 100, 100, 1);

    startGrid(0);
    startGrid(1);

    awaitPartitionMapExchange(); // grid0 will not provide history.
}
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-9827) Assertion error due historical rebalance

2018-10-09 Thread ARomantsov (JIRA)
ARomantsov created IGNITE-9827:
--

 Summary: Assertion error due historical rebalance
 Key: IGNITE-9827
 URL: https://issues.apache.org/jira/browse/IGNITE-9827
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Affects Versions: 2.7
Reporter: ARomantsov
 Fix For: 2.8


I hit the following situation:

1) Start two nodes with '-DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=0'
2) Preload
3) Stop node 2
4) Load
5) Corrupt all WAL archive files on node 1
6) Start node 2

And found an assertion error in the coordinator log - is it OK?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (IGNITE-8946) AssertionError can occur during reservation of WAL history for historical rebalance

2018-07-05 Thread Ivan Rakov (JIRA)
Ivan Rakov created IGNITE-8946:
--

 Summary: AssertionError can occur during reservation of WAL 
history for historical rebalance
 Key: IGNITE-8946
 URL: https://issues.apache.org/jira/browse/IGNITE-8946
 Project: Ignite
  Issue Type: Bug
Reporter: Ivan Rakov


Attempt to release WAL after exchange may fail with AssertionError. Seems like 
we have a bug and may try to release more WAL segments than we have reserved:
{noformat}
java.lang.AssertionError: null
    at org.apache.ignite.internal.processors.cache.persistence.wal.SegmentReservationStorage.release(SegmentReservationStorage.java:54)
    - locked <0x1c12> (a org.apache.ignite.internal.processors.cache.persistence.wal.SegmentReservationStorage)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.release(FileWriteAheadLogManager.java:862)
    at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.releaseHistoryForExchange(GridCacheDatabaseSharedManager.java:1691)
    - locked <0x1c17> (a org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:1751)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:2858)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:2591)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.processSingleMessage(GridDhtPartitionsExchangeFuture.java:2283)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$100(GridDhtPartitionsExchangeFuture.java:129)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2140)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2.apply(GridDhtPartitionsExchangeFuture.java:2128)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:383)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:353)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onReceiveSingleMessage(GridDhtPartitionsExchangeFuture.java:2128)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.processSinglePartitionUpdate(GridCachePartitionExchangeManager.java:1580)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.access$1000(GridCachePartitionExchangeManager.java:138)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:345)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$2.onMessage(GridCachePartitionExchangeManager.java:325)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2848)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$MessageHandler.apply(GridCachePartitionExchangeManager.java:2827)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManag

[GitHub] ignite pull request #3791: IGNITE-8116 Historical rebalance fixes

2018-05-18 Thread Jokser
Github user Jokser closed the pull request at:

https://github.com/apache/ignite/pull/3791


---


[jira] [Created] (IGNITE-8390) WAL historical rebalance is not able to process cache.remove() updates

2018-04-25 Thread Pavel Kovalenko (JIRA)
Pavel Kovalenko created IGNITE-8390:
---

 Summary: WAL historical rebalance is not able to process 
cache.remove() updates
 Key: IGNITE-8390
 URL: https://issues.apache.org/jira/browse/IGNITE-8390
 Project: Ignite
  Issue Type: Bug
  Components: cache
Affects Versions: 2.4
Reporter: Pavel Kovalenko
Assignee: Pavel Kovalenko


WAL historical rebalance fails on the supplier when processing an entry remove,
with the following assertion:

{noformat}
java.lang.AssertionError: GridCacheEntryInfo [key=KeyCacheObjectImpl [part=-1, val=2, hasValBytes=true], cacheId=94416770, val=null, ttl=0, expireTime=0, ver=GridCacheVersion [topVer=136155335, order=1524675346187, nodeOrder=1], isNew=false, deleted=false]
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplyMessage.addEntry0(GridDhtPartitionSupplyMessage.java:220)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:381)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
    at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
    at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

Obviously this assertion will work correctly only for full rebalance. We should
either soften the assertion for the historical rebalance case or disable it.
With the assertion disabled everything works well and rebalance finishes
properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] ignite pull request #3791: IGNITE-8116 Historical rebalance fixes

2018-04-10 Thread Jokser
GitHub user Jokser opened a pull request:

https://github.com/apache/ignite/pull/3791

IGNITE-8116 Historical rebalance fixes



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gridgain/apache-ignite ignite-8116-8122

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/ignite/pull/3791.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3791


commit 604ee719b304d0b4cf4caabaa6fa16b5a980e04e
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-04T09:33:10Z

IGNITE-8122 Restore partition state to OWNING if unable to read from page 
memory.

commit c43c598916bd9bda2f624a214cefcc28b0380a5c
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-05T14:26:53Z

IGNITE-8122 Restore partitions when persistence is enabled with OWNING 
default state.

commit a061cdba3a46f5384910fecf89366101f904ccdf
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-05T14:50:20Z

IGNITE-8122 Move OWN logic to GridDhtLocalPartition constructor.

commit 9259407a462633a68de6dcd7d3ae135c8c7c0b37
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-05T17:15:39Z

IGNITE-8122 Docs.

commit 20979dc4874cfeca649b29d8827d522f3e57ee67
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-05T17:16:54Z

Merge branch 'master' into ignite-8122

commit 791ef918335e66d274a10c2b5d21c9da1c212a0b
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-06T12:53:21Z

IGNITE-8122 Restore partition in OWNING state correctly.

commit 56bdb20513731f49bf3a2b51f672396efc16bfe1
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-09T11:42:59Z

IGNITE-8122 Restore partition states on before exchange in case of starting 
cache group.

commit dcea5b47a5a08f165930a2e0235ecf51385f4997
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-09T11:55:53Z

IGNITE-8122 Fixed test with auto-activation.

commit 915788c7ea25f04ccacfa06e1f19c06c92f7c141
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-09T15:32:59Z

IGNITE-8116 WIP

commit d224cd5663415dc8636f6ba01bfd5002fdb35a3e
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-10T10:59:20Z

Merge branch 'ignite-8122' into ignite-8116-8122

commit e988272e57760ee44753314ad8db2adab344a6c0
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-10T11:47:04Z

IGNITE-8116 WIP

commit bd53648935b24fb2c68a885db81656f2d960ba1e
Author: Pavel Kovalenko <jokserfn@...>
Date:   2018-04-10T18:09:32Z

IGNITE-8116 WAL rebalance issues.




---