Re: Shutdown policy refactoring

Stanislav Lukyanov Sun, 30 Oct 2022 11:46:54 -0700

Hi Aleksandr,

Thanks for driving this. Someone should put a stop to the shutdown behavior 
nonsense :)


But let me complicate things a bit for you.

I see multiple dimensions that are in play when node shutdown happens:

- Running operations (Compute, SQL, Cache): interrupt or wait for completion? 
If we wait r the for how long?
- Persistence checkpoint: just shutdown? wait for completion if running? 
trigger and wait for completion? If we wait then for how long?
- Partition loss: just shutdown? wait for more copies to avoid partition loss? 
is behavior the same for in-mem and persistence? is behavior the same for 
backups=0?
- (anything else?)

Ideally, we should first analyse each of these three dimensions and understand 
what are the per-dimensions behaviours we can come up with - what is reasonable.
Then, we can try combine the dimensions: introduce an enum of policies that 
picks one behavior for each dimension.
I guess, it should be a table like

Policy           | Operations   | Checkpoint                     | Partition 
Loss
IMMEDIATE | Cancel         | Cancel                            | Ignore
GRACEFUL | Wait forever | Trigger and wait forever | Wait forever

This table should be a part of the Javadoc and the website docs.

On the other hand, you might want to separate the policies and control the 
running operations, the checkpoint, and the rebalance/partition loss 
separately. But then it won't solve the initial mess probably?

Thoughts?

Thanks,
Stan

> On 25 Oct 2022, at 17:17, Aleksandr Polovtsev <[email protected]> wrote:
> 
> Guys, any more feedback?
> 
> On Tue, Oct 18, 2022 at 12:42 PM Aleksandr Polovtsev <
> [email protected]> wrote:
> 
>> My guess is that it reduces the node shutdown time. We may want to keep it
>> for compatibility purposes, because I'm afraid that changing the default
>> behaviour is not that easy.
>> 
>> On Tue, Oct 18, 2022 at 12:32 PM Николай Ижиков <[email protected]>
>> wrote:
>> 
>>> So why it’s default behavior?
>>> And why we want keep it?
>>> 
>>>> 18 окт. 2022 г., в 12:27, Aleksandr Polovtsev <[email protected]>
>>> написал(а):
>>>> 
>>>> But this is the default behaviour right now, do you mean we should
>>> change
>>>> it? I'm ok with that if it will not break something
>>>> 
>>>> On Tue, Oct 18, 2022 at 12:13 PM Николай Ижиков <[email protected]>
>>> wrote:
>>>> 
>>>>>> I don't know what most people use in the real world, maybe more
>>>>> experienced guys can clarify that.
>>>>> 
>>>>> My question is different.
>>>>> 
>>>>> Why we may want to provide cancel=true?
>>>>> Is there any reason to support this in Ignite?
>>>>> 
>>>>>> 18 окт. 2022 г., в 11:47, Aleksandr Polovtsev <
>>> [email protected]>
>>>>> написал(а):
>>>>>> 
>>>>>> Thank you for the response, Nikolay. As I can see, this is the default
>>>>>> behaviour now (ShutdownPolicy = IMMEDIATE, cancel = true). I don't
>>> know
>>>>>> what most people use in the real world, maybe more experienced guys
>>> can
>>>>>> clarify that.
>>>>>> 
>>>>>> On Tue, Oct 18, 2022 at 11:15 AM Николай Ижиков <[email protected]>
>>>>> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> Why do we need «immediate»?
>>>>>>> Is there real-world scenarios for it?
>>>>>>> 
>>>>>>> Can we do graceful or «save all possible data» shutdown always?
>>>>>>> 
>>>>>>>> 18 окт. 2022 г., в 10:59, Aleksandr Polovtsev <
>>> [email protected]
>>>>>> 
>>>>>>> написал(а):
>>>>>>>> 
>>>>>>>> Hello, dear Igniters!
>>>>>>>> 
>>>>>>>> I'd like to propose a refactoring of the current Shutdown Policy
>>>>>>> mechanism.
>>>>>>>> Currently, node shutdown is controlled by a bunch of flags and
>>> enums (I
>>>>>>> may
>>>>>>>> be missing some of them):
>>>>>>>> 
>>>>>>>> 1. ShutdownPolicy enum has only two entries and is basically a flag
>>>>> that
>>>>>>>> dictates if we need to wait for the data to be backed up, so that we
>>>>> will
>>>>>>>> not lose any data when the node goes offline.
>>>>>>>> 2. "cancel" flag is passed to the stop method and indicates that all
>>>>> long
>>>>>>>> running jobs should be interrupted. For example, it prevents the
>>> node
>>>>>>>> performing a checkpoint on shutdown. I'm pretty sure this flag is
>>>>> nearly
>>>>>>>> always set to "true", because it is used by the Shutdown Hook.
>>>>>>>> 3. There are also a bunch of system properties that affect the
>>>>> shutdown:
>>>>>>>> * IGNITE_PDS_SKIP_CHECKPOINT_ON_NODE_STOP - disables checkpoints on
>>>>>>>> node stop, which heavily correlates with the "cancel" flag.
>>>>>>>> * IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN - deprecated and duplicates
>>> the
>>>>>>>> ShutdownPolicy enum
>>>>>>>> * IGNITE_NO_SHUTDOWN_HOOK - disables the shutdown hook. I don't
>>> know
>>>>>>> if
>>>>>>>> it is really useful.
>>>>>>>> 
>>>>>>>> I think that this approach is counter intuitive and would like to
>>>>>>>> incorporate all aforementioned flags into the ShutdownPolicy enum.
>>> For
>>>>>>>> example, it can look as follows:
>>>>>>>> 
>>>>>>>> public enum ShutdownPolicy {
>>>>>>>> IMMEDIATE // Stops the node and cancels all running jobs
>>>>>>>> GRACEFUL // Stops the node, does not cancel running jobs and waits
>>>>> for
>>>>>>>> backups
>>>>>>>> GRACEFUL_NO_BACKUPS // Stops the node, does not cancel running
>>> jobs,
>>>>>>>> does not wait for backups
>>>>>>>> }
>>>>>>>> 
>>>>>>>> After that, the "cancel" flag and all corresponding properties can
>>> be
>>>>>>>> removed (apart from the IGNITE_NO_SHUTDOWN_HOOK, if it is still
>>>>> needed).
>>>>>>>> 
>>>>>>>> Does this make sense? What do you think?
>>>>>>>> 
>>>>>>>> --
>>>>>>>> With regards,
>>>>>>>> Aleksandr Polovtsev
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> With regards,
>>>>>> Aleksandr Polovtsev
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> With regards,
>>>> Aleksandr Polovtsev
>>> 
>>> 
>> 
>> --
>> With regards,
>> Aleksandr Polovtsev
>> 
> 
> 
> -- 
> With regards,
> Aleksandr Polovtsev

Re: Shutdown policy refactoring

Reply via email to