Hi Aleksandr, Thanks for driving this. Someone should put a stop to the shutdown behavior nonsense :)
But let me complicate things a bit for you. I see multiple dimensions that are in play when node shutdown happens: - Running operations (Compute, SQL, Cache): interrupt or wait for completion? If we wait r the for how long? - Persistence checkpoint: just shutdown? wait for completion if running? trigger and wait for completion? If we wait then for how long? - Partition loss: just shutdown? wait for more copies to avoid partition loss? is behavior the same for in-mem and persistence? is behavior the same for backups=0? - (anything else?) Ideally, we should first analyse each of these three dimensions and understand what are the per-dimensions behaviours we can come up with - what is reasonable. Then, we can try combine the dimensions: introduce an enum of policies that picks one behavior for each dimension. I guess, it should be a table like Policy | Operations | Checkpoint | Partition Loss IMMEDIATE | Cancel | Cancel | Ignore GRACEFUL | Wait forever | Trigger and wait forever | Wait forever This table should be a part of the Javadoc and the website docs. On the other hand, you might want to separate the policies and control the running operations, the checkpoint, and the rebalance/partition loss separately. But then it won't solve the initial mess probably? Thoughts? Thanks, Stan > On 25 Oct 2022, at 17:17, Aleksandr Polovtsev <alexpolovt...@gmail.com> wrote: > > Guys, any more feedback? > > On Tue, Oct 18, 2022 at 12:42 PM Aleksandr Polovtsev < > alexpolovt...@gmail.com> wrote: > >> My guess is that it reduces the node shutdown time. We may want to keep it >> for compatibility purposes, because I'm afraid that changing the default >> behaviour is not that easy. >> >> On Tue, Oct 18, 2022 at 12:32 PM Николай Ижиков <nizhi...@apache.org> >> wrote: >> >>> So why it’s default behavior? >>> And why we want keep it? >>> >>>> 18 окт. 2022 г., в 12:27, Aleksandr Polovtsev <alexpolovt...@gmail.com> >>> написал(а): >>>> >>>> But this is the default behaviour right now, do you mean we should >>> change >>>> it? I'm ok with that if it will not break something >>>> >>>> On Tue, Oct 18, 2022 at 12:13 PM Николай Ижиков <nizhi...@apache.org> >>> wrote: >>>> >>>>>> I don't know what most people use in the real world, maybe more >>>>> experienced guys can clarify that. >>>>> >>>>> My question is different. >>>>> >>>>> Why we may want to provide cancel=true? >>>>> Is there any reason to support this in Ignite? >>>>> >>>>>> 18 окт. 2022 г., в 11:47, Aleksandr Polovtsev < >>> alexpolovt...@gmail.com> >>>>> написал(а): >>>>>> >>>>>> Thank you for the response, Nikolay. As I can see, this is the default >>>>>> behaviour now (ShutdownPolicy = IMMEDIATE, cancel = true). I don't >>> know >>>>>> what most people use in the real world, maybe more experienced guys >>> can >>>>>> clarify that. >>>>>> >>>>>> On Tue, Oct 18, 2022 at 11:15 AM Николай Ижиков <nizhi...@apache.org> >>>>> wrote: >>>>>> >>>>>>> Hello, >>>>>>> >>>>>>> Why do we need «immediate»? >>>>>>> Is there real-world scenarios for it? >>>>>>> >>>>>>> Can we do graceful or «save all possible data» shutdown always? >>>>>>> >>>>>>>> 18 окт. 2022 г., в 10:59, Aleksandr Polovtsev < >>> alexpolovt...@gmail.com >>>>>> >>>>>>> написал(а): >>>>>>>> >>>>>>>> Hello, dear Igniters! >>>>>>>> >>>>>>>> I'd like to propose a refactoring of the current Shutdown Policy >>>>>>> mechanism. >>>>>>>> Currently, node shutdown is controlled by a bunch of flags and >>> enums (I >>>>>>> may >>>>>>>> be missing some of them): >>>>>>>> >>>>>>>> 1. ShutdownPolicy enum has only two entries and is basically a flag >>>>> that >>>>>>>> dictates if we need to wait for the data to be backed up, so that we >>>>> will >>>>>>>> not lose any data when the node goes offline. >>>>>>>> 2. "cancel" flag is passed to the stop method and indicates that all >>>>> long >>>>>>>> running jobs should be interrupted. For example, it prevents the >>> node >>>>>>>> performing a checkpoint on shutdown. I'm pretty sure this flag is >>>>> nearly >>>>>>>> always set to "true", because it is used by the Shutdown Hook. >>>>>>>> 3. There are also a bunch of system properties that affect the >>>>> shutdown: >>>>>>>> * IGNITE_PDS_SKIP_CHECKPOINT_ON_NODE_STOP - disables checkpoints on >>>>>>>> node stop, which heavily correlates with the "cancel" flag. >>>>>>>> * IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN - deprecated and duplicates >>> the >>>>>>>> ShutdownPolicy enum >>>>>>>> * IGNITE_NO_SHUTDOWN_HOOK - disables the shutdown hook. I don't >>> know >>>>>>> if >>>>>>>> it is really useful. >>>>>>>> >>>>>>>> I think that this approach is counter intuitive and would like to >>>>>>>> incorporate all aforementioned flags into the ShutdownPolicy enum. >>> For >>>>>>>> example, it can look as follows: >>>>>>>> >>>>>>>> public enum ShutdownPolicy { >>>>>>>> IMMEDIATE // Stops the node and cancels all running jobs >>>>>>>> GRACEFUL // Stops the node, does not cancel running jobs and waits >>>>> for >>>>>>>> backups >>>>>>>> GRACEFUL_NO_BACKUPS // Stops the node, does not cancel running >>> jobs, >>>>>>>> does not wait for backups >>>>>>>> } >>>>>>>> >>>>>>>> After that, the "cancel" flag and all corresponding properties can >>> be >>>>>>>> removed (apart from the IGNITE_NO_SHUTDOWN_HOOK, if it is still >>>>> needed). >>>>>>>> >>>>>>>> Does this make sense? What do you think? >>>>>>>> >>>>>>>> -- >>>>>>>> With regards, >>>>>>>> Aleksandr Polovtsev >>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> With regards, >>>>>> Aleksandr Polovtsev >>>>> >>>>> >>>> >>>> -- >>>> With regards, >>>> Aleksandr Polovtsev >>> >>> >> >> -- >> With regards, >> Aleksandr Polovtsev >> > > > -- > With regards, > Aleksandr Polovtsev