bug#53580: shepherd's architecture

2023-06-08 Thread Attila Lendvai
> Sorry to be direct: is there a concrete bug you’re reporting here?


i didn't pay careful enough attention to report something specific, but one 
thing comes to mind:

when i'm working on my service code, which is `guix pull`-ed in from my channel, 
then after a reconfigure i seem to have to reboot for my new code to get 
activated. a simple `herd restart` on the service didn't seem to be enough, 
i.e. the guile modules that my service code is using did not get reloaded into 
the PID 1 guile.

keep in mind that this is a non-trivial service that e.g. spawns a long-lived 
fiber to talk to the daemon through its stdio while the daemon is running. IOW, 
its start GEXP is not just a simple fork/exec, but something more complex that 
uses functions from guile modules that should be reloaded into PID 1 when the 
new version of the service is to be started.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“The unexamined life is not worth living for a human being.”
— Socrates (c. 470–399 BC, tried and executed), 'Apology' (399 BC)






bug#53580: shepherd's architecture

2023-06-08 Thread Csepp


Ludovic Courtès writes:

> Hi Attila,
>
> Attila Lendvai skribis:
>
>> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise 
>> functional system]
>>
>>> So I think we’re mostly okay now. The one thing we could do is load
>>> the whole config file in a separate fiber, and maybe it’s fine to keep
>>> going even when there’s an error during config file evaluation?
>>>
>>> WDYT?
>>
>>
>> i think there's a fundamental issue to be resolved here, and
>> addressing that would implicitly resolve the entire class of issues
>> that this one belongs to.
>>
>> guile (shepherd) is run as the init process, and because of that it
>> may not exit or be respawned. but at the same time, when we reconfigure
>> a guix system, shepherd's config should not only be reloaded,
>> but its internal state merged with the new config, and potentially
>> even with an evolved shepherd codebase.
>
> Sorry to be direct: is there a concrete bug you’re reporting here?
>
>> i still lack a proper mental model of all this to successfully
>> predict what will happen when i `guix system reconfigure` after i
>> `guix pull`-ed my service code, and/or changed the config of my
>> services.
>
> What happens is that ‘guix system reconfigure’ loads new services into
> the running shepherd.  New services simply get started; services for
> which a same-named service is already running instead get registered as
> a “replacement”, meaning that the new version of the service only gets
> started when the user explicitly runs ‘herd restart SERVICE’.
>
> Non-stop upgrades are ideal, but shepherd alone cannot provide them.
> For instance, nginx supports them, and no init system could implement
> that on its behalf.
>
> Ludo’.

Do services get a reference to their previously running version?
The Minix project was experimenting with supporting something like
supervisor trees for high uptime, and one way they were trying to
achieve that was by giving services access to the memory of their
previous version, so they could read its state and migrate it into
their own memory.





bug#53580: shepherd's architecture

2023-06-06 Thread Ludovic Courtès
Hi Attila,

Attila Lendvai skribis:

> [forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise 
> functional system]
>
>> So I think we’re mostly okay now. The one thing we could do is load
>> the whole config file in a separate fiber, and maybe it’s fine to keep
>> going even when there’s an error during config file evaluation?
>>
>> WDYT?
>
>
> i think there's a fundamental issue to be resolved here, and addressing that 
> would implicitly resolve the entire class of issues that this one belongs to.
>
> guile (shepherd) is run as the init process, and because of that it may not 
> exit or be respawned. but at the same time, when we reconfigure a guix system, 
> shepherd's config should not only be reloaded, but its internal state 
> merged with the new config, and potentially even with an evolved shepherd 
> codebase.

Sorry to be direct: is there a concrete bug you’re reporting here?

> i still lack a proper mental model of all this to successfully predict what 
> will happen when i `guix system reconfigure` after i `guix pull`-ed my 
> service code, and/or changed the config of my services.

What happens is that ‘guix system reconfigure’ loads new services into
the running shepherd.  New services simply get started; services for
which a same-named service is already running instead get registered as
a “replacement”, meaning that the new version of the service only gets
started when the user explicitly runs ‘herd restart SERVICE’.
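
To make the mechanics concrete, here is a rough sketch of that logic in
Guile (record and procedure names are made up for illustration; the real
code in shepherd is more involved):

  (use-modules (srfi srfi-9))

  ;; A service with a slot for a pending replacement.
  (define-record-type <svc>
    (make-svc name start replacement)
    svc?
    (name        svc-name)
    (start       svc-start)
    (replacement svc-replacement set-svc-replacement!))

  ;; Reconfiguring while a same-named service runs: the new version
  ;; is merely stashed away...
  (define (register-replacement! running new)
    (set-svc-replacement! running new))

  ;; ...and 'herd restart SERVICE' is what makes it take over.
  (define (restart! svc)
    (let ((next (or (svc-replacement svc) svc)))
      ((svc-start next))
      next))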

Non-stop upgrades are ideal, but shepherd alone cannot provide them.
For instance, nginx supports them, and no init system could implement
that on its behalf.

Ludo’.





bug#53580: shepherd's architecture

2023-05-29 Thread Felix Lechner via Bug reports for GNU Guix
Hi Brian,

On Mon, May 29, 2023 at 8:02 AM Brian Cully via Development of GNU
Guix and the GNU System distribution wrote:
>
> Erlang has had hot code reloading for decades

Thank you for that pointer! I also had Erlang on my mind while reading
Attila's message.

> Lisp Flavoured Erlang exists if you want that syntax. There
> would definitely be advantages to writing an init (and, indeed,
> any service that needs 100% uptime) on top of the Erlang virtual
> machine.

“Twenty years from now you will be more disappointed by the things
that you didn't do than by the ones you did do. So throw off the
bowlines. Sail away from the safe harbor. Catch the trade winds in
your sails. Explore. Dream. Discover.” --- H. Jackson Brown Jr in
"P.S. I Love You"

Kind regards
Felix





bug#53580: shepherd's architecture

2023-05-29 Thread Brian Cully via Bug reports for GNU Guix



Attila Lendvai writes:

> it doesn't seem to be an insurmountable task to make sure that guile
> can safely unlink a module from its heap, check if there are any
> references into the module to be dropped, and then reload this module
> from disk.
>
> the already running fibers would keep the required code in the heap
> until after they are stopped/restarted. then the module would get GC'd
> eventually.
>
> this would help solve the problem that a reconfigured service may have
> a completely different start/stop code. and by taking some careful
> shortcuts we may be able to make reloading work without having to stop
> the service process in question.


Erlang has had hot code reloading for decades, built around the 
needs of 100% uptime systems. The problem is more complex than it 
often appears to people who are used to how lisps traditionally do 
it. I strongly recommend reading up on Erlang's migration 
system. Briefly: you can't just swap out function definitions, 
because they rely on non-function state which needs to be migrated 
along with the function itself, and you can't do it whenever you 
want, because external actors may be relying on a view of the 
internal state. To accomplish this, Erlang has a lot of machinery, 
and it fits into the core design of the language and runtime in 
ways that would be extremely difficult to port over to non-Erlang 
languages. Doing it in Scheme is probably possible in an academic 
sense, but not in a practical one.
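
A toy Guile example of the first pitfall (only the simplest case, and 
nothing like Erlang's actual machinery):

  ;; State captured by already-running code survives a "hot swap"
  ;; of the definition that created it.
  (define (make-counter)
    (let ((n 0))                     ; non-function state
      (lambda () (set! n (+ n 1)) n)))

  (define tick (make-counter))
  (tick)                             ; => 1

  ;; Redefine MAKE-COUNTER; new counters would now start at 100...
  (define (make-counter)
    (let ((n 100))
      (lambda () (set! n (+ n 1)) n)))

  (tick)                             ; => 2: TICK still runs the old
                                     ; code on the old state; nothing
                                     ; was migrated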


OTOH, Lisp Flavoured Erlang exists if you want that syntax. There 
would definitely be advantages to writing an init (and, indeed, 
any service that needs 100% uptime) on top of the Erlang virtual 
machine. But going the other way, by porting Erlang's 
functionality into Scheme, is going to be a wash.


> in this setup most of the complexity and the evolution of the shepherd
> codebase would happen in the runner, and the other two parts could be
> kept minimal and would rarely need to change (and thus require a
> reboot).


Accepting that dramatic enough changes to PID 1 are going to 
require a reboot seems reasonable to me. They should be even more 
rare than kernel updates, and we accept rebooting there already.


-bjc





bug#53580: shepherd's architecture

2023-05-28 Thread Attila Lendvai
[resending to include the guix-devel list. apologies to everyone who receives 
this mail twice!]

--

[forked from: bug#53580: /var/run/shepherd/socket is missing on an otherwise 
functional system]


> So I think we’re mostly okay now. The one thing we could do is load
> the whole config file in a separate fiber, and maybe it’s fine to keep
> going even when there’s an error during config file evaluation?
> 
> WDYT?


i think there's a fundamental issue to be resolved here, and addressing that 
would implicitly resolve the entire class of issues that this one belongs to.

guile (shepherd) is run as the init process, and because of that it may not 
exit or be respawned. but at the same time, when we reconfigure a guix system, 
shepherd's config should not only be reloaded, but its internal state 
merged with the new config, and potentially even with an evolved shepherd 
codebase.

i still lack a proper mental model of all this to successfully predict what will 
happen when i `guix system reconfigure` after i `guix pull`-ed my service code, 
and/or changed the config of my services.



this problem of migration is pretty much a CS research topic...

ideally, there should be a non-shepherd-specific protocol defined for such 
migrations, and the new shepherd codebase could migrate its state from the old 
to the new, with most of the migration code being automatic. some of it must be 
hand-written, as required by semantic changes.

even more ideally, we should have reflexive systems; admit that source code is a 
graph, and store it as one (as opposed to a string of characters); and our 
systems should have orthogonal persistence, etc, etc... a far cry from what we 
have now.

Fare's excellent blog has some visionary thoughts on this, especially in:

https://ngnghm.github.io/blog/2015/09/08/chapter-5-non-stop-change/

but given that we will not have these any time soon... what can we do now?



note: what follows are wild ideas, and i'm not sure i have the necessary 
understanding of the involved subsystems to properly judge their feasibility... 
so take them with a pinch of salt.

idea 1


it doesn't seem to be an insurmountable task to make sure that guile can safely 
unlink a module from its heap, check if there are any references into the 
module to be dropped, and then reload this module from disk.

the already running fibers would keep the required code in the heap until after 
they are stopped/restarted. then the module would get GC'd eventually.

this would help solve the problem that a reconfigured service may have a 
completely different start/stop code. and by taking some careful shortcuts we 
may be able to make reloading work without having to stop the service process 
in question.
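
for what it's worth, the reload half of this already exists in guile; it's 
the safe-unlink/reference-check part that is missing. a sketch of the 
existing part (the module name is made up):

  ;; guile can re-evaluate a module's source file in place; what it
  ;; cannot do (yet) is prove that nothing still references the old
  ;; definitions.
  (define (try-reload! name)         ; e.g. '(my-channel services foo)
    (reload-module (resolve-module name)))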

idea 2


another, probably better idea:

split up shepherd's codebase into isolated parts:

1) the init process

2) the service runners, which are spawned by 1). let's call this part
'the runner'.

3) the CLI scripts that implement stuff like `reboot` by sending a
message to 1).

the runner would spawn and manage the actual daemon binaries/processes.

the init process would communicate with the runners through a channel/pipe that 
is created when the runners are spawned. i.e. here we wouldn't need an IPC socket 
file like we do for the communication between the scripts and the init 
process.

AFAIU the internal structure of shepherd is already turning into something like 
this with the use of fibers and channels. i suspect Ludo has something like 
this on his mind already.

in this setup most of the complexity and the evolution of the shepherd codebase 
would happen in the runner, and the other two parts could be kept minimal and 
would rarely need to change (and thus require a reboot).

the need for a reboot could be detected by noticing that the compiled binary of 
the init process has changed compared to what is currently running as PID 1.
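
e.g. something along these lines (a crude heuristic sketch; assumes /proc is 
mounted, and that the path of the new binary comes from the reconfiguration):

  ;; does PID 1 still run the binary we are about to install?
  (define (reboot-needed? new-init-binary)
    (let ((running (false-if-exception (readlink "/proc/1/exe"))))
      (and running
           (not (string=? running new-init-binary)))))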

the driver process of a service could be reloaded/respawned the next time the 
daemon is stopped or quits unexpectedly.



recently i successfully wrote a shepherd service that spawns a daemon and, 
from a fiber, does two-way communication with the daemon using a pipe 
connected to the daemon's stdio. i guess that counts as a proof of concept for 
the second idea, but i'm not sure about its stability. a stuck/failing service 
is a different issue from a stuck/failing init process.
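
stripped to its skeleton, the shape of it is roughly this (a sketch, not 
the linked code; see the URLs below for the real thing; assumes we are 
already inside run-fibers, as shepherd is):

  (use-modules (fibers)              ; spawn-fiber
               (ice-9 popen)         ; open-pipe*, OPEN_BOTH
               (ice-9 rdelim))       ; read-line

  (define (start-daemon+chatter command . args)
    ;; one port connected to both the daemon's stdin and stdout
    (let ((port (apply open-pipe* OPEN_BOTH command args)))
      (spawn-fiber
       (lambda ()
         (let loop ()
           (let ((line (read-line port)))
             (unless (eof-object? line)
               ;; parse LINE; possibly write a reply back to PORT
               (loop))))))
      port))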

for reference, the spawning of the daemon:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/services/swarm.scm#L315

the fiber's code that talks to it:

https://github.com/attila-lendvai/guix-crypto/blob/8f996239bb8c2a1103c3e54605faf680fe1ed093/src/guix-crypto/swarm-utils.scm#L133

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Dying societies accumulate laws like dying men accumulate remedies.”
— Nicolás Gómez Dávila (1913–1994), 'Escolios a un texto implícito'
