Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-03-10 Thread Vincent Blut

Control: tags -1 pending

Santiago, Michael, does that changelog entry about this issue feels 
clear to you?


https://salsa.debian.org/debian/chrony/-/blob/master/debian/changelog


On 2020-01-02T13:38+0100, Santiago Vila wrote:

Package: chrony
Version: 3.4-4
Severity: important

Dear maintainer:

Apparently, installing chrony does not ensure at all that it will work.

Google has moved from ntp in Debian 9 to chrony in Debian 10 for their
default Debian GCE images, and I discovered this on a lot of GCE
instances having a clock several minutes off.

The problem I found is very similar to the one described here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370

I believe the best summary of the problem was made by Michael Biebl
here:

https://github.com/systemd/systemd/issues/7104#issuecomment-471329392

Quoting Michael:

As it stands, the current practice of having systemd-timesyncd.service
enabled by default (in Debian) and alternative implementations like
chrony or ntpd declare Conflicts=systemd-timesyncd.service in their
service file does not work reliably.



AFAIK, this has been fixed on the systemd side in version 241-3 by
dropping the "Conflicts" systemd had on chrony or ntpd.

Unfortunately, AFAIK, conflicts are bi-directional, so apparently the
problem will persist in buster as far as chrony still has conflicts
in the systemd unit file.

I've noticed this problem happens randomly (it happens in some
instances, it does not happens in some others), so I don't have a
"recipe" as such to reproduce it.

However, I have a particular instance at GCE showing this behaviour
which I could try to clone to give you ssh access if required (please
contact me privately for details).

The behaviour is the following:

Both systemd-timesyncd and chrony are enabled (which is the default
on GCE instances). Just after a reboot, "systemctl status chrony" shows this:

● chrony.service - chrony, an NTP client/server
  Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: 
enabled)
  Active: inactive (dead)
  Docs: man:chronyd(8)
  man:chronyc(1)
  man:chrony.conf(5)

and I see this in the boot log:

systemd[1]: Condition check resulted in Network Time Synchronization being 
skipped.

If I do "systemctl disable systemd-timesyncd" and reboot, chrony is
properly loaded and it runs.

If I do "systemctl enable systemd-timesyncd" again and reboot, chrony
is shown as "inactive (dead)" again.

At this point, if I edit 
/etc/systemd/system/multi-user.target.wants/chrony.service
to remove the Conflicts line and reboot, chrony is properly loaded and it runs.

[ I'm reporting this as "important" because I believe it to be the kind
 of bug that should be fixed in a point release of Debian 10 ].

[ Cc to Michael Biebl in case he would like to comment on the issue ].

Thanks.


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-02-01 Thread Santiago Vila
On Sat, Feb 01, 2020 at 10:46:24PM +0100, Michael Biebl wrote:

> > Would it make sense to ship systemd-timesyncd disabled by default in
> > buster and add a README to enable it only if the user really decides
> > to enable it? (Maybe also documenting this in the release notes).
> 
> I think this would break more setups then it fixes.
> The default behaviour of systemd-timesyncd has been since two releases
> to be enabled by default. We can't easily change that.
> 
> > That would be the most simple solution for stable that I can think,
> > as it would reduce the number of packages to change to just one.
> 
> Unfortunately I think that disabling systemd-timesyncd by default is one
> of the most intrusive changes. After all, systemd is installed by
> default (and thus systemd-timesyncd enabled by default). I fear this is
> a no-go.

Ok, in such case, the only other solution which comes to mind is what you
proposed in the message I was replying to, i.e. this:

> I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
> and virtualbox [1] to drop the Conflicts= line again.

If I'm not mistaken, this is how it was done in stretch, so the fix
would be as "conservative" as it can be.

I would not worry about the number of packages that need to be changed
being "high". If you as systemd maintainer believe that this is the best
solution for buster, I would hope that the other maintainers agree.

Thanks.



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-02-01 Thread Vincent Blut

On 2020-02-01T22:46+0100, Michael Biebl wrote:

Am 01.02.20 um 22:37 schrieb Santiago Vila:

On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote:


I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
and virtualbox [1] to drop the Conflicts= line again.
Maybe we should even do that for buster via a stable upload?

Thoughts?


Hi. I definitely think this should be fixed in stable, in whatever way
it's considered best for stable.

The last thing a system admin would expect from Debian stable is that
the clock is off by several minutes in a system where a time-keeping
package like ntp or chrony is present. This was completely unexpected
for me. (In fact, I would have reported this as serious but I prefer
to concentrate on finding a good fix).

Regarding the proper fix: Anything which makes chrony and ntp work
again (without surprises) would do. I agree that the less intrusive
the change, the better.

In the Debian 10 instances at GCE where I found this I just did this:

systemctl disable systemd-timesyncd

Would it make sense to ship systemd-timesyncd disabled by default in
buster and add a README to enable it only if the user really decides
to enable it? (Maybe also documenting this in the release notes).


I think this would break more setups then it fixes.
The default behaviour of systemd-timesyncd has been since two releases
to be enabled by default. We can't easily change that.


That would be the most simple solution for stable that I can think,
as it would reduce the number of packages to change to just one.


Unfortunately I think that disabling systemd-timesyncd by default is one
of the most intrusive changes. After all, systemd is installed by
default (and thus systemd-timesyncd enabled by default). I fear this is
a no-go.


I wholeheartedly agree with you, Michael. To me having an {S}NTP 
implementation by default is a must. Disabling systemd-timesyncd at this 
point would probably seen as a serious regression.


Despite what Balint thinks, I think that we, maintainers of alternative 
NTP implementations, should just stop conflicting with 
systemd-timesyncd.service on stable and oldstable. Regarding Bulleye, 
that should not be an issue if Balint and Michael’s work is pushed in 
the archive.


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-02-01 Thread Michael Biebl
Am 01.02.20 um 22:37 schrieb Santiago Vila:
> On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote:
> 
>> I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
>> and virtualbox [1] to drop the Conflicts= line again.
>> Maybe we should even do that for buster via a stable upload?
>>
>> Thoughts?
> 
> Hi. I definitely think this should be fixed in stable, in whatever way
> it's considered best for stable.
> 
> The last thing a system admin would expect from Debian stable is that
> the clock is off by several minutes in a system where a time-keeping
> package like ntp or chrony is present. This was completely unexpected
> for me. (In fact, I would have reported this as serious but I prefer
> to concentrate on finding a good fix).
> 
> Regarding the proper fix: Anything which makes chrony and ntp work
> again (without surprises) would do. I agree that the less intrusive
> the change, the better.
> 
> In the Debian 10 instances at GCE where I found this I just did this:
> 
> systemctl disable systemd-timesyncd
> 
> Would it make sense to ship systemd-timesyncd disabled by default in
> buster and add a README to enable it only if the user really decides
> to enable it? (Maybe also documenting this in the release notes).

I think this would break more setups then it fixes.
The default behaviour of systemd-timesyncd has been since two releases
to be enabled by default. We can't easily change that.

> That would be the most simple solution for stable that I can think,
> as it would reduce the number of packages to change to just one.

Unfortunately I think that disabling systemd-timesyncd by default is one
of the most intrusive changes. After all, systemd is installed by
default (and thus systemd-timesyncd enabled by default). I fear this is
a no-go.




signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-02-01 Thread Santiago Vila
On Sun, Jan 12, 2020 at 08:41:19PM +0100, Michael Biebl wrote:

> I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
> and virtualbox [1] to drop the Conflicts= line again.
> Maybe we should even do that for buster via a stable upload?
> 
> Thoughts?

Hi. I definitely think this should be fixed in stable, in whatever way
it's considered best for stable.

The last thing a system admin would expect from Debian stable is that
the clock is off by several minutes in a system where a time-keeping
package like ntp or chrony is present. This was completely unexpected
for me. (In fact, I would have reported this as serious but I prefer
to concentrate on finding a good fix).

Regarding the proper fix: Anything which makes chrony and ntp work
again (without surprises) would do. I agree that the less intrusive
the change, the better.

In the Debian 10 instances at GCE where I found this I just did this:

systemctl disable systemd-timesyncd

Would it make sense to ship systemd-timesyncd disabled by default in
buster and add a README to enable it only if the user really decides
to enable it? (Maybe also documenting this in the release notes).

That would be the most simple solution for stable that I can think,
as it would reduce the number of packages to change to just one.

Thanks.



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-13 Thread Vincent Blut

Hi Feri,

On 2020-01-09T19:31+0100, wf...@niif.hu wrote:

Hi,

Don't you think #873057 is actually the same issue discovered with ntp?
Please consider merging it in if you agree.


Indeed, this is the same issue.


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Vincent Blut

On 2020-01-13T00:48+0100, Michael Biebl wrote:

Am 13.01.20 um 00:43 schrieb Michael Biebl:

Am 13.01.20 um 00:05 schrieb Vincent Blut:

On 2020-01-12T23:24+0100, Michael Biebl wrote:

Am 12.01.20 um 23:08 schrieb Vincent Blut:



We probably should not keep this hack forever, but instead let timedated
read known NTP implementation unit names from
usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
systemd 243. I added the necessary bits in chrony 3.5-5.


Hm, I don't see how this change in timedate would actually help in this
situation.
Support for ntp-units.d in timedated/timedatectl just enables that if
you use "timedatectl set-ntp true|false" it will prefer alternatives if
installed.


Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when
starting?
I thought that was the case and that it remained inactive in case an
NTP implementation with a higher priority was found there.


Reading through the diff at
https://github.com/systemd/systemd/pull/13136 that doesn't seem to be
the case. ntp-units.d is only used by timedated, not timesyncd.




Fwiw, there is this related MR

https://salsa.debian.org/systemd-team/systemd/merge_requests/61


I mentioned another bug report about a similar issue with ntp.


Ideally, only one NTP client should be enabled at a time.


Note that we restore timesyncd (if it is enabled) on chrony removal to 
not wait for the next reboot to get the clock synchronized again, so if 
having only one NTP client enabled at a time is where we are heading 
for, then I’ll have to drop that code snippet.


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 13.01.20 um 00:43 schrieb Michael Biebl:
> Am 13.01.20 um 00:05 schrieb Vincent Blut:
>> On 2020-01-12T23:24+0100, Michael Biebl wrote:
>>> Am 12.01.20 um 23:08 schrieb Vincent Blut:
> 
 We probably should not keep this hack forever, but instead let timedated
 read known NTP implementation unit names from
 usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
 systemd 243. I added the necessary bits in chrony 3.5-5.
>>>
>>> Hm, I don't see how this change in timedate would actually help in this
>>> situation.
>>> Support for ntp-units.d in timedated/timedatectl just enables that if
>>> you use "timedatectl set-ntp true|false" it will prefer alternatives if
>>> installed.
>>
>> Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when
>> starting?
>> I thought that was the case and that it remained inactive in case an
>> NTP implementation with a higher priority was found there.
> 
> Reading through the diff at
> https://github.com/systemd/systemd/pull/13136 that doesn't seem to be
> the case. ntp-units.d is only used by timedated, not timesyncd.
> 
> 

Fwiw, there is this related MR

https://salsa.debian.org/systemd-team/systemd/merge_requests/61

Ideally, only one NTP client should be enabled at a time.



signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 13.01.20 um 00:05 schrieb Vincent Blut:
> On 2020-01-12T23:24+0100, Michael Biebl wrote:
>> Am 12.01.20 um 23:08 schrieb Vincent Blut:

>>> We probably should not keep this hack forever, but instead let timedated
>>> read known NTP implementation unit names from
>>> usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
>>> systemd 243. I added the necessary bits in chrony 3.5-5.
>>
>> Hm, I don't see how this change in timedate would actually help in this
>> situation.
>> Support for ntp-units.d in timedated/timedatectl just enables that if
>> you use "timedatectl set-ntp true|false" it will prefer alternatives if
>> installed.
> 
> Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when
> starting?
> I thought that was the case and that it remained inactive in case an
> NTP implementation with a higher priority was found there.

Reading through the diff at
https://github.com/systemd/systemd/pull/13136 that doesn't seem to be
the case. ntp-units.d is only used by timedated, not timesyncd.




signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Vincent Blut

On 2020-01-12T23:24+0100, Michael Biebl wrote:

Am 12.01.20 um 23:08 schrieb Vincent Blut:

On 2020-01-12T20:41+0100, Michael Biebl wrote:

Am 12.01.20 um 20:15 schrieb Santiago Vila:

My theory is that this is some kind of race condition.

I initially cloned the machine from another one where this happened.

Then I discovered that the problem also happens (randomly) in a brand
new machine if I copy the journal from the "bad" machine.

However, there is nothing special about the journal (or at least
"journalctl --verify" says it's ok), except maybe that it's several
megabytes long.

Could it be that systemd spends some time processing the journal at
boot time and this is what triggers the race condition?


On my buster system:



an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
at job systemd-timesyncd.service/stop conflicted_by=yes
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
at job systemd-timesyncd.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing
conflicting jobs
systemd-timesyncd.service/stop,systemd-timesyncd.service/start by
deleting job systemd-timesyncd.service/start
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
chrony.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
chrony.service/stop conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting
jobs chrony.service/start,chrony.service/stop by deleting job
chrony.service/stop
Jan 12 20:04:46 debian systemd[1]: Found redundant job
systemd-timesyncd.service/stop, dropping from transaction.



Those lines are missing on GCE system. Instead I see

Jan 12 17:02:01 d1 systemd[1]: Keeping job
systemd-timesyncd.service/start because of sysinit.target/start
Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop
because of systemd-timesyncd.service/start
Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed
new job systemd-timesyncd.service/start as 119
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82
chrony.service/start finished, result=canceled
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job
chrony.service/stop as 121
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121
chrony.service/stop finished, result=done



The problem is, that the transaction is computed *before*
ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
are evaluation just before the binary is executed) and it might indeed
depend on the ordering, in which the jobs are scheduled.


Makes sense. Thanks for debugging this.


So, the simplest fix would be to drop the line

Conflicts=systemd-timesyncd.service openntpd.service ntp.service
ntpsec.service

from chrony.service.
This way, systemd will schedule the start of both services.
chrony.service will be started properly and for
systemd-timesyncd.service the ConditionFileIsExecutable will kick in.


If there is no risk of regression, then I’m all for making this change.


I don't see a risk for a regression, given that systemd in buster ships
disable-with-time-daemon.conf.
So if the drop the Conflicts again, we'd basically have the situation
again as in stretch.


Not for chrony and openntpd. Both were already in conflict with 
systemd-timesyncd.service in Stretch.



That said, we should probably first upload this change to sid to give it
some wider exposure first.


Sure.

We probably should not keep this hack forever, but instead let 
timedated

read known NTP implementation unit names from
usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
systemd 243. I added the necessary bits in chrony 3.5-5.


Hm, I don't see how this change in timedate would actually help in this
situation.
Support for ntp-units.d in timedated/timedatectl just enables that if
you use "timedatectl set-ntp true|false" it will prefer alternatives if
installed.


Doesn’t systemd-timesyncd look for foreign services in ntp-units.d/ when 
starting?
I thought that was the case and that it remained inactive in case an 
NTP implementation with a higher priority was found there.



We don't actually use "timedatectl set-ntp true|false" though in our
maintainer scripts though (and I don't think we should).


Agreed.


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 12.01.20 um 23:08 schrieb Vincent Blut:
> On 2020-01-12T20:41+0100, Michael Biebl wrote:
>> Am 12.01.20 um 20:15 schrieb Santiago Vila:
>>> My theory is that this is some kind of race condition.
>>>
>>> I initially cloned the machine from another one where this happened.
>>>
>>> Then I discovered that the problem also happens (randomly) in a brand
>>> new machine if I copy the journal from the "bad" machine.
>>>
>>> However, there is nothing special about the journal (or at least
>>> "journalctl --verify" says it's ok), except maybe that it's several
>>> megabytes long.
>>>
>>> Could it be that systemd spends some time processing the journal at
>>> boot time and this is what triggers the race condition?
>>
>> On my buster system:
>>
>>
>>> an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
>>> at job systemd-timesyncd.service/stop conflicted_by=yes
>>> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking
>>> at job systemd-timesyncd.service/start conflicted_by=no
>>> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing
>>> conflicting jobs
>>> systemd-timesyncd.service/stop,systemd-timesyncd.service/start by
>>> deleting job systemd-timesyncd.service/start
>>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
>>> chrony.service/start conflicted_by=no
>>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job
>>> chrony.service/stop conflicted_by=no
>>> Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting
>>> jobs chrony.service/start,chrony.service/stop by deleting job
>>> chrony.service/stop
>>> Jan 12 20:04:46 debian systemd[1]: Found redundant job
>>> systemd-timesyncd.service/stop, dropping from transaction.
>>
>>
>> Those lines are missing on GCE system. Instead I see
>>> Jan 12 17:02:01 d1 systemd[1]: Keeping job
>>> systemd-timesyncd.service/start because of sysinit.target/start
>>> Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop
>>> because of systemd-timesyncd.service/start
>>> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed
>>> new job systemd-timesyncd.service/start as 119
>>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82
>>> chrony.service/start finished, result=canceled
>>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job
>>> chrony.service/stop as 121
>>> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121
>>> chrony.service/stop finished, result=done
>>
>>
>> The problem is, that the transaction is computed *before*
>> ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
>> are evaluation just before the binary is executed) and it might indeed
>> depend on the ordering, in which the jobs are scheduled.
> 
> Makes sense. Thanks for debugging this.
> 
>> So, the simplest fix would be to drop the line
>>> Conflicts=systemd-timesyncd.service openntpd.service ntp.service
>>> ntpsec.service
>> from chrony.service.
>> This way, systemd will schedule the start of both services.
>> chrony.service will be started properly and for
>> systemd-timesyncd.service the ConditionFileIsExecutable will kick in.
> 
> If there is no risk of regression, then I’m all for making this change.

I don't see a risk for a regression, given that systemd in buster ships
disable-with-time-daemon.conf.
So if the drop the Conflicts again, we'd basically have the situation
again as in stretch.

That said, we should probably first upload this change to sid to give it
some wider exposure first.


> We probably should not keep this hack forever, but instead let timedated
> read known NTP implementation unit names from
> usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in
> systemd 243. I added the necessary bits in chrony 3.5-5.

Hm, I don't see how this change in timedate would actually help in this
situation.
Support for ntp-units.d in timedated/timedatectl just enables that if
you use "timedatectl set-ntp true|false" it will prefer alternatives if
installed.

We don't actually use "timedatectl set-ntp true|false" though in our
maintainer scripts though (and I don't think we should).





signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Vincent Blut

On 2020-01-12T20:41+0100, Michael Biebl wrote:

Am 12.01.20 um 20:15 schrieb Santiago Vila:

My theory is that this is some kind of race condition.

I initially cloned the machine from another one where this happened.

Then I discovered that the problem also happens (randomly) in a brand
new machine if I copy the journal from the "bad" machine.

However, there is nothing special about the journal (or at least
"journalctl --verify" says it's ok), except maybe that it's several
megabytes long.

Could it be that systemd spends some time processing the journal at
boot time and this is what triggers the race condition?


On my buster system:



an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
systemd-timesyncd.service/stop conflicted_by=yes
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
systemd-timesyncd.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing 
conflicting jobs systemd-timesyncd.service/stop,systemd-timesyncd.service/start 
by deleting job systemd-timesyncd.service/start
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
chrony.service/start conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
chrony.service/stop conflicted_by=no
Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs 
chrony.service/start,chrony.service/stop by deleting job chrony.service/stop
Jan 12 20:04:46 debian systemd[1]: Found redundant job 
systemd-timesyncd.service/stop, dropping from transaction.



Those lines are missing on GCE system. Instead I see

Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start 
because of sysinit.target/start
Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of 
systemd-timesyncd.service/start
Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job 
systemd-timesyncd.service/start as 119
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start 
finished, result=canceled
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job 
chrony.service/stop as 121
Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop 
finished, result=done



The problem is, that the transaction is computed *before*
ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
are evaluation just before the binary is executed) and it might indeed
depend on the ordering, in which the jobs are scheduled.


Makes sense. Thanks for debugging this.


So, the simplest fix would be to drop the line

Conflicts=systemd-timesyncd.service openntpd.service ntp.service ntpsec.service

from chrony.service.
This way, systemd will schedule the start of both services.
chrony.service will be started properly and for
systemd-timesyncd.service the ConditionFileIsExecutable will kick in.


If there is no risk of regression, then I’m all for making this change.


I'm sorry, that I first asked chrony (and other NTP implementations) to
add a Conflicts=systemd-timesyncd.service, hoping I could get rid of the
/lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
hack.


Don’t feel sorry Michael, that was even requested by systemd upstream.
From systemd changelog:
“… Alternative NTP implementations should add a   
Conflicts=systemd-timesyncd.service to their unit files to take over and 
replace systemd's NTP default functionality.”



But it appears, this hack is actually much more reliable then using
Conflicts=
See
https://github.com/systemd/systemd/issues/7104#issuecomment-456755980
which explains the problem. We'd need some kind of asymmetrical conflicts.


We probably should not keep this hack forever, but instead let timedated 
read known NTP implementation unit names from 
usr/lib/systemd/ntp-units.d/*.list since this has been reintroduced in 
systemd 243. I added the necessary bits in chrony 3.5-5.


By the way, it would be great to know if this issue can be triggered by 
upgrading chrony to the 3.5-5 version (and with systemd >= 243-1).



I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
and virtualbox [1] to drop the Conflicts= line again.
Maybe we should even do that for buster via a stable upload?

Thoughts?


Seems good.


Michael


Cheers,
Vincent


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 12.01.20 um 20:15 schrieb Santiago Vila:
> My theory is that this is some kind of race condition.
> 
> I initially cloned the machine from another one where this happened.
> 
> Then I discovered that the problem also happens (randomly) in a brand
> new machine if I copy the journal from the "bad" machine.
> 
> However, there is nothing special about the journal (or at least
> "journalctl --verify" says it's ok), except maybe that it's several
> megabytes long.
> 
> Could it be that systemd spends some time processing the journal at
> boot time and this is what triggers the race condition?

On my buster system:


> an 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
> systemd-timesyncd.service/stop conflicted_by=yes
> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Looking at job 
> systemd-timesyncd.service/start conflicted_by=no
> Jan 12 20:04:46 debian systemd[1]: systemd-timesyncd.service: Fixing 
> conflicting jobs 
> systemd-timesyncd.service/stop,systemd-timesyncd.service/start by deleting 
> job systemd-timesyncd.service/start
> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
> chrony.service/start conflicted_by=no
> Jan 12 20:04:46 debian systemd[1]: chrony.service: Looking at job 
> chrony.service/stop conflicted_by=no
> Jan 12 20:04:46 debian systemd[1]: chrony.service: Fixing conflicting jobs 
> chrony.service/start,chrony.service/stop by deleting job chrony.service/stop
> Jan 12 20:04:46 debian systemd[1]: Found redundant job 
> systemd-timesyncd.service/stop, dropping from transaction.


Those lines are missing on GCE system. Instead I see
> Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start 
> because of sysinit.target/start
> Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job 
> systemd-timesyncd.service/start as 119
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start 
> finished, result=canceled
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job 
> chrony.service/stop as 121
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop 
> finished, result=done


The problem is, that the transaction is computed *before*
ConditionFileIsExecutable=!/usr/sbin/chronyd is evaluated (conditions
are evaluation just before the binary is executed) and it might indeed
depend on the ordering, in which the jobs are scheduled.

So, the simplest fix would be to drop the line
> Conflicts=systemd-timesyncd.service openntpd.service ntp.service 
> ntpsec.service
from chrony.service.
This way, systemd will schedule the start of both services.
chrony.service will be started properly and for
systemd-timesyncd.service the ConditionFileIsExecutable will kick in.

I'm sorry, that I first asked chrony (and other NTP implementations) to
add a Conflicts=systemd-timesyncd.service, hoping I could get rid of the
/lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
hack.

But it appears, this hack is actually much more reliable then using
Conflicts=
See
https://github.com/systemd/systemd/issues/7104#issuecomment-456755980
which explains the problem. We'd need some kind of asymmetrical conflicts.


I guess at this point it is best to ask chrony, ntp, openntpd, ntpsec
and virtualbox [1] to drop the Conflicts= line again.
Maybe we should even do that for buster via a stable upload?

Thoughts?

Michael


[1]
https://codesearch.debian.net/search?q=Conflicts%3Dsystemd-timesyncd.service=1



signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Santiago Vila
My theory is that this is some kind of race condition.

I initially cloned the machine from another one where this happened.

Then I discovered that the problem also happens (randomly) in a brand
new machine if I copy the journal from the "bad" machine.

However, there is nothing special about the journal (or at least
"journalctl --verify" says it's ok), except maybe that it's several
megabytes long.

Could it be that systemd spends some time processing the journal at
boot time and this is what triggers the race condition?

Thanks.



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 12.01.20 um 18:18 schrieb Michael Biebl:
> Am 04.01.20 um 20:22 schrieb Santiago Vila:
>> Hi. I managed to put the instances in a state where a reboot does not
>> make the problem to go away. Please give it a try.
> 
> I could reproduce the issue on the system provided by Santiago.
> 
> I'm dumping the relevant parts of the log here:

From a buster system of mine which has chrony and systemd-timesyncd
installed and enabled:

> ● chrony.service - chrony, an NTP client/server
>Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor 
> preset: enabled)
>Active: active (running) since Sun 2020-01-12 19:48:09 CET; 1min 15s ago
>  Docs: man:chronyd(8)
>man:chronyc(1)
>man:chrony.conf(5)
>   Process: 395 ExecStart=/usr/sbin/chronyd $DAEMON_OPTS (code=exited, 
> status=0/SUCCESS)
>   Process: 404 ExecStartPost=/usr/lib/chrony/chrony-helper update-daemon 
> (code=exited, status=0/SUCCESS)
>  Main PID: 401 (chronyd)
> Tasks: 2 (limit: 1149)
>Memory: 3.3M
>CGroup: /system.slice/chrony.service
>├─401 /usr/sbin/chronyd -F -1
>└─403 /usr/sbin/chronyd -F -1
> 
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Control process exited, 
> code=exited, status=0/SUCCESS
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Got final SIGCHLD for 
> state start-post.
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Changed start-post -> 
> running
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Job 75 
> chrony.service/start finished, result=done
> Jan 12 19:48:09 debian systemd[1]: Started chrony, an NTP client/server.
> Jan 12 19:48:14 debian systemd[1]: chrony.service: chrony.service/stop would 
> stop a running service.
> Jan 12 19:48:14 debian systemd[1]: chrony.service: Deleting 
> chrony.service/stop to minimize impact.
> Jan 12 19:48:17 debian chronyd[401]: Selected source 162.159.200.1
> Jan 12 19:48:18 debian chronyd[401]: Selected source 173.249.58.145
> Jan 12 19:48:19 debian chronyd[401]: Source 129.70.132.37 replaced with 
> 2a01:4f8:110:12c8::123
> 
> ● systemd-timesyncd.service - Network Time Synchronization
>Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; 
> vendor preset: enabled)
>   Drop-In: /usr/lib/systemd/system/systemd-timesyncd.service.d
>└─disable-with-time-daemon.conf
>Active: inactive (dead)
>  Docs: man:systemd-timesyncd.service(8)



> Jan 12 19:48:08 debian systemd-sysv-generator[232]: Native unit for 
> chrony.service already exists, skipping.
> Jan 12 19:48:09 debian audit[296]: AVC apparmor="STATUS" 
> operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" 
> pid=296 comm="apparmor_parser"
> Jan 12 19:48:09 debian kernel: audit: type=1400 audit(1578854889.112:7): 
> apparmor="STATUS" operation="profile_load" profile="unconfined" 
> name="/usr/sbin/chronyd" pid=296 comm="apparmor_parser"
> Jan 12 19:48:09 debian chronyd[401]: chronyd version 3.4 starting (+CMDMON 
> +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +SECHASH +IPV6 
> -DEBUG)
> Jan 12 19:48:09 debian chronyd[401]: Frequency 0.000 +/- 100.000 ppm read 
> from /var/lib/chrony/chrony.drift
> Jan 12 19:48:09 debian chronyd[401]: Loaded seccomp filter
> Jan 12 19:48:09 debian systemd[1]: chrony.service: 
> ConditionCapability=CAP_SYS_TIME succeeded.
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Passing 0 fds to service
> Jan 12 19:48:09 debian systemd[1]: chrony.service: About to execute: 
> /usr/sbin/chronyd $DAEMON_OPTS
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Forked /usr/sbin/chronyd 
> as 395
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Changed dead -> start
> Jan 12 19:48:09 debian systemd[1]: Starting chrony, an NTP client/server...
> Jan 12 19:48:09 debian systemd[395]: Successfully mounted 
> /tmp/systemd-private-c32de5dec7564bdd85daa27163fecfdb-chrony.service-2LqUhJ/tmp
>  to /run/systemd/unit-root/tmp
> Jan 12 19:48:09 debian systemd[395]: Successfully mounted 
> /var/tmp/systemd-private-c32de5dec7564bdd85daa27163fecfdb-chrony.service-aYqukO/tmp
>  to /run/systemd/unit-root/var/tmp
> Jan 12 19:48:09 debian systemd[395]: chrony.service: Executing: 
> /usr/sbin/chronyd -F -1
> Jan 12 19:48:09 debian systemd[1]: Received SIGCHLD from PID 400 (chronyd).
> Jan 12 19:48:09 debian systemd[1]: Child 395 (chronyd) died (code=exited, 
> status=0/SUCCESS)
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Child 395 belongs to 
> chrony.service.
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Control process exited, 
> code=exited, status=0/SUCCESS
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Got final SIGCHLD for 
> state start.
> Jan 12 19:48:09 debian systemd[1]: chrony.service: New main PID 401 belongs 
> to service, we are happy.
> Jan 12 19:48:09 debian systemd[1]: chrony.service: Main PID loaded: 401
> Jan 12 19:48:09 debian systemd[1]: chrony.service: About to execute: 
> /usr/lib/chrony/chrony-helper update-daemon
> 

Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-12 Thread Michael Biebl
Am 04.01.20 um 20:22 schrieb Santiago Vila:
> Hi. I managed to put the instances in a state where a reboot does not
> make the problem to go away. Please give it a try.

I could reproduce the issue on the system provided by Santiago.

I'm dumping the relevant parts of the log here:

> root@d1:~# journalctl -b | egrep "chrony.service|timesyncd.service"
> Jan 12 17:01:51 d1 systemd-sysv-generator[202]: Native unit for 
> chrony.service already exists, skipping.
> Jan 12 17:02:01 d1 systemd[1]: Pulling in systemd-timesyncd.service/start 
> from sysinit.target/start
> Jan 12 17:02:01 d1 systemd[1]: Added job systemd-timesyncd.service/start to 
> transaction.
> Jan 12 17:02:01 d1 systemd[1]: Pulling in system.slice/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: Pulling in -.mount/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: Pulling in time-sync.target/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: Pulling in shutdown.target/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: Pulling in chrony.service/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: Added job chrony.service/stop to transaction.
> Jan 12 17:02:02 d1 ifup[263]: Executing dbus call 
> org.freedesktop.systemd1.Manager TryRestartUnit(systemd-timesyncd.service, 
> replace)
> Jan 12 17:02:01 d1 systemd[1]: Keeping job systemd-timesyncd.service/start 
> because of sysinit.target/start
> Jan 12 17:02:01 d1 systemd[1]: Keeping job chrony.service/stop because of 
> systemd-timesyncd.service/start
> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Installed new job 
> systemd-timesyncd.service/start as 119
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 82 chrony.service/start 
> finished, result=canceled
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Installed new job 
> chrony.service/stop as 121
> Jan 12 17:02:01 d1 systemd[1]: chrony.service: Job 121 chrony.service/stop 
> finished, result=done
> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: 
> ConditionFileIsExecutable=!/usr/sbin/chronyd failed.
> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Starting requested 
> but condition failed. Not starting unit.
> Jan 12 17:02:01 d1 systemd[1]: systemd-timesyncd.service: Job 119 
> systemd-timesyncd.service/start finished, result=done
> Jan 12 17:02:02 d1 ifup[263]: Got result done/Success for job 
> systemd-timesyncd.service
> Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Trying to enqueue 
> job systemd-timesyncd.service/try-restart/replace
> Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/nop to 
> transaction.
> Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Installed new job 
> systemd-timesyncd.service/nop as 155
> Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Enqueued job 
> systemd-timesyncd.service/nop as 155
> Jan 12 17:02:02 d1 systemd[1]: Pulling in systemd-timesyncd.service/start 
> from sysinit.target/start
> Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/start to 
> transaction.
> Jan 12 17:02:02 d1 systemd[1]: Pulling in system.slice/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in -.mount/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in time-sync.target/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in shutdown.target/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in chrony.service/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Added job chrony.service/stop to transaction.
> Jan 12 17:02:02 d1 systemd[1]: Found redundant job chrony.service/stop, 
> dropping from transaction.
> Jan 12 17:02:02 d1 systemd[1]: Garbage collecting job 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: systemd-timesyncd.service: Job 155 
> systemd-timesyncd.service/nop finished, result=done
> Jan 12 17:02:02 d1 systemd[1]: Pulling in systemd-timesyncd.service/start 
> from sysinit.target/start
> Jan 12 17:02:02 d1 systemd[1]: Added job systemd-timesyncd.service/start to 
> transaction.
> Jan 12 17:02:02 d1 systemd[1]: Pulling in system.slice/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in -.mount/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in time-sync.target/start from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in shutdown.target/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Pulling in chrony.service/stop from 
> systemd-timesyncd.service/start
> Jan 12 17:02:02 d1 systemd[1]: Added job chrony.service/stop to transaction.
> Jan 12 17:02:02 d1 systemd[1]: Found redundant job chrony.service/stop, 
> dropping from transaction.
> Jan 12 17:02:02 d1 

Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-09 Thread wferi
Hi,

Don't you think #873057 is actually the same issue discovered with ntp?
Please consider merging it in if you agree.
-- 
Thanks,
Feri



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-04 Thread Santiago Vila
On Fri, Jan 03, 2020 at 11:21:42PM +0100, Vincent Blut wrote:

> > Unfortunately, AFAIK, conflicts are bi-directional, so apparently the
> > problem will persist in buster as far as chrony still has conflicts
> > in the systemd unit file.
> 
> What do you mean by “conflicts are bi-directional”?

I meant that it is enough that service "A" declares a conflict with service "B"
so that both of them conflict with the other, i.e. they have
the "symmetric property" even if the conflicts is only declared once.

At least that's how I interpret the words "vice versa" here:

Conflicts=

A space-separated list of unit names. Configures negative
requirement dependencies. If a unit has a Conflicts= setting on
another unit, starting the former will stop the latter and vice versa.

So: If we tried to solve this in buster by dropping the use of
conflicts, and chrony still uses conflicts, maybe that would explain
that the problem is still there.

> Also, conflicting with systemd-timesyncd doesn’t seem to cause any issue on
> most systems (well, I hope ;-), so we should be cautious about incriminating
> “Conflicts= systemd-timesyncd.service” use as the root cause.

You are right. We should be cautious.

I really don't know which is the root cause, but I can reproduce this
failure in as many GCE machines as desired via cloning.

So I have just created two different machines for you and Michael to
see this by yourself (please check your private email for access details).

My suspicious is that this is some sort of race condition, but
I am not expert enough on systemd stuff to debug it.

Thanks.



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-03 Thread Vincent Blut

Hi Santiago,

On 2020-01-02T13:38+0100, Santiago Vila wrote:

Package: chrony
Version: 3.4-4
Severity: important

Dear maintainer:

Apparently, installing chrony does not ensure at all that it will work.

Google has moved from ntp in Debian 9 to chrony in Debian 10 for their
default Debian GCE images, and I discovered this on a lot of GCE
instances having a clock several minutes off.

The problem I found is very similar to the one described here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370


Indeed, it sounds pretty similar.


I believe the best summary of the problem was made by Michael Biebl
here:

https://github.com/systemd/systemd/issues/7104#issuecomment-471329392

Quoting Michael:

As it stands, the current practice of having systemd-timesyncd.service
enabled by default (in Debian) and alternative implementations like
chrony or ntpd declare Conflicts=systemd-timesyncd.service in their
service file does not work reliably.



AFAIK, this has been fixed on the systemd side in version 241-3 by
dropping the "Conflicts" systemd had on chrony or ntpd.


Exact, the Debian systemd maintainers reintroduced the following drop-in 
file in version 241-3:


$ cat 
/lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
[Unit]
ConditionFileIsExecutable=!/usr/sbin/ntpd
ConditionFileIsExecutable=!/usr/sbin/openntpd
ConditionFileIsExecutable=!/usr/sbin/chronyd
ConditionFileIsExecutable=!/usr/sbin/VBoxService

It prevents systemd-timesyncd from starting if one of the above
executables is present on the system.


Unfortunately, AFAIK, conflicts are bi-directional, so apparently the
problem will persist in buster as far as chrony still has conflicts
in the systemd unit file.


What do you mean by “conflicts are bi-directional”?

Also, conflicting with systemd-timesyncd doesn’t seem to cause any issue 
on most systems (well, I hope ;-), so we should be cautious about 
incriminating “Conflicts= systemd-timesyncd.service” use as the root 
cause.


Would you please tell me how things go when removing 
“ConditionFileIsExecutable=!/usr/sbin/chronyd” from the 
systemd-timesyncd service unit? Does that make chrony happy?



Thanks.


Cheers,
Vincent


signature.asc
Description: PGP signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-03 Thread Michael Biebl
Am 03.01.20 um 18:06 schrieb Santiago Vila:
> reassign 947936 chrony,systemd
> thanks
> 
> I am unsure at this point where is exactly the problem.
> 
> I have several test instances at GCE (running buster) where chrony
> does not start even if it's enabled.
> 
> By trial and error I managed to make chrony to start again
> properly if I do this:
> 
> journalctl --rotate
> init 6
> 
> So the natural question would be: What kind of bug in systemd can make
> it not to start a service at all when there is a "bad" journal?

Can you provide instructions how this issue can be reproduced?

systemd in buster (and testing/sid) ships
/lib/systemd/system/systemd-timesyncd.service.d/disable-with-time-daemon.conf
which will prevent systemd-timesyncd from being started if chrony is
installed.
I'm running chrony here (with systemd-timesyncd being enabled) and it
works as expected.





signature.asc
Description: OpenPGP digital signature


Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-03 Thread Santiago Vila
reassign 947936 chrony,systemd
thanks

I am unsure at this point where is exactly the problem.

I have several test instances at GCE (running buster) where chrony
does not start even if it's enabled.

By trial and error I managed to make chrony to start again
properly if I do this:

journalctl --rotate
init 6

So the natural question would be: What kind of bug in systemd can make
it not to start a service at all when there is a "bad" journal?

Thanks.



Bug#947936: chrony: Does (still) not start properly on boot on buster

2020-01-02 Thread Santiago Vila
Package: chrony
Version: 3.4-4
Severity: important

Dear maintainer:

Apparently, installing chrony does not ensure at all that it will work.

Google has moved from ntp in Debian 9 to chrony in Debian 10 for their
default Debian GCE images, and I discovered this on a lot of GCE
instances having a clock several minutes off.

The problem I found is very similar to the one described here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=933370

I believe the best summary of the problem was made by Michael Biebl
here:

https://github.com/systemd/systemd/issues/7104#issuecomment-471329392

Quoting Michael:
> As it stands, the current practice of having systemd-timesyncd.service
> enabled by default (in Debian) and alternative implementations like
> chrony or ntpd declare Conflicts=systemd-timesyncd.service in their
> service file does not work reliably.


AFAIK, this has been fixed on the systemd side in version 241-3 by
dropping the "Conflicts" systemd had on chrony or ntpd.

Unfortunately, AFAIK, conflicts are bi-directional, so apparently the
problem will persist in buster as far as chrony still has conflicts
in the systemd unit file.

I've noticed this problem happens randomly (it happens in some
instances, it does not happens in some others), so I don't have a
"recipe" as such to reproduce it.

However, I have a particular instance at GCE showing this behaviour
which I could try to clone to give you ssh access if required (please
contact me privately for details).

The behaviour is the following:

Both systemd-timesyncd and chrony are enabled (which is the default
on GCE instances). Just after a reboot, "systemctl status chrony" shows this:

● chrony.service - chrony, an NTP client/server
   Loaded: loaded (/lib/systemd/system/chrony.service; enabled; vendor preset: 
enabled)
   Active: inactive (dead)
   Docs: man:chronyd(8)
   man:chronyc(1)
   man:chrony.conf(5)

and I see this in the boot log:

systemd[1]: Condition check resulted in Network Time Synchronization being 
skipped.

If I do "systemctl disable systemd-timesyncd" and reboot, chrony is
properly loaded and it runs.

If I do "systemctl enable systemd-timesyncd" again and reboot, chrony
is shown as "inactive (dead)" again.

At this point, if I edit 
/etc/systemd/system/multi-user.target.wants/chrony.service
to remove the Conflicts line and reboot, chrony is properly loaded and it runs.

[ I'm reporting this as "important" because I believe it to be the kind
  of bug that should be fixed in a point release of Debian 10 ].

[ Cc to Michael Biebl in case he would like to comment on the issue ].

Thanks.