On May 15, 2014, at 3:55 PM, Lennart Poettering <lenn...@poettering.net> wrote:

> On Thu, 15.05.14 12:01, Chris Murphy (li...@colorremedies.com) wrote:
> 
>> 
>> 
>> On May 12, 2014, at 9:58 AM, Chris Murphy <li...@colorremedies.com> wrote:
>> 
>>> 
>>> On May 12, 2014, at 7:06 AM, Kirill Elagin <kirela...@gmail.com> wrote:
>>> 
>>>> Could it be that all the boot ids are actually the same for some reason?
>>>> I had this issue in a container where systemd was reading boot_id from 
>>>> `/proc/sys/kernel/random/boot_id`, and since /proc was bind-mounted, 
>>>> boot_id was always the host's boot_id.
>>>> 
>>>> You can also run `journalctl -F _BOOT_ID` to see the set of all boot IDs 
>>>> recorded in the journal (this must agree with `journalctl --list-boots`).
>>> 
>>> 
>>> # journalctl --list-boots | wc -l
>>> 36
>>> [root@rawhide ~]# journalctl -F _BOOT_ID | wc -l
>>> 80
>>> # cat /proc/sys/kernel/random/boot_id
>>> 420fa190-e7dd-4cd7-b248-fd62417d7c02
>>> # reboot
>>> ###
>>> # journalctl --list-boots | wc -l
>>> 36
>>> [root@rawhide ~]# journalctl -F _BOOT_ID | wc -l
>>> 81
>>> # cat /proc/sys/kernel/random/boot_id
>>> 1e0d5346-85cb-477b-9ae2-2cfb53097b97
>>> 
>>> 
>>> So there are more boot IDs than --list-boots entries, and the --list-boots 
>>> count doesn't increment while the boot ID count does. And neither of the 
>>> boot IDs above matches any of the boot IDs in --list-boots.
>> 
>> Deleting the files in /var/log/journal/<machineid>/ fixes the problem on 
>> both the Fedora 20 and the Fedora Rawhide VMs that exhibit it. If the files 
>> are restored, the problem returns. Yet journalctl --verify shows PASS for 
>> each log file, so it seems there's some kind of corruption or confusion in 
>> these files that --verify doesn't catch.
>> 
>> The Btrfs filesystem is mounted with the autodefrag option, which has had 
>> some problems on kernel 3.12 and older. But the Rawhide system has only ever 
>> run 3.14 and 3.15 kernels, so if there's still an autodefrag problem I'm not 
>> sure how to isolate it, since journalctl --verify seems to think the files 
>> are OK.
>> 
>> Despite autodefrag being enabled, the system journal has ~1450 fragments 
>> according to filefrag. If I boot from a rescue image so nothing is actively 
>> writing to the journal files, recopy them so that each is a single extent, 
>> and then reboot the system, the --list-boots output changes. It actually 
>> goes down, from 45 boots to 31, and stays stuck there across further 
>> reboots, whereas _BOOT_ID continues to increment as before. So just by 
>> copying the log files I get different --list-boots behavior. That's pretty 
>> strange.
>> 
>> Yes, and of course btrfs check and scrub find no problems with either the 
>> Fedora 20 or the Fedora Rawhide Btrfs file system this is happening on.
> 
> This is certainly weird... Copying these files to ext4, does that change
> anything?

Each time I copy the files, no matter the destination file system, I get a 
different number of --list-boots results, yet every time the source and 
destination sha256sums for the journal files match. It makes no sense: same 
sha256sums, different --list-boots behavior. It seems like non-deterministic 
hell. Once the journals are affected by whatever is happening, they're 
permanently weird.
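
Roughly what each round looks like, as a sketch only (the destination 
/root/journal-copy is just an example path, and -D points journalctl at a 
specific journal directory):

# sha256sum /var/log/journal/<machineid>/*.journal
# mkdir /root/journal-copy
# cp /var/log/journal/<machineid>/*.journal /root/journal-copy/
# sha256sum /root/journal-copy/*.journal
# journalctl -D /var/log/journal/<machineid> --list-boots | wc -l
# journalctl -D /root/journal-copy --list-boots | wc -l

The two sha256sum runs always print identical hashes, but the --list-boots 
count doesn't come out the same from one round to the next.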

I think the state is sufficiently bizarre, with a huge pile of Btrfs patches 
having landed in the 3.14 time frame (both systems ran 3.14 and 3.15 rc 
kernels), that this just isn't worth looking at further. I'll keep the old 
journals elsewhere, but I'm going to clear out /var/log/journal, let journald 
start new ones from scratch, and see whether the problem recurs.


> 
> I wonder if this might have to do something with time changes? Maybe the
> clock was not set correctly?

It's a VM on a laptop host. Time is always correct.

> 
> Maybe it is a perms issue with some files not accessible to some users?

They're all throwaway test systems and I'm lazy, so I only use root.

Chris Murphy


