Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
The patch seems to be working. My uptime records survived several
crashes / fsck runs. (on ext2)

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
On Tue, Dec 9, 2008 at 11:16 AM, Sebastian Dalfuß [EMAIL PROTECTED] wrote:
> The patch seems to be working. My uptime records survived several
> crashes / fsck runs. (on ext2)

Cool, good to know!

Radek, care to integrate it? :)

Cheers

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
On Mon, Oct 6, 2008 at 6:38 PM, Raphael Manfredi [EMAIL PROTECTED] wrote:
> Quoting Thibaut VARENE:
> : I'm thinking about something broader:
> : - keep the .old no matter what on each save
> : - when starting, if there's a .old and no records then use the .old
> :   and print an alert message.
>
> This is really required. After a power failure, my machine rebooted
> and fsck did the following:

I eventually took the time to cook a patch. If somebody can test it and
confirm it improves the situation, it'd be cool (I don't really feel
like crashing my machine on purpose :-)

It's a quick and basic implementation of the previous discussion: a
failsafe .old backup of the db will be kept at all times.

Radek, any opinion on this?

HTH

diff -u libuptimed/urec.c.orig libuptimed/urec.c
--- libuptimed/urec.c.orig	2007-10-31 11:23:38.0 +0100
+++ libuptimed/urec.c	2008-10-11 17:53:43.0 +0200
@@ -202,8 +202,14 @@
 	char buf[256], sys[SYSMAX+1];

 	f=fopen(FILE_RECORDS, "r");
-	if (!f) return;
-
+	if (!f) {
+		f=fopen(FILE_RECORDS".old", "r");
+		if (!f)
+			return;
+		else
+			printf("uptimed: reading from backup database %s.old\n", FILE_RECORDS);
+	}
+
 	fgets(str, sizeof(str), f);
 	while (!feof(f)) {
 		/* Check for validity of input string. */
@@ -242,6 +248,7 @@
 		}
 	}
 	fclose(f);
+	rename(FILE_RECORDS, FILE_RECORDS".old");
 	rename(FILE_RECORDS".tmp", FILE_RECORDS);
 }
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
Hi,

to second this bug report: I have uptimed running on about 50 machines
and the described problem appeared on a bunch of them. I lost a lot of
uptimes there. A fix would be very much appreciated.

Christoph

--
Christoph Martin, EDV der Verwaltung, Uni-Mainz, Germany
Internet-Mail: [EMAIL PROTECTED]
Telefon: +49-6131-3926337
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
Quoting Thibaut VARENE:
: I'm thinking about something broader:
: - keep the .old no matter what on each save
: - when starting, if there's a .old and no records then use the .old
:   and print an alert message.

This is really required. After a power failure, my machine rebooted and
fsck did the following:

/dev/md2: Entry 'records' in /spool/uptimed (58531) has deleted/unused inode 58541. CLEARED.

I just lost all my uptime records...

Raphael
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
Package: uptimed
Version: 1:0.3.12-2
Severity: important

Apparently, uptimed is not updating its /var/spool/uptimed/records file
in a safe way: after a system crash, I have seen on many occasions fsck
clearing the corresponding inode and losing all the uptime records.

The update procedure should be something like:

  rename /var/spool/uptimed/records as /var/spool/uptimed/records.old
  write new data to /var/spool/uptimed/records.new
  fsync()
  rename /var/spool/uptimed/records.new as /var/spool/uptimed/records
  unlink /var/spool/uptimed/records.old

And upon startup, if a /var/spool/uptimed/records.old file is present,
it should be used as the main database because it means the above
procedure was somehow interrupted by a crash.

To minimise disruption, I've increased the frequency of database
savings to 600 seconds on my systems. Note that my machines do not
crash frequently, but when they do, it is usually because one of the
IDE disks loses an interrupt and linux then hangs, requiring a reboot
(the software watchdog is of no help here; the hangup seems to be at
the kernel level).

I'm flagging this bug as important because it somehow defeats the
purpose of having a tool record uptimes if a sudden crash causes years
of history to get trashed. (Data is not valuable enough to go through a
tape restore.)

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.25.4 (SMP w/2 CPU cores)
Locale: LANG=fr_FR, LC_CTYPE=fr_FR (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages uptimed depends on:
ii  debconf [debconf-2.0]  1.5.22         Debian configuration management sy
ii  libc6                  2.7-2          GNU C Library: Shared libraries
ii  libuptimed0            1:0.3.11-1+b1  Library for uptimed

uptimed recommends no packages.
-- debconf information:
* uptimed/mail/do_mail: Both
* uptimed/mail/address: root
* uptimed/mail/milestones_info:
* uptimed/maxrecords: 50
* uptimed/interval: 600
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
tags 482643 moreinfo unreproducible
thanks

On Sat, May 24, 2008 at 9:57 AM, Raphael Manfredi [EMAIL PROTECTED] wrote:
> Apparently, uptimed is not updating its /var/spool/uptimed/records file
> in a safe way: after a system crash, I have seen on many occasions fsck
> clearing the corresponding inode and losing all the uptime records.
>
> The update procedure should be something like:
>
>   rename /var/spool/uptimed/records as /var/spool/uptimed/records.old
>   write new data to /var/spool/uptimed/records.new
>   fsync()
>   rename /var/spool/uptimed/records.new as /var/spool/uptimed/records
>   unlink /var/spool/uptimed/records.old

It's pretty much what is done, save for the fsync():

void save_records(int max, time_t log_threshold)
{
	FILE *f;
	Urec *u;
	int i = 0;

	f = fopen(FILE_RECORDS".tmp", "w");
	if (!f) {
		printf("uptimed: cannot write to %s\n", FILE_RECORDS);
		return;
	}

	for (u = urec_list; u; u = u->next) {
		/* Ignore everything below the threshold */
		if (u->utime >= log_threshold) {
			fprintf(f, "%lu:%lu:%s\n", (unsigned long)u->utime,
				(unsigned long)u->btime, u->sys);
			/* Stop processing when we've logged the max number specified. */
			if ((max > 0) && (++i >= max))
				break;
		}
	}
	fclose(f);
	rename(FILE_RECORDS".tmp", FILE_RECORDS);
}

In my opinion, you don't want a process running fsync() every 60s,
anyway. At least certainly not a subcritical process such as uptimed.

> And upon startup, if a /var/spool/uptimed/records.old file is present,
> it should be used as the main database because it means the above
> procedure was somehow interrupted by a crash.

Keeping the old records db as a fallback is an idea worth
investigating, indeed. Cc'ing upstream maintainer.

> To minimise disruption, I've increased the frequency of database
> savings to 600 seconds on my systems. Note that my machines do not
> crash frequently, but when they do, it is usually because one of the
> IDE disks loses an interrupt and linux then hangs, requiring a reboot
> (the software watchdog is of no help here; the hangup seems to be at
> the kernel level).

Yeah, I've been planning to increase the default frequency in a future
upload. 60s is insane.

> I'm flagging this bug as important because it somehow defeats the
> purpose of having a tool record uptimes if a sudden crash causes years
> of history to get trashed. (Data is not valuable enough to go through
> a tape restore.)

Agreed. What filesystem are you using on /var?

Thanks

T-Bone

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
Hello Thibaut,

Quoting Thibaut VARENE:
: It's pretty much what is done, save for the fsync():

Thanks for quoting the code. But it's not really what it's doing: it's
writing to a uprecords.tmp and then renames that file to uprecords.
That's why there can be a dangling uprecords file that gets put in
lost+found once in a while... My logic above first renames the old DB,
writes a new record, and only after it has been renamed will the old
file be unlinked.

: In my opinion, you don't want a process running fsync() every 60s,
: anyway. At least certainly not a subcritical process such as uptimed.

I concur. The fsync() is optional and not really needed on a journaling
filesystem. Otherwise, fsync() is kind of needed to make sure the
buffer cache is flushed properly. Naturally, as you noticed, 60s is an
insane period for syncing! 600s should be the default (nobody will care
if the latest uptime misses 10 minutes!). The DB should be flushed on
shutdown when uptimed exits anyway.

: Keeping the old records db as a fallback is an idea worth
: investigating, indeed. Cc'ing upstream maintainer.

What's important is that IF that .old file exists upon startup, then it
means something went wrong and therefore it should be used instead of
the possibly corrupted/missing uprecords file.

: Agreed. What filesystem are you using on /var?

I use ext2 there. I should probably convert it to ext3, but up to now
I've kept using ext2 on /, /usr and /var to make sure they are fsck-ed
every time if not unmounted cleanly. I reserve ext3 for data partitions
such as /home. This decision was made a long time ago, when ext3 was
still young.

Raphael
Bug#482643: unsafe update of /var/spool/uptimed/records cause file losses
On Sat, May 24, 2008 at 3:30 PM, Raphael Manfredi [EMAIL PROTECTED] wrote:
> Hello Thibaut,

Hi,

> Thanks for quoting the code. But it's not really what it's doing: it's
> writing to a uprecords.tmp and then renames that file to uprecords.
> That's why there can be a dangling uprecords file that gets put in
> lost+found once in a while...

Yeah, I see what you mean.

> My logic above first renames the old DB, writes a new record, and only
> after it has been renamed will the old file be unlinked.

There I don't follow. If your record db gets trashed because it hasn't
been flushed yet to the non-volatile media, then moving the old db,
writing the new one, moving the new one and finally unlinking the old
one could also get foobar'd if you crash at the wrong time, e.g.
immediately after everything has been done: the removal of the old one
is done (inode marked clear), yet the new data hasn't hit the media
yet. You lose both files. I'd keep the old database anyway; it doesn't
hurt and provides a good fallback in case something went awry.

> I concur. The fsync() is optional and not really needed on a
> journaling filesystem. Otherwise, fsync() is kind of needed to make
> sure the buffer cache is flushed properly.

If you really want to make sure that the data is safe, fsync() is
needed on a journaling fs too. A common misconception is to believe
that a journaling fs takes care of /data/. It doesn't. The only thing
it takes care of is /metadata/, i.e. making sure that there's no need
to run fsck to get the filesystem back into a consistent state. These
are two completely different things.

> Naturally, as you noticed, 60s is an insane period for syncing! 600s
> should be the default (nobody will care if the latest uptime misses
> 10 minutes!). The DB should be flushed on shutdown when uptimed exits
> anyway.

It is. See handler() in src/uptimed.c

> : Keeping the old records db as a fallback is an idea worth
> : investigating, indeed. Cc'ing upstream maintainer.
>
> What's important is that IF that .old file exists upon startup, then
> it means something went wrong and therefore it should be used instead
> of the possibly corrupted/missing uprecords file.

I'm thinking about something broader:
- keep the .old no matter what on each save
- when starting, if there's a .old and no records then use the .old
  and print an alert message.

HTH

T-Bone

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/