Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-12-09 Thread Sebastian Dalfuß

The patch seems to be working.
My uptime records survived several crashes and fsck runs (on ext2).






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-12-09 Thread Thibaut VARENE
On Tue, Dec 9, 2008 at 11:16 AM, Sebastian Dalfuß [EMAIL PROTECTED] wrote:

 The patch seems to be working.
 My uptime records survived several crashes and fsck runs (on ext2).

Cool, good to know!

Radek, care to integrate it? :)

Cheers

-- 
Thibaut VARENE
http://www.parisc-linux.org/~varenet/


Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-10-11 Thread Thibaut VARENE
On Mon, Oct 6, 2008 at 6:38 PM, Raphael Manfredi
[EMAIL PROTECTED] wrote:
 Quoting Thibaut VARENE:
 : I'm thinking about something broader:
 : - keep the .old no matter what on each save
 : - when starting, if there's a .old and no records then use the .old
 : and print an alert message.

 This is really required.  After a power failure, my machine rebooted and
 fsck did the following:

I eventually took the time to cook a patch. If somebody can test it
and confirm it improves the situation, it'd be cool (I don't really
feel like crashing my machine on purpose :-)

It's a quick and basic implementation of the previous discussion: a
failsafe .old backup of the db will be kept at all times.

Radek, any opinion on this?

HTH

diff -u libuptimed/urec.c.orig libuptimed/urec.c
--- libuptimed/urec.c.orig  2007-10-31 11:23:38.0 +0100
+++ libuptimed/urec.c   2008-10-11 17:53:43.0 +0200
@@ -202,8 +202,14 @@
 	char buf[256], sys[SYSMAX+1];
 
 	f=fopen(FILE_RECORDS, "r");
-	if (!f) return;
-	
+	if (!f) {
+		f=fopen(FILE_RECORDS".old", "r");
+		if (!f)
+			return;
+		else
+			printf("uptimed: reading from backup database %s.old\n", FILE_RECORDS);
+	}
+
 	fgets(str, sizeof(str), f);
 	while (!feof(f)) {
 		/* Check for validity of input string. */
@@ -242,6 +248,7 @@
 		}
 	}
 	fclose(f);
+	rename(FILE_RECORDS, FILE_RECORDS".old");
 	rename(FILE_RECORDS".tmp", FILE_RECORDS);
 }






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-10-09 Thread Christoph Martin
Hi,

to second this bug report:

I have uptimed running on about 50 machines, and the described problem
appeared on a bunch of them. I lost a lot of uptimes there. A fix would
be much appreciated.

Christoph
-- 

Christoph Martin, EDV der Verwaltung, Uni-Mainz, Germany
 Internet-Mail:  [EMAIL PROTECTED]
  Telefon: +49-6131-3926337





Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-10-06 Thread Raphael Manfredi
Quoting Thibaut VARENE:
: I'm thinking about something broader:
: - keep the .old no matter what on each save
: - when starting, if there's a .old and no records then use the .old
: and print an alert message.

This is really required.  After a power failure, my machine rebooted and
fsck did the following:

/dev/md2: Entry 'records' in /spool/uptimed (58531) has deleted/unused inode 
58541.  CLEARED.

I just lost all my uptime records...

Raphael






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-05-24 Thread Raphael Manfredi
Package: uptimed
Version: 1:0.3.12-2
Severity: important

Apparently, uptimed is not updating its /var/spool/uptimed/records file
in a safe way: after a system crash, I have seen on many occasions fsck
clearing the corresponding inode and losing all the uptime records.

The update procedure should be something like:

rename /var/spool/uptimed/records as /var/spool/uptimed/records.old
write new data to /var/spool/uptimed/records.new
fsync()
rename /var/spool/uptimed/records.new as /var/spool/uptimed/records
unlink /var/spool/uptimed/records.old
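
A minimal C sketch of that procedure might look as follows (the function
name, arguments and error handling are illustrative only, not actual
uptimed code):

#include <stdio.h>
#include <unistd.h>

#define DB     "/var/spool/uptimed/records"
#define DB_OLD DB ".old"
#define DB_NEW DB ".new"

/* Illustrative sketch of the update procedure described above. */
static int save_records_safely(const char *data, size_t len)
{
	FILE *f;

	/* 1. Keep the current database as a backup. */
	if (rename(DB, DB_OLD) != 0)
		return -1;

	/* 2. Write the new data to a scratch file. */
	f = fopen(DB_NEW, "w");
	if (!f)
		return -1;
	fwrite(data, 1, len, f);

	/* 3. Make sure the data is on disk before renaming. */
	fflush(f);
	fsync(fileno(f));
	fclose(f);

	/* 4. Atomically put the new database in place. */
	if (rename(DB_NEW, DB) != 0)
		return -1;

	/* 5. Only now is it safe to drop the backup. */
	unlink(DB_OLD);
	return 0;
}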

And upon startup, if a /var/spool/uptimed/records.old file is present
it should be used as the main database because it means the above procedure
was somehow interrupted by a crash.

To minimise disruption, I've increased the interval between database saves
to 600 seconds on my systems.  Note that my machines do not crash frequently,
but when they do, it is usually because one of the IDE disks loses an interrupt
and Linux then hangs, requiring a reboot (the software watchdog is of no help
here; the hang seems to be at the kernel level).

I'm flagging this bug as important because it is somehow defeating the
purpose of having a tool record uptimes if a sudden crash causes years of
history to get trashed. (Data is not valuable enough to go through a tape
restore).

-- System Information:
Debian Release: testing/unstable
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 2.6.25.4 (SMP w/2 CPU cores)
Locale: LANG=fr_FR, LC_CTYPE=fr_FR (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages uptimed depends on:
ii  debconf [debconf-2.0]  1.5.22Debian configuration management sy
ii  libc6  2.7-2 GNU C Library: Shared libraries
ii  libuptimed01:0.3.11-1+b1 Library for uptimed

uptimed recommends no packages.

-- debconf information:
* uptimed/mail/do_mail: Both
* uptimed/mail/address: root
* uptimed/mail/milestones_info:
* uptimed/maxrecords: 50
* uptimed/interval: 600






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-05-24 Thread Thibaut VARENE
tags 482643 moreinfo unreproducible
thanks

On Sat, May 24, 2008 at 9:57 AM, Raphael Manfredi
[EMAIL PROTECTED] wrote:

 Apparently, uptimed is not updating its /var/spool/uptimed/records file
 in a safe way: after a system crash, I have seen on many occasions fsck
 clearing the corresponding inode and losing all the uptime records.

 The update procedure should be something like:

rename /var/spool/uptimed/records as /var/spool/uptimed/records.old
write new data to /var/spool/uptimed/records.new
fsync()
rename /var/spool/uptimed/records.new as /var/spool/uptimed/records
unlink /var/spool/uptimed/records.old

It's pretty much what is done, save for the fsync():

void save_records(int max, time_t log_threshold) {
	FILE *f;
	Urec *u;
	int i = 0;

	f = fopen(FILE_RECORDS".tmp", "w");
	if (!f) {
		printf("uptimed: cannot write to %s\n", FILE_RECORDS);
		return;
	}

	for (u = urec_list; u; u = u->next) {
		/* Ignore everything below the threshold */
		if (u->utime >= log_threshold) {
			fprintf(f, "%lu:%lu:%s\n", (unsigned long)u->utime,
				(unsigned long)u->btime, u->sys);
			/* Stop processing when we've logged the max
			   number specified. */
			if ((max > 0) && (++i >= max)) break;
		}
	}
	fclose(f);
	rename(FILE_RECORDS".tmp", FILE_RECORDS);
}


In my opinion, you don't want a process running fsync() every 60s anyway,
certainly not a non-critical process such as uptimed.
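
If some durability were still wanted without paying that cost on every
save, one possible compromise (purely illustrative, not uptimed code)
would be to flush on every save but fsync() only every Nth one:

#include <stdio.h>
#include <unistd.h>

/* Illustrative: flush stdio buffers on every save, but only pay the
 * cost of a full fsync() once every SYNC_EVERY saves. */
#define SYNC_EVERY 10

static void maybe_sync(FILE *f)
{
	static int saves;

	fflush(f);                 /* push stdio buffers to the kernel */
	if (++saves % SYNC_EVERY == 0)
		fsync(fileno(f));  /* force kernel buffers to the media */
}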

 And upon startup, if a /var/spool/uptimed/records.old file is present
 it should be used as the main database because it means the above procedure
 was somehow interrupted by a crash.

Keeping the old records db as a fallback is an idea worth
investigating, indeed. Cc'ing upstream maintainer.

 To minimise disruption, I've increased the interval between database saves
 to 600 seconds on my systems.  Note that my machines do not crash frequently,
 but when they do, it is usually because one of the IDE disks loses an
 interrupt and Linux then hangs, requiring a reboot (the software watchdog
 is of no help here; the hang seems to be at the kernel level).

Yeah, I've been planning to increase the default save interval in a future
upload. 60s is insane.

 I'm flagging this bug as important because it is somehow defeating the
 purpose of having a tool record uptimes if a sudden crash causes years of
 history to get trashed. (Data is not valuable enough to go through a tape
 restore).

Agreed. What filesystem are you using on /var?

Thanks

T-Bone


-- 
Thibaut VARENE
http://www.parisc-linux.org/~varenet/






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-05-24 Thread Raphael Manfredi
Hello Thibaut,

Quoting Thibaut VARENE:
: [EMAIL PROTECTED] wrote:
:  The update procedure should be something like:
: 
: rename /var/spool/uptimed/records as /var/spool/uptimed/records.old
: write new data to /var/spool/uptimed/records.new
: fsync()
: rename /var/spool/uptimed/records.new as /var/spool/uptimed/records
: unlink /var/spool/uptimed/records.old
: 
: It's pretty much what is done, save for the fsync():

Thanks for quoting the code.  But that's not really what it's doing:
it writes to a uprecords.tmp and then renames that file to uprecords.
That's why there can be a dangling uprecords file that gets put in
lost+found once in a while...

My logic above first renames the old DB, then writes the new records, and
only after the new file has been renamed is the old file unlinked.

: In my opinion, you don't want a process running fsync() every 60s,
: anyway. At least certainly not a subcritical process such as uptimed.

I concur. The fsync() is optional and not really needed on a journaling
filesystem.  Otherwise, fsync() is kind of needed to make sure the buffer
cache is flushed properly.  Naturally, as you noticed, the 60s is an
insane period for syncing! 600s should be the default (nobody will care
if the latest uptime misses 10 minutes!).  The DB should be flushed on
shutdown when uptimed exits anyway.

: Keeping the old records db as a fallback is an idea worth
: investigating, indeed. Cc'ing upstream maintainer.

What's important is that IF that .old file exists upon startup, then it means
something went wrong and therefore it should be used instead of the possibly
corrupted/missing uprecords file.

:  I'm flagging this bug as important because it is somehow defeating the
:  purpose of having a tool record uptimes if a sudden crash causes years of
:  history to get trashed.

: Agreed. What filesystem are you using on /var?

I use ext2 there.  I should probably convert it to ext3 but up to now,
I've kept using ext2 on /, /usr and /var to make sure they are fsck-ed
every time if not unmounted cleanly.  I reserve ext3 for data partitions
such as /home.  This decision was made a long time ago when ext3 was still
young.

Raphael






Bug#482643: unsafe update of /var/spool/uptimed/records causes file losses

2008-05-24 Thread Thibaut VARENE
On Sat, May 24, 2008 at 3:30 PM, Raphael Manfredi
[EMAIL PROTECTED] wrote:
 Hello Thibaut,

Hi,

 Thanks for quoting the code.  But that's not really what it's doing:
 it writes to a uprecords.tmp and then renames that file to uprecords.
 That's why there can be a dangling uprecords file that gets put in
 lost+found once in a while...

Yeah I see what you mean.

 My logic above first renames the old DB, then writes the new records, and
 only after the new file has been renamed is the old file unlinked.

There I don't follow. If your records db gets trashed because it hasn't
been flushed to the non-volatile media yet, then moving the old db, writing
the new one, moving the new one and then unlinking the old one can also get
foobar'd if you crash at the wrong time, e.g. immediately after everything
has been done:

the removal of the old one is done (inode marked clear), yet the new
data hasn't hit the media yet. You lose both files.

I'd keep the old database anyway; it doesn't hurt and provides a good
fallback in case something goes awry.

 I concur. The fsync() is optional and not really needed on a journaling
 filesystem.  Otherwise, fsync() is kind of needed to make sure the buffer

If you really want to make sure that the data is safe, fsync() is
needed on a journaling fs too. A common misconception is to believe
that a journaling fs takes care of /data/. It doesn't. The only thing
it takes care of is /metadata/, i.e. making sure that there's no need to
run fsck to get the filesystem back into a consistent state. These are
two completely different things.
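
For illustration (this is not uptimed code): making a rename-based update
fully durable, even on a journaling fs, means syncing both the new file's
data and the directory entry itself:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative sketch: the journal keeps metadata consistent, but only
 * fsync() guarantees the data has hit the media. For a durable rename,
 * sync the file, rename it, then sync the containing directory. */
static int durable_replace(const char *tmp, const char *final,
                           const char *dir)
{
	int fd;

	/* Force the new file's contents to the media. */
	fd = open(tmp, O_RDONLY);
	if (fd < 0)
		return -1;
	fsync(fd);
	close(fd);

	/* Atomically swap it into place. */
	if (rename(tmp, final) != 0)
		return -1;

	/* Force the updated directory entry to the media. */
	fd = open(dir, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;
	fsync(fd);
	close(fd);
	return 0;
}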

 cache is flushed properly.  Naturally, as you noticed, the 60s is an
 insane period for syncing! 600s should be the default (nobody will care
 if the latest uptime misses 10 minutes!).  The DB should be flushed on
 shutdown when uptimed exits anyway.

It is. See handler() in src/uptimed.c

 : Keeping the old records db as a fallback is an idea worth
 : investigating, indeed. Cc'ing upstream maintainer.

 What's important is that IF that .old file exists upon startup, then it means
 something went wrong and therefore it should be used instead of the possibly
 corrupted/missing uprecords file.

I'm thinking about something broader:
- keep the .old no matter what on each save
- when starting, if there's a .old and no records then use the .old
and print an alert message.

HTH

T-Bone

-- 
Thibaut VARENE
http://www.parisc-linux.org/~varenet/


