On Jul 30, 2009 08:29 +0200, Jakob Goldbach wrote:
> I have a question on bug 9646 - Server went back in time.
>
> I had an OSS crash and had to pull the power. After mounting Lustre
> again I see the following on one of my clients:
>
> (import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in
> time (transno 12901362807 was previously committed, server now claims
> 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> This bug description suggests that there are commits lost in hardware
> cache - but how can it lose all commits (transno is zero)? (btw, cache
> is battery backed up)
>
> On the client where I saw this I had previously deactivated the import
> because of the crash. Is this the reason I'm seeing this transno as
> zero? (full dmesg below)

The error message is a bit misleading. Bug 9646 describes the situation
where the last_committed transaction number rolled back to some previous
non-zero value, which indicates a serious problem in the storage. In this
case the client is not getting a complete reply and is looking at a
last_committed value that was never properly filled in.

That should probably be updated in the manual. It should probably also be
filed as a bug: no check should be done if the reply was an error, and no
message should be printed on the console.

> 2860:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 27s
>
> setting import backup-OST0010_UUID INACTIVE by administrator request
>
> 8281:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 32s
>
> 167-0: This client was evicted by b-OST0010; in progress operations
> using this service will fail.
>
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14...@tcp.
>
> 11-0: an error occurred while communicating with 172.16.14...@tcp. The
> ost_statfs operation failed with -11
> ...
> 11-0: an error occurred while communicating with 172.16.14...@tcp. The
> obd_ping operation failed with -107
>
> b-OST0010-osc-ffff81022ce89800: Connection to service backup-OST0010 via
> nid 172.16.14...@tcp was lost; in progress operations using this service
> will wait for recovery to complete.
>
> 2859:0:(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went
> back in time (transno 12901362807 was previously committed, server now
> claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> 167-0: This client was evicted by backup-OST0010; in progress operations
> using this service will fail.
>
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14...@tcp.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
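
For illustration, the change suggested in the reply (skip the "went back
in time" check when the connect reply was an error) could look roughly
like the sketch below. This is not Lustre code: it is a minimal Python
model, and ConnectReply, its fields, and check_went_back_in_time are all
hypothetical names invented for this example. It only assumes that an
error reply carries a last_committed value that was never filled in.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectReply:
    """Hypothetical stand-in for a connect reply; not a Lustre structure."""
    status: int          # 0 on success, negative errno on failure
    last_committed: int  # only meaningful when status == 0


def check_went_back_in_time(peer_committed: int,
                            reply: ConnectReply) -> Optional[str]:
    """Return a warning if the server's committed transno regressed.

    peer_committed is the highest transno the client previously saw
    reported as committed by this server.
    """
    if reply.status != 0:
        # Error reply: last_committed was never properly filled in,
        # so comparing it would produce a misleading message.
        return None
    if reply.last_committed < peer_committed:
        return (f"went back in time (transno {peer_committed} was "
                f"previously committed, server now claims "
                f"{reply.last_committed})")
    return None
```

With this guard, a failed reply such as ConnectReply(status=-107,
last_committed=0) produces no console message, while a successful reply
reporting a lower transno than before still triggers the warning that
bug 9646 is actually about.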
