RE: [OF-users] OOM crash

Dave Johnson Sun, 04 Feb 2007 10:50:49 -0800

I don't think I explained this well enough, but I'm fairly confident that the 
"safe_strcpy_fn" error is a red herring.  That single strcpy error may be 
another issue that can occur under low-memory conditions, but I don't 
understand how it could be related.  To clarify what I was doing:


-Attempt to copy 800GB from an OF system to another OF system via CIFS from an 
intermediate machine using Robocopy
-Target system is configured to take snapshots at the 0/24 hr
-Snapshots configured to allow up to 500GB of the 1TB vol
-Old OF system has a 100Mbit link, so the migration was expected to take some time
-Began copy mid evening
-Checked in about 36hrs later and the system running the robocopy process 
shows very little network IO (1-2 MB/s total in/out, but continuing to 
copy data)
-Robocopy log shows no unusual errors
-Checked source OF system and it was functioning normally - access to files 
was quick from the robocopy management station
-Checked target system and share was very slow to respond
-Navigated to Netbios name of target system and it showed 2 shares, my "usr" 
share and the snapshot share as mentioned below
-I navigate to the snapshot share from the robocopy management station and it 
responds several minutes later
-I attempt to copy a small file from the snapshot share to the local desktop 
and it doesn't appear to respond
-SSH into target system and there is obvious sluggishness; box is under heavy 
"disk wait" with load at 6+
-Attempt to run "ps auxww > process.log" from SSH session.
-Navigation to target system "usr" share from robocopy management station 
finally returns within Windows Explorer after 3-5 minutes
-I attempt to copy a small file into "usr" share
-The SSH session with the target system never returns from the ps command
-An error appears on-screen on the robocopy management station: the error from 
my previous email (apparently truncated from the message below; I will try to 
find it if needed, but it was something to the effect of "name too long") - 
CHANGE: given the 1-2 minute delay in everything happening on the system, I 
now suspect that the error came from my earlier attempt to navigate to the 
snapshot share and copy a file from there to the robocopy management station, 
which finally returned with the "name too long" error several minutes later. 
I suspect that by this point the box was in fact dead, and that the error was 
not due to my attempt to copy the small file to the "usr" share
-I connect a monitor to the target system and the out of memory (OOM) killer 
output is spewing to the screen about once every second or two
-System is non-responsive to any input and disks are no longer being accessed 
by indication of HD activity lights
-I hard reset the system and when it comes back up it appears on the network 
and I can copy test files to/from it
-Check messages log and it is filled with LDAP errors, and around the time of 
the final failure there was that strcpy error
-Emailed this list
-Decided to see if it was just a fluke so without clearing out the data on the 
target volume I started the robocopy session again with the /PURGE switch 
(robocopy   /S /E /COPY:DAT /DCOPY:T /NP /PURGE > migration.log) and opened an 
SSH session to monitor it.
-As before, the network traffic on the robocopy management system was solidly 
saturating the network at 80%/80% of its incoming/outgoing rate of 100Mbit
-Several hours later I check in, and I think I was lucky to catch the system 
nearer the beginning of this failure scenario... 
-"top" command on SSH session shows box now under load of around 4 with IOWait 
maxed out
-After a few minutes, IOWait would drop to normal rates and smbd would then run 
at around 12-15% CPU
-After a few minutes or less, IOWait would instantly rise to maximum again and 
all other processing load would drop to zero
-Checking back at the robocopy station, this "on/off" copy scenario is evident 
in the simple Windows Task Manager graph of the network IO... dropping to zero 
for several minutes, then returning to the typical saturated 80Mbit for a 
short time
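
The on/off pattern described above could also be captured as numbers on the 
target itself rather than read off a graph. A minimal sketch in Python, 
assuming the target exposes the standard Linux /proc/stat file and has a 
Python interpreter available (neither is confirmed in this thread); the 
function names and the 5-second interval are made up for illustration:

```python
import time


def read_cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    # Field order: user nice system idle iowait irq softirq steal ...
    # Only the first eight are summed; guest time is already counted in user.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0 or len(deltas) < 5:
        return 0.0
    return 100.0 * deltas[4] / total


def log_iowait(samples, interval=5.0, path="/proc/stat"):
    """Print one timestamped iowait percentage per interval (hypothetical helper)."""
    with open(path) as handle:
        previous = read_cpu_times(handle.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as handle:
            current = read_cpu_times(handle.read())
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "iowait %.1f%%" % iowait_percent(previous, current))
        previous = current
```

Calling log_iowait(720) would cover about an hour at the default interval. 
Since the ps command never returned once the box was struggling, it is 
probably safer to start something like this before the copy begins than to 
launch it after the stalls appear.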

You can see the result a bit more clearly in these screenshots here:

http://www.se30.com/users/grindingbassline/pix/storage
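
For scale, the figures in the list above give a rough idea of how long the 
copy should have taken had the link stayed saturated. A back-of-the-envelope 
sketch, assuming decimal units (1 GB = 10^9 bytes) and that the 80% rate held 
for the whole run; the function name is only illustrative:

```python
def transfer_hours(size_gb, link_mbit, utilization):
    """Hours needed to move size_gb over a link_mbit link at the given utilization."""
    size_bits = size_gb * 1e9 * 8                     # decimal gigabytes to bits
    rate_bits_per_s = link_mbit * 1e6 * utilization   # usable bits per second
    return size_bits / rate_bits_per_s / 3600


# 800 GB over a 100Mbit link at 80% utilization
print(round(transfer_hours(800, 100, 0.8), 1))   # -> 22.2
```

On those assumptions the whole 800GB would take roughly 22 hours, so a copy 
still running at 1-2 MB/s after 36 hours is well behind what the link alone 
would explain.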

Since you certainly understand your system better than I do, I'll try renaming 
the snapshot and let you know the result, but I doubt it will have any impact, 
since it has nothing to do with what I'm doing beyond its mere existence on 
the system.

-=dave




----------------------------------------
> Date: Sat, 3 Feb 2007 13:35:14 +0000
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> CC: [email protected]
> Subject: Re: [OF-users] OOM crash
> 
> dave johnson wrote:
> > I checked the messages and found an assortment of ldap secrets errors 
> > (not sure about that) but the only entry I found that seemed suspect was:
> >  
> > Feb  1 22:28:38 hopper smbd[5714]: [2007/02/01 22:28:38, 0] 
> > lib/util_str.c:safe_strcpy_fn(603)
> > Feb  1 22:28:41 hopper smbd[5714]:   ERROR: string overflow by 1 (24 - 
> > 23) in safe_strcpy [snapshots.vg0.vol0.sched0.usr 2007-
> > 01-31 00.00.06
> 
> Rename that share in smb.conf to just [snapshots-] and 
> try again.
> 
> R.
> 
> >  
> > bringing back up, snapshot report shows:
> >  
> > Snapshot name:              sched0
> > Date/time taken:            January 31, 2007 00:00:06
> > Block utilization (in MB):  181030
> > Snapshot size (in MB):      524288
> > Share contents:             Yes, do
> > Save:                       N/A
> > Delete snapshot:            N/A
> >
> >  
> >  
> > want me to try again with a "ps auxww" process logging all output ?
> >  
> > -=dave
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > Openfiler-users mailing list
> > [email protected]
> > https://lists.openfiler.com/mailman/listinfo/openfiler-users
> >   
> 
