You might want to sign up for an account to read the comments (not sure if they are really helpful), but in the problem description, the person mentions stopping the DFS Service to stabilize the box.
"Over the last couple of months our Poweredge server would hang the only response we would get from it was a ping we would have to give it a cold start. We disabled the replication to the second dfs server but this didnt help. We have now stopped the dfs service and disabled it on the box (dfs1) for the last two days and it has been stable."

It could still be unrelated to what you're seeing though. If stopping replication or DFS solves the problem, I'd be on the horn to PSS (and maybe sooner if there are still no leads).

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 06, 2008 11:10 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Sorry I didn't make that clear; when this started we were really thinking it was a firewall problem and it morphed over to a server problem rather slowly.

The DFS Replication logs show an error every few weeks about a file that cannot be replicated due to consistent sharing violations, but normally all I see are the informational 'a file was changed on multiple servers and a conflict resolution algorithm was used to determine the winning file.' The date/time on the sharing violations do not match anywhere close to the date/time of the current outages we are seeing. We have gone over each documented outage time and looked through all the log files for anything close to the outages and found nothing recorded within five minutes of any outage.

I am going to have DFS Replication turned off by Monday. Bonnie, certainly you're saying 'DFS Replication' had to be turned off, not 'DFS Namespace' entirely???

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 06, 2008 11:42 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Although you mentioned DFS, this is the first mention I've seen of replication--that could be causing an obscure problem, and it does usually happen on a schedule like what you're seeing. This sounds a lot like what you are talking about:

http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Server/2003_Server/Q_22791394.html

Looks like s/he had to disable the DFS Service altogether to get the problem to quit. Are you seeing anything in the DFS Replication event logs? I wonder if there's a way to turn up the logging on the service...

-Bonnie

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 06, 2008 4:59 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Thanks for playing, yes we upgraded the SATA HD firmware as well. In all we had two updates that required an external boot and a manual install process at a DOS prompt; they each went smoothly.

If you've been playing along, thanks Bonnie, you may remember I've got two PE2950s that are both file servers, nothing else. They are each running Windows Server 2003 R2, sharing files via MS DFS and using DFS Replication (the new R2 version, not the older File Replication Service) to keep the files in sync, as well as file quotas using File Server Resource Manager (FSRM). Virtually nothing else is running on these, except of course Symantec Antivirus Corporate Edition 10.1.5.5010 with tamper protection turned off, as we have seen problems with tamper protection in prior versions. As part of our diagnostics we did disable Symantec Antivirus for several days and that did not help the problem at all.

So, even though the DFS Replication diagnostic reports have been telling us that there are no errors nor warnings, we are finding that replication is not actually happening a good bit of the time!
As we attempt to migrate users to the failover file server we find, via tools like Microsoft SyncToy and 2BrightSparks SyncBack, that files are not actually replicated 100%. Out of about one million files spread across 10 different replication groups, two of the replication groups have missed about 1000 files, so replication normally works, but at times it's having a bit of difficulty. Once I can get all the users pointed to a single file server I plan to disable DFS Replication to see if the outage times stop.

Right now, I'm seeing that both file servers are actually having problems, as we have a diagnostic application running on the system partition of each file server appending to a text file on the data partitions every five seconds. At a variety of times, on no apparent timetable, the application cannot append to the text file on the data partition. On a somewhat predictable timetable, though, about every six hours, it seems to get real bad, but only on one file server: the older of the two PE2950s, which has a slower processor. The Performance Monitor tells us that the CPU is spiking to over 100% for 4.5 minutes every six hours. Most outages are roughly 10 to 20 seconds.

I should know more next week after we migrate the rest of the data off the problematic server this weekend. Hopefully we won't be migrating the _problem_ with it!

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 05, 2008 10:55 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

It sounds like you are all updated firmware/driver-wise with the RAID controller, BIOS, etc--have you or they tried installing the latest SAS (or SATA) HD firmware yet? You have to get the utility to make an ISO or CD and boot from that to update the drives. I've only updated one SAS 2950 server so far, which was in the process of being built/installed from scratch--haven't done any "live" systems--but the one I did went fine.
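The replication check Stephen describes in his August 6 message (comparing the two file servers to find files that never arrived) can be approximated without SyncToy or SyncBack. The following is only a minimal sketch, not what either tool actually does: it compares relative file paths under two directory roots and lists those present on the primary but absent on the replica. The function names are hypothetical, it ignores file contents and timestamps, and it does not account for files that DFS Replication may skip on purpose (for example through its file filters).

```python
import os
from pathlib import Path


def relative_files(root: Path) -> set[str]:
    """Collect every file under root as a path relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = Path(dirpath, name).relative_to(root)
            # Use forward slashes so results compare equal across platforms.
            found.add(rel.as_posix())
    return found


def missing_on_replica(primary: Path, replica: Path) -> list[str]:
    """Return files that exist under primary but not under replica."""
    return sorted(relative_files(primary) - relative_files(replica))
```

It would be called once per replicated folder, with the two server paths as arguments, for example `missing_on_replica(Path(r"\\server1\share"), Path(r"\\server2\share"))`; those share names are placeholders, not the real ones from this thread.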
-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 29, 2008 9:41 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

After extended discussions with Dell, I'm really starting to wonder if this is a hardware issue at all.

If you're familiar with Dell's DSET utility you'll know that it is able to capture logs from many areas of both hardware and software related items. They have gone over the log files several times and seen periods where the log files do not capture any information during the "outage", but in no place does any log file capture a problem.

While logged into the console of the problematic server, Windows Explorer seems to go into a Not Responding period of approximately four minutes. The Task Manager, running prior to the outage, is 'frozen' during the outage, so no new tasks nor updates on existing tasks are visible. Performance Monitor running on the server also freezes while the outage is happening, so it is not possible to see anything on screen while the problem happens. I was able to capture a log file of the Performance Monitor and send it to Dell for analysis, but they could not see any problems and have asked for another Performance Monitor capture.

What else could cause Windows Explorer to lock up 'every so often'? It is usually approximately 1 AM, 7 AM, 1 PM and 7 PM, or up to 40 minutes after each of those time frames. Twice now I have seen Explorer windows lock up on ONE VOLUME only, and twice I've seen Windows Explorer lock up entirely, on both volumes.

This server is relatively new, was purchased as a file server, no other roles are active, nothing unnecessary was installed, no Web server, nothing. The only ports open to the file server via an external hardware firewall are those ports required for File/Print sharing (139/TCP, 445/TCP, 137/UDP and 138/UDP).

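To test the pattern Stephen reports (outages at about 1 AM, 7 AM, 1 PM and 7 PM, or up to 40 minutes later), logged outage times can be checked against that window mechanically. This is a rough sketch under stated assumptions: the hours and the 40-minute limit are taken from his message, while the function and constant names are made up for illustration.

```python
from datetime import datetime, timedelta

# Hours and maximum lag as reported in the thread; names are hypothetical.
SCHEDULE_HOURS = (1, 7, 13, 19)
MAX_LAG = timedelta(minutes=40)


def lag_after_schedule(ts: datetime):
    """Return how long after a scheduled hour ts falls, or None if it is
    outside every 40-minute window."""
    for hour in SCHEDULE_HOURS:
        start = ts.replace(hour=hour, minute=0, second=0, microsecond=0)
        lag = ts - start
        # None of the windows cross midnight, so a same-day check is enough.
        if timedelta(0) <= lag <= MAX_LAG:
            return lag
    return None
```

For the 1:15 PM outage on July 16 mentioned elsewhere in the thread, `lag_after_schedule(datetime(2008, 7, 16, 13, 15))` gives a 15-minute lag. An outage time that returns `None` would be one that does not fit the six-hour pattern.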
-----Original Message-----
From: Tom Miller [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 24, 2008 12:21 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Weird. I had a similar problem a month ago on a 2950. The PERC went unresponsive. When I finally got the server back I had lost all my data. That was not a fun day. I was current with patches (Netware) and firmware/BIOS updates.

>>> "Stephen Wimberly" <[EMAIL PROTECTED]> 7/24/2008 12:06 PM >>>

Here is a twist!

Today I was connected to the console of the file server at the very moment the problem occurred. The problem seems to be the drive array, as the System volume responded just fine during the outage, but the internal RAID 5 drive array went to a non-responding state for FOUR MINUTES!

I have opened a ticket with Dell, as it's a Dell PowerEdge 2950 server which is fully under warranty. The tech that answered did not see anything wrong in the DSET report, and has escalated the issue to a supervisor.

So I think our Network guys are right, it's not a network issue, it's inside the box. This is a fairly new server, which runs as a file server only, no other roles are installed, so it 'should' be fairly easy to diagnose.

At the time of the problem, all Windows Explorer windows showing anything on the RAID 5 array go dormant with Not Responding at the top. Any Windows Explorer window displaying something on the system volume responds as normal, where I am able to open and close files, modify and save modified files, etc. The taskbar also goes dormant where it does not respond to any clicking. When the server returned to normal it very quickly processed all the clicks I had done to switch windows, just flashing on the screen rather quickly as though it had been storing my mouse clicks.

The event logs don't record anything during nor after the problem. The next entries in the App, Security, and System logs are well after it started to respond and have nothing to do with 'anything'.
So now I await a return call from Dell. Thought I'd provide a follow up since several of you have sent me messages on what to look for! Thanks again!

-----Original Message-----
From: Kim Longenbaugh [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 3:49 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Have the network guys look at the flow-control settings on your switches. If flow-control is on (as it should be in most cases), ports may be getting overwhelmed with traffic, resulting in pause frames. Flow-control pausing a connection will not result in tcp retransmits. Also, some switches may run out of buffer for the paused frames, although that condition would cause you to start seeing tcp retransmits.

Some switches allow broadcast and unicast throttling. If they're turned on, they may be shutting down connections until the traffic goes below the thresholds again.

An obvious thing is the speed/duplex settings. If there's a mismatch, the resulting degradation may only become noticeable under heavy traffic loads.

Can you identify the source and destination for the SMB traffic? If so, you could try to find what's causing it.

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 22, 2008 2:16 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

This just gets more fun...

Our network team came out to our building to perform an on-site network sniff. There are no TCP retries, so there are no lost packets. Follow that with the statement: 'There is a lot of SMB traffic, and SMB wouldn't attempt a resend, so there might be some lost network packets.' He has taken the network traffic capture to research the SMB traffic.

In the meantime, we find that some machines drop connection at the same time that other machines don't. We have a test script running on several machines which appends to a text file every fifteen seconds and records failures.
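The test script Stephen mentions is not shown anywhere in the thread, so the following is only a guess at its general shape, written in Python for illustration. It appends a timestamp to a file on the share at a fixed interval and records any failure in a local log. All names, the default 15-second interval, and the 5-second "slow" threshold are assumptions. One caveat: when a share hangs, the write may simply block rather than raise an error, so the sketch also logs appends that succeed but take unusually long.

```python
import time
from datetime import datetime
from pathlib import Path


def probe_once(target: Path, failure_log: Path, slow_after_s: float = 5.0) -> bool:
    """Append a timestamp to target; log failed or slow attempts locally."""
    stamp = datetime.now().isoformat(timespec="seconds")
    started = time.monotonic()
    problem = None
    try:
        with target.open("a", encoding="utf-8") as fh:
            fh.write(stamp + "\n")
    except OSError as exc:
        problem = f"append failed: {exc}"
    elapsed = time.monotonic() - started
    if problem is None and elapsed > slow_after_s:
        # The write eventually succeeded, but the share stalled for a while.
        problem = f"append took {elapsed:.1f} s"
    if problem is None:
        return True
    with failure_log.open("a", encoding="utf-8") as log:
        log.write(f"{stamp} {problem}\n")
    return False


def run_probe(target: Path, failure_log: Path,
              interval_s: float = 15.0, iterations=None) -> int:
    """Probe every interval_s seconds and return the number of failures.

    iterations=None keeps probing until the process is stopped.
    """
    failures = 0
    done = 0
    while iterations is None or done < iterations:
        if not probe_once(target, failure_log):
            failures += 1
        done += 1
        time.sleep(interval_s)
    return failures
```

A workstation could run something like `run_probe(Path(r"\\server\share\probe.txt"), Path("probe_failures.log"))`, keeping the failure log on a local disk so it stays writable during an outage. The share path here is a placeholder.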
-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 17, 2008 8:24 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

When we ping the file server and any server in the same network a 'normal' reply would be either =1 ms or =2 ms. At the time of these problems we are getting well over 100 ms for approximately two minutes!

Our network department has looked at wireshark traces from both workstation and server and has merely pointed out that there is SMB traffic happening at the time of the problem. (I would think that to be rather 'normal' when you run an application from a file share.) I asked why they brought it up, whether it is unusual; they said that they did not know and would need to do more research.

So now we are waiting on them to review more log files.

-----Original Message-----
From: Terry Dickson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 2:45 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

So have you tried something simple like a Ping to that server to see if the Pings timeout, or are slower at the time of the slowdowns? Just might help to figure out if it is network related or not.

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 1:34 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

We will "un-team" in the next couple of days as a test; but keep in mind the SQL Server is teamed using the same NICs as well with no issues, that's why it hasn't been suspect yet.

I'm going to look into the firmware tomorrow morning when we have scheduled downtime, thanks for mentioning.

As for software firewall; we normally run the Windows firewall, but turned that off for testing with no change.

The problem occurred again today at 1:15 PM.
It seems that Windows Explorer 'freezes' on almost all domain computers and no one can access their file shares for a few seconds, until a reconnect can be established. One diagnostic script we have running appends to a text file on the server every 15 seconds and during the outage could not append for a full five minutes!

Network ports are not ours to swap; they belong to our network team. Once they give the word we could try that.

There are hardware firewalls at play as well; the firewall team is looking into those to determine possible issues with load balancing, etc.

Thanks for your suggestions!

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 1:42 PM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Hmm.. sounds like it's already been set then, but I don't know as I've always done both the reg entry and the RSS on the Bcom NIC itself. We also are not using teaming at the moment, so I don't know if that might have a separate issue.

Just re-read your post. I see you mentioned all drivers updated, but how about firmware? Are you able to swap a network port the file server is using with the SQL server that works? What else is running on your file servers that is the same across both--any software firewalls?

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 8:23 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

All the registry entries are as you have them....

Although; my "Broadcom BCM5708C NetXtreme II GigE" cards were set to ENABLE 'Receive Side Scaling'. I changed them to 'Disable'. Each card disabled for a moment, then auto re-enabled; so I assume this does not need a restart.

These servers have teamed NICs; all our servers do. The BACS (BroadCom Advanced Control Suite) is set up for switch failover as each NIC is physically plugged to a different switch for failover.

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 10:29 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

They're in the same area of the registry--My .reg file that I import looks like this:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"EnableTCPA"=dword:00000000
"EnableRSS"=dword:00000000
"EnableTCPChimney"=dword:00000000

Also, on the Broadcom NIC(s) properties, look at the advanced tab. Make sure "Receive Side Scaling" is set to Disable.

I haven't done the netsh method, but I understand that can change it w/out needing a server reboot.

-Bonnie

-----Original Message-----
From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 7:23 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Thanks Bonnie!

The TCP Chimney options are off! (I had to look, @ HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\EnableTCPChimney = 0; I've never configured them either way!)

The SNP I don't know how to check. I see where I can use a netsh to set it to disabled, but how would I see its current state?

-----Original Message-----
From: Miller Bonnie L. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 8:56 AM
To: NT System Admin Issues
Subject: RE: Disconnected on a schedule???

Any kind of backup or snapshot taking place at those times? Although I can't say this would happen like clockwork, have you already disabled the Chimney/SNP network options on those servers?

-Bonnie

From: Stephen Wimberly [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2008 5:51 AM
To: NT System Admin Issues
Subject: Disconnected on a schedule???

We have workstations that appear to be losing connection to the file share on the server at almost precise times, every six hours. 7 AM, 1 PM, 7 PM, 1 AM; Repeat.

The event logs on the workstations and servers are clean, both domain controllers and file share server.
So I assume the loss is not long enough for the OS to recognize it. However, we have a custom application running on many machines that can't seem to handle the brief outage and fails like clockwork. The application vendor tells us it has a sixty second timeout before it will fail; certainly long enough to handle any brief disconnect.

Network traces (using wireshark) from the server to workstation and workstation to server do not show any sign of failure. A script that updates a text file on the server every fifteen seconds does show the failure: it fails to update the text file on the server for up to four _minutes_ at a time! Although during the four minute failure period it's able to update once or twice, so it's not a total blackout.

Workstations map a drive to the file share using a DFS path; i.e., \\domain\share. So we tested a direct mapping using \\server\share, and we get the same result. We mapped drives to two different file servers; each file server is in a different building on different ends of campus. The workstations used four test drive mappings, two for each server, one DFS on each server and one direct for each server. All four drive mappings failed at the same time.

The connection to the SQL server is never lost. The SQL server is plugged into the same network switch as the file server.

The Windows Domain has no trusts; it's a single domain forest. There are no services on any server with a six hour schedule that we know of. Backup runs daily at midnight and completes prior to 7 AM. Virus scan is still running at the 7 AM hour, but is long since complete by the 1 PM hour.

Both file servers are Dell PE 2950 running Windows Server 2003 R2; all drivers seem up to date with Dell's support site. Workstations are a variety of makes, running either Windows XP Pro SP2, Windows XP Pro SP3 or Windows Vista SP1, and are scattered all over campus on different network subnets.
Our network department is telling us that the network is fine, it's either a workstation or a server issue.

Anyone seen this type of thing before???

Thanks!

Confidentiality Notice: This e-mail message, including attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure, or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message.

~ Upgrade to Next Generation Antispam/Antivirus with Ninja! ~
~ <http://www.sunbelt-software.com/SunbeltMessagingNinja.cfm> ~
