Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-14 Thread Shen, Xin (Sinux)
Update the kernel will probably be the way to fix your problem.

Best Regards
Sinux

在 2011-3-12,10:08,Ross Walker rswwal...@gmail.com 写道:

 On Mar 11, 2011, at 12:33 PM, PJ pauljer...@gmail.com wrote:
 
 This may or may not be CentOS related, but am out of ideas at this
 point and wanted to bounce this off the list.
 
 I'm running a CentOS 5.5 server, running the latest kernel 
 2.6.18-194.32.1.el5.
 
 Almost everyday around 3:30 AM the server completely locks up and has
 to be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I
 love being an internet janitor! :)
 
 Smells like a hardware issue to me too, but I went through all of the
 dell diagnostics, updated the firmware, everything checks out as being
 okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
 tech. No hardware issues, at least that we were able to find.
 
 There are no cron jobs that run at 3:30, no backups, the server has a
 load of 0, nothing is scheduled around that time...
 
 The only crontab entry at all is */5 * * * * wget -q
 www.websitedomain.com/cron.php /dev/null 21
 They are running Magento for commerce purposes and this runs every 5 minutes.
 
 Why does the server only lockup around 3:30 AM? Because it's knows I
 am fast asleep?
 
 I was able to pull this from /var/log/messages, this happens just
 seconds before locking up completely...
 
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget  D 810001004420 0
 13608  13607 (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
 Mar  8 03:33:20 web1 kernel:  [80063cb9] .text.lock.mutex+0xf/0x14
 Mar  8 03:33:20 web1 kernel:  [8000cf82] do_lookup+0x90/0x1e6
 Mar  8 03:33:20 web1 kernel:  [8000a29c] 
 __link_path_walk+0xa01/0xf5b
 Mar  8 03:33:20 web1 kernel:  [8000ea4b] link_path_walk+0x42/0xb2
 Mar  8 03:33:20 web1 kernel:  [8000cd72] do_path_lookup+0x275/0x2f1
 Mar  8 03:33:23 web1 kernel:  [80012851] getname+0x15b/0x1c2
 Mar  8 03:33:23 web1 kernel:  [800239d1] __user_walk_fd+0x37/0x4c
 Mar  8 03:33:23 web1 kernel:  [80028905] vfs_stat_fd+0x1b/0x4a
 Mar  8 03:33:23 web1 kernel:  [80023703] sys_newstat+0x19/0x31
 Mar  8 03:33:23 web1 kernel:  [8005d116] system_call+0x7e/0x83
 
 If anyone has some advice on where to go from here it would be greatly
 appreciated.
 
 Do a fsck of the file system wget is writing to as there might be a 
 corruption it hits only on the 3:30am run as that's when the other vendor 
 dumps data to be downloaded.
 
 You could also check to see if a RAID  patrol read (scrub/predictive failure 
 detection) is happening around this time as well and disable/reschedule it.
 
 -Ross
 
 ___
 CentOS mailing list
 CentOS@centos.org
 http://lists.centos.org/mailman/listinfo/centos
 
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-14 Thread Scott Silva
snip
 
 (those commands should run the cron normally except starting at 03:00,
 where it should kick off at 3:15 instead of 03:05)
 
 If it also fails to start at 03:15 then that would suggest that
 something is happening to the cron job the last time it is run to make
 it hang (or make the system think it is hung).
 
I wonder if the other end has some file(s) locked for writing or other reason
and that is choking the local wget.

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-14 Thread Les Mikesell
On 3/14/2011 3:24 PM, Scott Silva wrote:
 snip

 (those commands should run the cron normally except starting at 03:00,
 where it should kick off at 3:15 instead of 03:05)

 If it also fails to start at 03:15 then that would suggest that
 something is happening to the cron job the last time it is run to make
 it hang (or make the system think it is hung).

 I wonder if the other end has some file(s) locked for writing or other reason
 and that is choking the local wget.

But nothing the other end does should cause a kernel task hang. 
Shouldn't that only be inside a device driver?

-- 
   Les Mikesell
lesmikes...@gmail.com
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-12 Thread Alexander Georgiev

 Almost everyday around 3:30 AM the server completely locks up and has
 to be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I
 love being an internet janitor! :)

 Smells like a hardware issue to me too, but I went through all of the
 dell diagnostics, updated the firmware, everything checks out as being
 okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
 tech. No hardware issues, at least that we were able to find.

 There are no cron jobs that run at 3:30, no backups, the server has a
 load of 0, nothing is scheduled around that time...

do you have smartd set to run short/long hard disk checks during the
night? it is done via /etc/smartd.conf, not via cron.
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-12 Thread Johnny Hughes
On 03/11/2011 04:06 PM, PJ wrote:

snip

 Interesting entries in /var/log/cron:
 
 -snip-
 (this runs 24/7 every 5 minutes as normal...)
 
 Mar 11 02:20:01 web1 crond[12919]: (webuser) CMD (wget -q
 www.domain.com/cron.php /dev/null 21)

snip

 (fast forward to 3 AM, the same cron job starts getting delayed by
 3:27 the server was non responsive. Never seen this before)
 
 Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts /etc/cron.hourly)
 Mar 11 03:07:20 web1 crond[13727]: (webuser) error: Job execution of
 per-minute job scheduled for 03:05 delayed into subsequent minute
 03:07. Skipping job run.
 Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set
 security context

snip

 I don't think it is a coincidence I'm seeing CRON (webuser) ERROR:
 cannot set security context around the same time the server stops
 responding.
 
 I'm not familiar with this message, anyone here seen it?
 
 cron daily fires off at 4:02, after all this stuff...
 
 nothing in cron.hourly..
 
 Getting warmer I think, but still cant figure it out!

OK, did the webuser job run at 03:00:01 or does it start at 03:05:01?

The system likely thinks that cron job is still running from the last
time it was initiated. (Be it at 03:00:01 or 02:55:01).

Is there anything that the php file called by the cron (via wget) is
supposed to do at 03:00 that is different than other times?

You might try something like this:

*/5 0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 * * *
current_cron_command

15,30,45 3  * * * current_cron_comamnd

(those commands should run the cron normally except starting at 03:00,
where it should kick off at 3:15 instead of 03:05)

If it also fails to start at 03:15 then that would suggest that
something is happening to the cron job the last time it is run to make
it hang (or make the system think it is hung).



signature.asc
Description: OpenPGP digital signature
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


[CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-11 Thread PJ
This may or may not be CentOS related, but am out of ideas at this
point and wanted to bounce this off the list.

I'm running a CentOS 5.5 server, running the latest kernel 2.6.18-194.32.1.el5.

Almost everyday around 3:30 AM the server completely locks up and has
to be power cycled before it will come back online.
(this means someone hat to wake up and reboot the server, oh how I
love being an internet janitor! :)

Smells like a hardware issue to me too, but I went through all of the
dell diagnostics, updated the firmware, everything checks out as being
okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
tech. No hardware issues, at least that we were able to find.

There are no cron jobs that run at 3:30, no backups, the server has a
load of 0, nothing is scheduled around that time...

The only crontab entry at all is */5 * * * * wget -q
www.websitedomain.com/cron.php /dev/null 21
They are running Magento for commerce purposes and this runs every 5 minutes.

Why does the server only lockup around 3:30 AM? Because it's knows I
am fast asleep?

I was able to pull this from /var/log/messages, this happens just
seconds before locking up completely...

Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
than 120 seconds.
Mar  8 03:33:19 web1 kernel: echo 0 
/proc/sys/kernel/hung_task_timeout_secs disables this message.
Mar  8 03:33:19 web1 kernel: wget  D 810001004420 0
13608  13607 (NOTLB)
Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
81007bc7bd88 81000100d3f8
Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
8100849db0c0 80308b60
Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
8100849db2a8 64c82eae
Mar  8 03:33:19 web1 kernel: Call Trace:
Mar  8 03:33:20 web1 kernel:  [80063c6f]
__mutex_lock_slowpath+0x60/0x9b
Mar  8 03:33:20 web1 kernel:  [80063cb9] .text.lock.mutex+0xf/0x14
Mar  8 03:33:20 web1 kernel:  [8000cf82] do_lookup+0x90/0x1e6
Mar  8 03:33:20 web1 kernel:  [8000a29c] __link_path_walk+0xa01/0xf5b
Mar  8 03:33:20 web1 kernel:  [8000ea4b] link_path_walk+0x42/0xb2
Mar  8 03:33:20 web1 kernel:  [8000cd72] do_path_lookup+0x275/0x2f1
Mar  8 03:33:23 web1 kernel:  [80012851] getname+0x15b/0x1c2
Mar  8 03:33:23 web1 kernel:  [800239d1] __user_walk_fd+0x37/0x4c
Mar  8 03:33:23 web1 kernel:  [80028905] vfs_stat_fd+0x1b/0x4a
Mar  8 03:33:23 web1 kernel:  [80023703] sys_newstat+0x19/0x31
Mar  8 03:33:23 web1 kernel:  [8005d116] system_call+0x7e/0x83

If anyone has some advice on where to go from here it would be greatly
appreciated.

Thanks in advance.

--
PJF
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-11 Thread Boris Epstein
On Fri, Mar 11, 2011 at 12:33 PM, PJ pauljer...@gmail.com wrote:
 This may or may not be CentOS related, but am out of ideas at this
 point and wanted to bounce this off the list.

 I'm running a CentOS 5.5 server, running the latest kernel 
 2.6.18-194.32.1.el5.

 Almost everyday around 3:30 AM the server completely locks up and has
 to be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I
 love being an internet janitor! :)

 Smells like a hardware issue to me too, but I went through all of the
 dell diagnostics, updated the firmware, everything checks out as being
 okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
 tech. No hardware issues, at least that we were able to find.

 There are no cron jobs that run at 3:30, no backups, the server has a
 load of 0, nothing is scheduled around that time...

 The only crontab entry at all is */5 * * * * wget -q
 www.websitedomain.com/cron.php /dev/null 21
 They are running Magento for commerce purposes and this runs every 5 minutes.

 Why does the server only lockup around 3:30 AM? Because it's knows I
 am fast asleep?

 I was able to pull this from /var/log/messages, this happens just
 seconds before locking up completely...

 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget          D 810001004420     0
 13608  13607                     (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
 Mar  8 03:33:20 web1 kernel:  [80063cb9] .text.lock.mutex+0xf/0x14
 Mar  8 03:33:20 web1 kernel:  [8000cf82] do_lookup+0x90/0x1e6
 Mar  8 03:33:20 web1 kernel:  [8000a29c] 
 __link_path_walk+0xa01/0xf5b
 Mar  8 03:33:20 web1 kernel:  [8000ea4b] link_path_walk+0x42/0xb2
 Mar  8 03:33:20 web1 kernel:  [8000cd72] do_path_lookup+0x275/0x2f1
 Mar  8 03:33:23 web1 kernel:  [80012851] getname+0x15b/0x1c2
 Mar  8 03:33:23 web1 kernel:  [800239d1] __user_walk_fd+0x37/0x4c
 Mar  8 03:33:23 web1 kernel:  [80028905] vfs_stat_fd+0x1b/0x4a
 Mar  8 03:33:23 web1 kernel:  [80023703] sys_newstat+0x19/0x31
 Mar  8 03:33:23 web1 kernel:  [8005d116] system_call+0x7e/0x83

 If anyone has some advice on where to go from here it would be greatly
 appreciated.

 Thanks in advance.

 --
 PJF
 ___
 CentOS mailing list
 CentOS@centos.org
 http://lists.centos.org/mailman/listinfo/centos


Have you tried disabling the cron job you think is at fault to see if
the lock up goes away? Also, have you checked all the users' crontabs?

Boris.
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
PJ wrote:
 This may or may not be CentOS related, but am out of ideas at this point
and wanted to bounce this off the list.

 I'm running a CentOS 5.5 server, running the latest kernel
 2.6.18-194.32.1.el5.

 Almost everyday around 3:30 AM the server completely locks up and has to
be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I love
being an internet janitor! :)

Please log of the Internet. We are cleaning it. You may log back on later.

snip
 I was able to pull this from /var/log/messages, this happens just
seconds before locking up completely...

 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than
120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget  D 810001004420 0
13608  13607 (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
snip
Anyone else smell an OOM killer? But it's clearly whatever the wget's
after that's killing the system.

   mark



___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread PJ
On Fri, Mar 11, 2011 at 10:06 AM,  m.r...@5-cent.us wrote:
 PJ wrote:
 This may or may not be CentOS related, but am out of ideas at this point
 and wanted to bounce this off the list.

 I'm running a CentOS 5.5 server, running the latest kernel
 2.6.18-194.32.1.el5.

 Almost everyday around 3:30 AM the server completely locks up and has to
 be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I love
 being an internet janitor! :)

 Please log of the Internet. We are cleaning it. You may log back on later.

 snip
 I was able to pull this from /var/log/messages, this happens just
 seconds before locking up completely...

 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than
 120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget          D 810001004420     0
 13608  13607                     (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
 snip
 Anyone else smell an OOM killer? But it's clearly whatever the wget's
 after that's killing the system.

           mark



 ___
 CentOS mailing list
 CentOS@centos.org
 http://lists.centos.org/mailman/listinfo/centos


What makes no sense to me is this runs every 5 minutes all day, but
only around 3:30 AM does it look up.

There is nothing in the log that suggests the kernel is having to kill
processes because it is out of resources.

No httpd invoked oom-killer etc... which I have seen before in other
situations.

http://bugs.centos.org/view.php?id=4515 sounds like what I have going
on, but not with kjournald of course...

Thanks,

--
PJ
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread Brunner, Brian T.
centos-boun...@centos.org wrote:

 What makes no sense to me is this runs every 5 minutes all day, but
 only around 3:30 AM does it look up.
 
 http://bugs.centos.org/view.php?id=4515 sounds like what I have going
 on, but not with kjournald of course...
 
 Thanks,

Can you skip the runs from 3 to 4 AM  (or 3:15 to 3:45)? It might be
what the target of wget is doing on *their* end that makes your access
fatal to your machine.

Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me
***
This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom
they are addressed. If you have received this email in error please
notify the system manager. This footnote also confirms that this
email message has been swept for the presence of computer viruses.
www.Hubbell.com - Hubbell Incorporated**

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread Les Mikesell
On 3/11/2011 12:20 PM, PJ wrote:

 What makes no sense to me is this runs every 5 minutes all day, but
 only around 3:30 AM does it look up.

When did it start happening?  Did it correspond to any hardware change 
or software update that you can pin down?

 http://bugs.centos.org/view.php?id=4515 sounds like what I have going
 on, but not with kjournald of course...

Is anything else happening that might make the disks busy around then? 
Maybe the raid-check job in cron.weekly is still running?

-- 
   Les Mikesell
lesmikes...@gmail.com
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
PJ wrote:
 On Fri, Mar 11, 2011 at 10:06 AM,  m.r...@5-cent.us wrote:
 PJ wrote:
snip
 I'm running a CentOS 5.5 server, running the latest kernel
 2.6.18-194.32.1.el5.

 Almost everyday around 3:30 AM the server completely locks up and has
 to be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I love
 being an internet janitor! :)
 snip
 I was able to pull this from /var/log/messages, this happens just
 seconds before locking up completely...

 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget          D 810001004420     0
 13608  13607                     (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
 snip
 Anyone else smell an OOM killer? But it's clearly whatever the wget's
 after that's killing the system.

 What makes no sense to me is this runs every 5 minutes all day, but
 only around 3:30 AM does it look up.

 There is nothing in the log that suggests the kernel is having to kill
 processes because it is out of resources.

 No httpd invoked oom-killer etc... which I have seen before in other
 situations.

 http://bugs.centos.org/view.php?id=4515 sounds like what I have going
 on, but not with kjournald of course...

Couple things: a few weeks ago, we were getting OOM Killer running with no
log entries, but that was due to someone starting a parallel processing
job that wanted all the cores... and near the end, wanted half again the
memory, and *all* the threads hit that point apparently so fast OOM Killer
didn't have time or memory to run.

Another thing: it may be running every five minutes, but you might want to
look at what it gets at 03:30 that might be different than the rest of the
day, such as a major backup, or an entire day's reconsiliations, complete
with gigabytes of scans

   mark

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
Les Mikesell wrote:
 On 3/11/2011 12:20 PM, PJ wrote:

 What makes no sense to me is this runs every 5 minutes all day, but
 only around 3:30 AM does it look up.

 When did it start happening?  Did it correspond to any hardware change
 or software update that you can pin down?

 http://bugs.centos.org/view.php?id=4515 sounds like what I have going
 on, but not with kjournald of course...

 Is anything else happening that might make the disks busy around then?
 Maybe the raid-check job in cron.weekly is still running?

Any chance of something *not* on your server, such as a router or firewall
doing something then?

 mark

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread Steve Thompson

 PJ wrote:
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than
 120 seconds.

Check the number of dirty pages:

grep Dirty /proc/meminfo

relative to the dirty_ratio setting:

cat /proc/sys/vm/dirty_ratio

to see if the system is going into synhronous flush mode around that time 
(especially if dirty_ratio is large and you have a lot of physical 
memory). This is what I usually see as the cause of the blocked for more 
than message. I've also found that it can be several minutes, and up to 
20 minutes, before the system recovers (but recover it always does).

-Steve
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-11 Thread Denniston, Todd A CIV NAVSURFWARCENDIV Crane
 -Original Message-
 From: centos-boun...@centos.org [mailto:centos-boun...@centos.org] On
 Behalf Of PJ
 Sent: Friday, March 11, 2011 12:34
 To: centos@centos.org
 Subject: [CentOS] Server locking up everyday around 3:30 AM - (INFO:
task
 wget:13608 blocked for more than 120 seconds) need sleep, help.
SNIP
 There are no cron jobs that run at 3:30, no backups, the server has a
 load of 0, nothing is scheduled around that time...
 
SNIP

Are you sure the stuff in /etc/cron.daily/ is done by then or not
started yet?
Could be something like the mlocate or makewhatis chewing up CPU/Mem.
IIRC the stuff in /etc/cron.daily/ runs in alphabetic order so, are you
(root) getting the logwatch messages, and at what time?
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread PJ
On Fri, Mar 11, 2011 at 11:05 AM, Steve Thompson s...@vgersoft.com wrote:

 PJ wrote:
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than
 120 seconds.

 Check the number of dirty pages:

        grep Dirty /proc/meminfo

 relative to the dirty_ratio setting:

        cat /proc/sys/vm/dirty_ratio

 to see if the system is going into synhronous flush mode around that time
 (especially if dirty_ratio is large and you have a lot of physical
 memory). This is what I usually see as the cause of the blocked for more
 than message. I've also found that it can be several minutes, and up to
 20 minutes, before the system recovers (but recover it always does).

 -Steve
 ___
 CentOS mailing list
 CentOS@centos.org
 http://lists.centos.org/mailman/listinfo/centos


Great replies from everyone, I really appreciate the feedback.

Interesting entries in /var/log/cron:

-snip-
(this runs 24/7 every 5 minutes as normal...)

Mar 11 02:20:01 web1 crond[12919]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 02:25:01 web1 crond[12950]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 02:30:01 web1 crond[12969]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 02:35:01 web1 crond[12992]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 02:40:01 web1 crond[13014]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 02:45:01 web1 crond[13218]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)

-snip-
(fast forward to 3 AM, the same cron job starts getting delayed by
3:27 the server was non responsive. Never seen this before)

Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts /etc/cron.hourly)
Mar 11 03:07:20 web1 crond[13727]: (webuser) error: Job execution of
per-minute job scheduled for 03:05 delayed into subsequent minute
03:07. Skipping job run.
Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set
security context
Mar 11 03:13:00 web1 crond[13825]: (webuser) error: Job execution of
per-minute job scheduled for 03:10 delayed into subsequent minute
03:13. Skipping job run.
Mar 11 03:13:00 web1 crond[13825]: CRON (webuser) ERROR: cannot set
security context
Mar 11 03:19:29 web1 crond[13854]: (webuser) error: Job execution of
per-minute job scheduled for 03:15 delayed into subsequent minute
03:19. Skipping job run.
Mar 11 03:20:16 web1 crond[13890]: (webuser) CMD (wget -q
www.domain.com/cron.php /dev/null 21)
Mar 11 03:21:01 web1 crond[13854]: CRON (webuser) ERROR: cannot set
security context
Mar 11 03:27:41 web1 crond[13912]: (webuser) error: Job execution of
per-minute job scheduled for 03:25 delayed into subsequent minute
03:27. Skipping job run.
Mar 11 03:27:42 web1 crond[13912]: CRON (webuser) ERROR: cannot set
security context
Mar 11 03:32:05 web1 crond[13930]: (webuser) error: Job execution of
per-minute job scheduled for 03:30 delayed into subsequent minute
03:32. Skipping job run.
Mar 11 03:32:05 web1 crond[13930]: CRON (webuser) ERROR: cannot set
security context
Mar 11 03:36:23 web1 crond[13948]: (webuser) error: Job execution of
per-minute job scheduled for 03:35 delayed into subsequent minute
03:36. Skipping job run.
Mar 11 03:36:23 web1 crond[13948]: CRON (webuser) ERROR: cannot set
security context

(rebooted)
Mar 11 03:41:15 web1 crond[4776]: (CRON) STARTUP (V5.0)

-snip-

I don't think it is a coincidence I'm seeing CRON (webuser) ERROR:
cannot set security context around the same time the server stops
responding.

I'm not familiar with this message, anyone here seen it?

cron daily fires off at 4:02, after all this stuff...

nothing in cron.hourly..

Getting warmer I think, but still cant figure it out!

--
PJ
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
PJ wrote:
 On Fri, Mar 11, 2011 at 11:05 AM, Steve Thompson s...@vgersoft.com wrote:

 PJ wrote:
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
than 120 seconds.
snip
 Great replies from everyone, I really appreciate the feedback.

 Interesting entries in /var/log/cron:
snip
 Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts
/etc/cron.hourly) Mar 11 03:07:20 web1 crond[13727]: (webuser) error:
Job execution of per-minute job scheduled for 03:05 delayed into
subsequent minute 03:07. Skipping job run.
 Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set
security context
snip
SELINUX! Look at /var/log/messages for an selinux error: if you don't have
sealert, install the package, then use it on /var/log/audit.

Or put selinux in permissive mode.

mark



___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread PJ
On Fri, Mar 11, 2011 at 2:33 PM,  m.r...@5-cent.us wrote:
 PJ wrote:
 On Fri, Mar 11, 2011 at 11:05 AM, Steve Thompson s...@vgersoft.com wrote:

 PJ wrote:
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 snip
 Great replies from everyone, I really appreciate the feedback.

 Interesting entries in /var/log/cron:
 snip
 Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts
 /etc/cron.hourly) Mar 11 03:07:20 web1 crond[13727]: (webuser) error:
 Job execution of per-minute job scheduled for 03:05 delayed into
 subsequent minute 03:07. Skipping job run.
 Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set
 security context
 snip
 SELINUX! Look at /var/log/messages for an selinux error: if you don't have
 sealert, install the package, then use it on /var/log/audit.

 Or put selinux in permissive mode.

        mark



 ___
 CentOS mailing list
 CentOS@centos.org
 http://lists.centos.org/mailman/listinfo/centos


I thought about that :)

[root@web1 ~]# sestatus
SELinux status: disabled

It's already disabled and always has been.

Thanks,

--
PJ
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task wget:13608 blocked for more than 120 seconds) need sleep, help.

2011-03-11 Thread Ross Walker
On Mar 11, 2011, at 12:33 PM, PJ pauljer...@gmail.com wrote:

 This may or may not be CentOS related, but am out of ideas at this
 point and wanted to bounce this off the list.
 
 I'm running a CentOS 5.5 server, running the latest kernel 
 2.6.18-194.32.1.el5.
 
 Almost everyday around 3:30 AM the server completely locks up and has
 to be power cycled before it will come back online.
 (this means someone hat to wake up and reboot the server, oh how I
 love being an internet janitor! :)
 
 Smells like a hardware issue to me too, but I went through all of the
 dell diagnostics, updated the firmware, everything checks out as being
 okay, RAID, disks, RAM, etc... Spent an hour on the phone with a Dell
 tech. No hardware issues, at least that we were able to find.
 
 There are no cron jobs that run at 3:30, no backups, the server has a
 load of 0, nothing is scheduled around that time...
 
 The only crontab entry at all is */5 * * * * wget -q
 www.websitedomain.com/cron.php /dev/null 21
 They are running Magento for commerce purposes and this runs every 5 minutes.
 
 Why does the server only lockup around 3:30 AM? Because it's knows I
 am fast asleep?
 
 I was able to pull this from /var/log/messages, this happens just
 seconds before locking up completely...
 
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 Mar  8 03:33:19 web1 kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Mar  8 03:33:19 web1 kernel: wget  D 810001004420 0
 13608  13607 (NOTLB)
 Mar  8 03:33:19 web1 kernel:  81007bc7bc78 0086
 81007bc7bd88 81000100d3f8
 Mar  8 03:33:19 web1 kernel:  81007bc7bbf0 0007
 8100849db0c0 80308b60
 Mar  8 03:33:19 web1 kernel:  00013a2964cdf439 3237
 8100849db2a8 64c82eae
 Mar  8 03:33:19 web1 kernel: Call Trace:
 Mar  8 03:33:20 web1 kernel:  [80063c6f]
 __mutex_lock_slowpath+0x60/0x9b
 Mar  8 03:33:20 web1 kernel:  [80063cb9] .text.lock.mutex+0xf/0x14
 Mar  8 03:33:20 web1 kernel:  [8000cf82] do_lookup+0x90/0x1e6
 Mar  8 03:33:20 web1 kernel:  [8000a29c] 
 __link_path_walk+0xa01/0xf5b
 Mar  8 03:33:20 web1 kernel:  [8000ea4b] link_path_walk+0x42/0xb2
 Mar  8 03:33:20 web1 kernel:  [8000cd72] do_path_lookup+0x275/0x2f1
 Mar  8 03:33:23 web1 kernel:  [80012851] getname+0x15b/0x1c2
 Mar  8 03:33:23 web1 kernel:  [800239d1] __user_walk_fd+0x37/0x4c
 Mar  8 03:33:23 web1 kernel:  [80028905] vfs_stat_fd+0x1b/0x4a
 Mar  8 03:33:23 web1 kernel:  [80023703] sys_newstat+0x19/0x31
 Mar  8 03:33:23 web1 kernel:  [8005d116] system_call+0x7e/0x83
 
 If anyone has some advice on where to go from here it would be greatly
 appreciated.

Do a fsck of the file system wget is writing to as there might be a corruption 
it hits only on the 3:30am run as that's when the other vendor dumps data to be 
downloaded.

You could also check to see if a RAID  patrol read (scrub/predictive failure 
detection) is happening around this time as well and disable/reschedule it.

-Ross

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
PJ wrote:
 On Fri, Mar 11, 2011 at 2:33 PM,  m.r...@5-cent.us wrote:
 PJ wrote:
 On Fri, Mar 11, 2011 at 11:05 AM, Steve Thompson s...@vgersoft.com
 wrote:

 PJ wrote:
 Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more
 than 120 seconds.
 snip
 Great replies from everyone, I really appreciate the feedback.

 Interesting entries in /var/log/cron:
 snip
 Mar 11 03:01:01 web1 crond[13613]: (root) CMD (run-parts
 /etc/cron.hourly) Mar 11 03:07:20 web1 crond[13727]: (webuser) error:
 Job execution of per-minute job scheduled for 03:05 delayed into
 subsequent minute 03:07. Skipping job run.
 Mar 11 03:07:20 web1 crond[13727]: CRON (webuser) ERROR: cannot set
 security context
 snip
 SELINUX! Look at /var/log/messages for an selinux error: if you don't
 have sealert, install the package, then use it on /var/log/audit.

 Or put selinux in permissive mode.

 I thought about that :)

 [root@web1 ~]# sestatus
 SELinux status: disabled

 It's already disabled and always has been.

acl's?

   mark

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Server locking up everyday around 3:30 AM

2011-03-11 Thread m . roth
Here's a thought: see what directory the wget's trying to put the file in.

   mark

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos