Mike,

Are you sure it's not possible for sdb to be idle for just one second? If you look at the interval right after the one you pointed out, you'll see r/s is 2.97 and w/s is 0.99, so it did about 3 reads and 1 write in that one-second interval. The device appears to be used very little. I think it's quite possible that some one-second intervals have no reads or writes at all, don't you think?
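If you want to check, something like this (a rough, untested sketch that assumes the log format you posted below, where r/s and w/s are the 4th and 5th columns of each device line) would count the one-second samples in which sdb did no I/O at all:

    # Count 1-second iostat samples where sdb shows zero reads and writes.
    # Assumes the extended-stats layout of /root/web03-iostat.txt below,
    # i.e. "sdb  rrqm/s  wrqm/s  r/s  w/s ..." so r/s is $4 and w/s is $5.
    awk '/^sdb / { total++; if ($4 == 0 && $5 == 0) idle++ }
         END { printf "%d idle out of %d samples\n", idle, total }' /root/web03-iostat.txt

If that turns up more than the one sample you flagged, those 0.00% intervals are probably just quiet seconds rather than a stall.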
Thanks, Herbert.

mike wrote:
> Thanks.
>
> If I have the opportunity to run the (buggy) new kernel again I will
> try this. That is definitely a problem, and I think I need to set the
> oracle behavior to crash and not auto-reboot for this to be effective,
> right?
>
> That is just one issue.
> 1) 2.6.24-16 with load completely crashes the node producing the largest i/o
> 2) 2.6.22-19 utilization drops to 0% and causes a hiccup randomly (I
> don't see a pattern, and no batch jobs or other things are running at
> the time it happens) - this is more important, as it is still happening
> even though I'm running the more "stable" kernel.
>
> On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/networking/netconsole.txt;h=3c2f2b3286385337ce5ec24afebd4699dd1e6e0a;hb=HEAD
>>
>> netconsole is a facility to capture oops traces. It is not a console
>> per se and does not require a head/gtk/x11 etc. to work. The link above
>> explains the usage, etc.
>>
>> mike wrote:
>>> Well, these are headless production servers, CLI only: no GTK, no X11.
>>> Also, I am not running the newer kernels (and I can't...). It looks
>>> like I cannot run a hybrid of 2.6.24-16 and 2.6.22-19; whichever one
>>> has mounted the drive first is the winner.
>>>
>>> If I mix them, I can get the 2.6.24s to mount, then the older ones
>>> give the "number too large" error or whatever. So I can't currently
>>> use one server on my cluster to test, because it would require
>>> upgrading all of them just for this test.
>>>
>>> On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
>>>> Setting up netconsole does not require a reboot. The idea is to
>>>> catch the oops trace when the oops happens. Without that trace,
>>>> we are flying blind.
>>>>
>>>> mike wrote:
>>>>> Since these are production I can't do much.
>>>>>
>>>>> But I did get an error (it's not happening as much, but it still
>>>>> blips here and there).
>>>>>
>>>>> Notice that /dev/sdb (my iscsi target using ocfs2) hits 0.00%
>>>>> utilization 3 seconds before my proxy says "hey, timeout" - every
>>>>> other second there is -always- some utilization going on.
>>>>>
>>>>> What could be steps to figure out this issue? Using debugfs.ocfs2 or
>>>>> something?
>>>>>
>>>>> It's mounted as:
>>>>> /dev/sdb1 on /home type ocfs2
>>>>> (rw,_netdev,noatime,data=writeback,heartbeat=local)
>>>>>
>>>>> I know I'm not being much help, but I'm willing to try almost
>>>>> anything as long as it doesn't cause downtime or require
>>>>> cluster-wide changes (since those require downtime...) - I want to
>>>>> try to go back to 2.6.24-16 with data=writeback and see if that
>>>>> fixes the crashing issue, but if I'm having issues like this
>>>>> already, perhaps I should resolve this before moving up.
>>>>> [EMAIL PROTECTED] ~]# cat /root/web03-iostat.txt
>>>>>
>>>>> Time: 02:11:46 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            3.71   0.00    27.23     8.91    0.00  60.15
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00   54.46   0.00  309.90    0.00  2914.85      9.41     23.08   74.47   0.93  28.71
>>>>> sdb       12.87    0.00  17.82    0.00  245.54     0.00     13.78      0.33   17.78  18.33  32.67
>>>>>
>>>>> Time: 02:11:47 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.25   0.00    26.24     2.23    0.00  71.29
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>>>> sdb        5.94    0.00  22.77    0.99  228.71     0.99      9.67      0.42   17.92  17.08  40.59
>>>>>
>>>>> Time: 02:11:48 PM  <- THIS HAS THE ISSUE
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.00   0.00    25.99     0.00    0.00  74.01
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00   10.89   0.00    2.97    0.00   110.89     37.33      0.00    0.00   0.00   0.00
>>>>> sdb        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>>>>
>>>>> Time: 02:11:49 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.25   0.00    14.85     0.99    0.00  83.91
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>>>> sdb        0.99    0.00   2.97    0.99   30.69     0.99      8.00      0.07   17.50  17.50   6.93
>>>>>
>>>>> Time: 02:11:50 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.74   0.00     1.24     1.73    0.00  96.29
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>>>> sdb        0.99    0.00   5.94    0.00   55.45     0.00      9.33      0.07   11.67  11.67   6.93
>>>>>
>>>>> Time: 02:11:51 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.00   0.00     1.24    16.34    0.00  82.43
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00  153.47   0.00  494.06    0.00  5156.44     10.44     55.62  107.23   1.16  57.43
>>>>> sdb        2.97    0.00  11.88    0.99  117.82     0.99      9.23      0.26   13.08  20.00  25.74
>>>>>
>>>>> Time: 02:11:52 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.00   0.00     0.25     3.22    0.00  96.53
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00    0.00   0.00   16.83    0.00   158.42      9.41      0.13  164.71   1.18   1.98
>>>>> sdb        1.98    0.00   2.97    0.00   39.60     0.00     13.33      0.13   73.33  43.33  12.87
>>>>>
>>>>> Time: 02:11:53 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            0.50   0.00     0.25     4.70    0.00  94.55
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00    0.00   0.00    0.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
>>>>> sdb        5.94    0.00  11.88    0.99  141.58     0.99     11.08      0.20   15.38  15.38  19.80
>>>>>
>>>>> Time: 02:11:54 PM
>>>>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>>>>>            3.96   0.00    10.15     0.74    0.00  85.15
>>>>>
>>>>> Device:  rrqm/s  wrqm/s    r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
>>>>> sda        0.00   20.79   0.00    4.95    0.00   205.94     41.60      0.00    0.00   0.00   0.00
>>>>> sdb        4.95    0.00   5.94    0.00   87.13     0.00     14.67      0.07   11.67  11.67   6.93
>>>>>
>>>>> On 4/21/08, Sunil Mushran <[EMAIL PROTECTED]> wrote:
>>>>>> Do you have the panic output... the kernel stack trace? We'll need
>>>>>> that to figure this out. Without that, we can only speculate.
>>>>>>
>>>>>> mike wrote:
>>>>>>> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote:
>>>>>>>> mike wrote:
>>>>>>>>> I have changed my kernel back to 2.6.22-14-server, and now I don't
>>>>>>>>> get the kernel panics. It seems like an issue with 2.6.24-16, and
>>>>>>>>> some i/o made it crash...
>>>>>>>> OK, so it seems that it is a bug in the ocfs2 kernel code, not in
>>>>>>>> ocfs2-tools. :)
>>>>>>>> Could you please describe in more detail how the kernel panic
>>>>>>>> happens?
>>>>>>> Yeah, this specific issue seems like a kernel issue.
>>>>>>>
>>>>>>> I don't know; these are production systems and I am already getting
>>>>>>> angry customers. I can't really test anymore. Both are standard
>>>>>>> Ubuntu kernels.
>>>>>>>
>>>>>>> Okay: 2.6.22-14-server (I think still minor file access issues)
>>>>>>> Breaks under load: 2.6.24-16-server
>>>>>>>
>>>>>>>>> However, I am still getting file access timeouts once in a while.
>>>>>>>>> I am nervous about putting more load on the setup.
>>>>>>>> Also, please provide more details about that.
>>>>>>> I am using nginx for a frontend load balancer, and nginx for the
>>>>>>> webservers as well. This doesn't seem to be related to the
>>>>>>> webserver at all, though; it was happening before this.
>>>>>>>
>>>>>>> lvs01 proxies traffic in to web01, web02, and web03 (currently
>>>>>>> using nginx; before, I was using LVS/ipvsadm).
>>>>>>>
>>>>>>> Every so often, one of the webservers sends me back
>>>>>>>
>>>>>>>>> [EMAIL PROTECTED] .batch]# cat /etc/default/o2cb
>>>>>>>>>
>>>>>>>>> # O2CB_ENABLED: 'true' means to load the driver on boot.
>>>>>>>>> O2CB_ENABLED=true
>>>>>>>>>
>>>>>>>>> # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
>>>>>>>>> O2CB_BOOTCLUSTER=mycluster
>>>>>>>>>
>>>>>>>>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>>>>>>>>> O2CB_HEARTBEAT_THRESHOLD=7
>>>>>>>> This value is a little small. How did you set up your shared disk
>>>>>>>> (iSCSI or ...)? The most common value I have heard of is 61, which
>>>>>>>> is about 120 secs. I don't know the reason; maybe Sunil can tell
>>>>>>>> you. ;)
>>>>>>>> You can also refer to
>>>>>>>> http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT.
>>>>>>>>
>>>>>>>>> # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
>>>>>>>>> O2CB_IDLE_TIMEOUT_MS=10000
>>>>>>>>>
>>>>>>>>> # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
>>>>>>>>> O2CB_KEEPALIVE_DELAY_MS=5000
>>>>>>>>>
>>>>>>>>> # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
>>>>>>>>> O2CB_RECONNECT_DELAY_MS=2000
>>>>>>>>>
>>>>>>>>> On 4/21/08, Tao Ma <[EMAIL PROTECTED]> wrote:
>>>>>>>>>> Hi Mike,
>>>>>>>>>> Are you sure it is caused by the update of ocfs2-tools? AFAIK,
>>>>>>>>>> ocfs2-tools only includes tools like mkfs, fsck, tunefs, etc., so
>>>>>>>>>> if you haven't made any change to the disk (using these new
>>>>>>>>>> tools), they shouldn't cause a kernel panic, since they are all
>>>>>>>>>> user-space tools.
>>>>>>>>>>
>>>>>>>>>> Then there is only one other possibility. Have you modified
>>>>>>>>>> /etc/sysconfig/o2cb (that is the location on RHEL; I'm not sure
>>>>>>>>>> of the location on ubuntu)? I have checked the rpm package for
>>>>>>>>>> RHEL: it will update /etc/sysconfig/o2cb, and this file has some
>>>>>>>>>> timeouts defined in it.
>>>>>>>>>>
>>>>>>>>>> So do you have a backup of this file? If yes, please restore it
>>>>>>>>>> to see whether it helps (I can't say for sure that it will).
>>>>>>>>>> If not, do you remember the old values of the timeouts you set
>>>>>>>>>> for ocfs2? If yes, you can use o2cb configure to set them
>>>>>>>>>> yourself.
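On the netconsole suggestion earlier in the thread: a minimal sketch of what capturing an oops over the network looks like (the addresses, interface, and MAC below are placeholders; the exact parameter syntax is in the netconsole.txt document Sunil linked):

    # On the crashing node: forward kernel messages to a log host.
    # Placeholder IPs/interface/MAC - substitute your own.
    sudo modprobe netconsole netconsole=6666@192.168.0.5/eth0,514@192.168.0.9/00:11:22:33:44:55
    # On the log host: capture whatever arrives on UDP port 514.
    nc -u -l -p 514 | tee oops.log

No reboot, head, or X11 is needed, which should suit your headless production boxes.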
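On the heartbeat threshold question: per the OCFS2 FAQ Tao linked, the disk heartbeat timeout works out to (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, so the common value of 61 gives 120 seconds, while the 7 in your /etc/default/o2cb gives only 12. If you decide to raise it, a rough sketch of the procedure (the subcommands are from the stock o2cb init script; double-check them against your Ubuntu package, and note the values must match on every node before remounting):

    # Sketch only - verify against your distro's o2cb script before running.
    # Timeout math: (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds,
    # e.g. 61 -> 120s, while the current setting of 7 -> 12s.
    sudo /etc/init.d/o2cb offline mycluster   # stop the cluster stack
    sudo /etc/init.d/o2cb configure           # interactively set threshold and timeouts
    sudo /etc/init.d/o2cb online mycluster    # bring it back up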
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
