Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list

2011-09-30 Thread Herman L
On Thursday, September 29, 2011 2:04 PM Sunil Mushran wrote:
On 09/29/2011 08:56 AM, Herman L wrote:
 On Wednesday, September 21, 2011 4:00 PM, Sunil Mushran wrote:
 On 09/21/2011 12:37 PM, Herman L wrote:
 On 09/19/2011 08:35 AM, Herman L wrote:
 Hi all,

 Got a couple of these messages recently, but I don't know what they 
 mean.  Can anyone let me know if I need to panic?  I'm using OCFS2 
 compiled from the kernel source of RHEL 6.0's 2.6.32-71.18.2.el6.x86_64.

 Sep 19 08:07:15 server-1 kernel: [3892420.40] 
 (10387,12):dlm_lockres_release:507 ERROR: Resource 
 W0000000000000001b027d69b591f15 not on the Tracking list
 Sep 19 08:07:15 server-1 kernel: [3892420.398194] lockres: 
 W0000000000000001b027d69b591f15, owner=1, state=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398195]  last used: 
 8197071325, refcnt: 0, on purge list: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398197]  on dirty list: no, on 
 reco list: no, migrating pending: no
 Sep 19 08:07:15 server-1 kernel: [3892420.398198]  inflight locks: 0, 
 asts reserved: 0
 Sep 19 08:07:15 server-1 kernel: [3892420.398199]  refmap nodes: [ ], 
 inflight=0
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]  granted queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398200]  converting queue:
 Sep 19 08:07:15 server-1 kernel: [3892420.398201]  blocked queue:

 Thanks!
 Herman
 From: Sunil Mushran
 To: Herman L
 Sent: Monday, September 19, 2011 12:57 PM
 Subject: Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource 
 W0000000000000001b027d69b591f15 not on the Tracking list

  I've no idea of the state of the source that you are using. The message
  is a warning indicating a race. While it probably did not affect the
  functioning, there is no guarantee that that would be the case the next
  time around.

  The closest relevant patch is over 2 years old.
 http://oss.oracle.com/git/?p=smushran/linux-2.6.git;a=commit;h=b0d4f817ba5de8adb875ace594554a96d7737710
 Thanks Sunil for responding.  I know you can't easily support my setup, 
 but anyways I checked the sources.

 Looks like the patch you mention is in the sources I compiled from ( 
 RHEL6.0 kernel-2.6.32-71.24.1.el6.src.rpm ), so I guess the source of the 
 problem is elsewhere.

 The fs/ocfs2 directory from the RHEL6 sources I compiled from is almost 
 exactly the same as the mainline 2.6.32 kernel, except
 1) It looks like they implemented the changes in aops.c from the cleanup 
 blockdev_direct_IO locking patch that's in 2.6.33.
 2) In journal.c, they rename ocfs2_commit_trigger to ocfs2_frozen_trigger, 
 which seems to be from 2.6.35.
 3) In cluster/masklog.c they add a const to the mlog_attr_ops declaration
 4) And in quota.h, they are missing #define QFMT_OCFS2 3

 Not sure if that helps any, but thanks in any case!
 All those changes are ok. And unrelated. This is a new one.

 Sorry, I think I accidentally wrote a message with only the quoted block... 
 oops.  Sorry.


 Sunil, are you able to and interested in looking at this issue?  If so, is 
 there any information that I can provide that might help?  Fortunately, 
 after those few initial days of daily errors, it seems to have stopped for 
 now.  But of course, I'm still worried about this.


http://oss.oracle.com/~smushran/0001-ocfs2-dlm-Use-dlm-track_lock-when-adding-resource-to.patch

This should fix it. But do note that the patch is untested.

Thanks for the quick reply and patch!  I'll try to test it out when I get a 
chance.  Also, is there any way to force this error so that I can tell whether 
the patch is working?  And now that you have a fix for this, can you make any 
kind of guess as to how likely it is, or under what circumstances, the 
unpatched OCFS2 will cause dangerous problems?

Thanks,
Herman

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


[Ocfs2-users] Problem OCFS2 + storage

2011-09-30 Thread Anderson J. Dominitini

I have a problem with my processing + storage (OCFS2) setup. When we start
the workflow, the head node shows some processes in the 'D' state.

We see some locks appearing during processing, and some directories become
inaccessible: we can't run commands like ls or rm on them.

I would like to know how to solve this problem!
  
  Thanks!
  
  Anderson
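
When reporting a hang like this, it usually helps to list exactly which tasks are in the 'D' (uninterruptible sleep) state and where in the kernel they are waiting. The following is a minimal Python sketch of one way to collect that, assuming a Linux /proc filesystem in which /proc/<pid>/stat and /proc/<pid>/wchan are readable. The function names parse_stat_state and uninterruptible_tasks are made up for this illustration and are not part of any OCFS2 tool.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm, wchan) for every process currently in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux system, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_state(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state != "D":
            continue
        try:
            # wchan names the kernel function the task sleeps in, if exposed.
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in uninterruptible_tasks():
        print(pid, comm, wchan)
```

Run on the head node while the workflow is stuck, the output gives one line per blocked process; posting that list together with the kernel log from the same period gives the list something concrete to look at.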
  



[Ocfs2-users] Kernel Bug

2011-09-30 Thread Nick Khamis
Hello Everyone,

I am running a 2-node Debian cluster (OCFS2, DRBD, Pacemaker). When I
issued an ifdown eth0 on one of the nodes, I got the following:

[ 4772.359815] ------------[ cut here ]------------
[ 4772.359815] kernel BUG at
/tmp/buildd/linux-2.6-2.6.32/debian/build/source_i386_none/fs/ocfs2/journal.c:1702!
[ 4772.359815] invalid opcode: 0000 [#1] SMP
[ 4772.359815] last sysfs file: /sys/fs/o2cb/interface_revision
[ 4772.359815] Modules linked in: ocfs2 jbd2 quota_tree
ocfs2_stack_o2cb ocfs2_stackglue ocfs2_dlmfs ocfs2_dlm
ocfs2_nodemanager configfs drbd lru_cache cn loop joydev usbhid hid sg
snd_intel8x0 sr_mod snd_ac97_codec cdrom ohci_hcd ac97_bus ehci_hcd
snd_pcm snd_timer psmouse i2c_piix4 usbcore snd ata_generic soundcore
parport_pc serio_raw pcspkr snd_page_alloc parport nls_base ac evdev
vboxguest e1000 button i2c_core battery ata_piix ext3 jbd mbcache
sd_mod crc_t10dif ahci libata thermal thermal_sys scsi_mod
[ 4772.359815]
[ 4772.359815] Pid: 2582, comm: ocfs2rec Tainted: GW
(2.6.32-5-686 #1) VirtualBox
[ 4772.359815] EIP: 0060:[c84f147a] EFLAGS: 00010246 CPU: 0
[ 4772.359815] EIP is at __ocfs2_recovery_thread+0x3af/0x146d [ocfs2]
[ 4772.359815] EAX: 0002 EBX: c5635000 ECX: 0001 EDX: 0001
[ 4772.359815] ESI: 0001 EDI: c1e592b0 EBP:  ESP: c5b77ed4
[ 4772.359815]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 4772.359815] Process ocfs2rec (pid: 2582, ti=c5b76000 task=c54f5980
task.ti=c5b76000)
[ 4772.359815] Stack:
[ 4772.359815]  c54f5980 c54f5980 c563509c 0002 0001 c5635000
c1e592b0 c1e592e0
[ 4772.359815] 0 0002 c768e063 c768e080  0246
c768b248 c1805e20 
[ 4772.359815] 0 c1f82aac  c5704c40 c5704c40 c1413c48
 c767198b c76719a8
[ 4772.359815] Call Trace:
[ 4772.359815]  [c126d21e] ? schedule+0x78f/0x7dc
[ 4772.359815]  [c84f10cb] ? __ocfs2_recovery_thread+0x0/0x146d [ocfs2]
[ 4772.359815]  [c1043dec] ? kthread+0x61/0x66
[ 4772.359815]  [c1043d8b] ? kthread+0x0/0x66
[ 4772.359815]  [c1003d47] ? kernel_thread_helper+0x7/0x10
[ 4772.359815] Code: 00 00 68 24 f7 52 c8 50 ff b2 2c 01 00 00 68 29
85 53 c8 e8 b4 b4 d7 f8 83 c4 20 8b 5c 24 14 8b 44 24 0c 39 83 bc 00
00 00 75 04 0f 0b eb fe 8d 84 24 d0 00 00 00 c7 84 24 d0 00 00 00 00
00 00
[ 4772.359815] EIP: [c84f147a] __ocfs2_recovery_thread+0x3af/0x146d
[ocfs2] SS:ESP 0068:c5b77ed4
[ 4772.363813] ---[ end trace a7919e7f17c0a727 ]---
[ 4776.325831] (1704,0):dlm_get_lock_resource:839
3A791AB36DED41008E58CEF52EBEEFD3:$RECOVERY: at least one node (1) to
recover before lock mastery can begin
[ 4776.325831] (1704,0):dlm_get_lock_resource:873
3A791AB36DED41008E58CEF52EBEEFD3: recovery map is not empty, but must
master $RECOVERY lock now
[ 4776.325831] (1704,0):dlm_do_recovery:523 (1704) Node 2 is the
Recovery Master for the Dead Node 1 for Domain
3A791AB36DED41008E58CEF52EBEEFD3

Thanks in advance,

Nick.



Re: [Ocfs2-users] dlm_lockres_release:507 ERROR: Resource W0000000000000001b027d69b591f15 not on the Tracking list

2011-09-30 Thread Sunil Mushran
On 09/30/2011 06:49 AM, Herman L wrote:
 Thanks for the quick reply and patch!  I'll try to test it out when I get a 
 chance.  Also, is there any way to force this error so that I can tell whether 
 the patch is working?  And now that you have a fix for this, can you make any 
 kind of guess as to how likely it is, or under what circumstances, the 
 unpatched OCFS2 will cause dangerous problems?

Well, the first goal is always to see that nothing else is breaking. That's the
most important bit. As far as fixing the issue goes, only time will tell. There
is no way I can think of that will definitely prove that the issue is resolved.
Also, even if it does reproduce, it does not mean that this patch is bad. It
could be that there is another race that we have to plug.

Depends on the definition of dangerous. If it means cluster-wide corruption, or
cluster-wide outage, then no. But if it means a node crashing, then yes. Though
the chance of that is fairly low.
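
Since only time will tell, one practical option is to keep a running count of the message per host and per day and watch whether it comes back after patching. The following is a minimal Python sketch, assuming the kernel log is in the classic syslog format quoted in this thread with each message on a single line (the mail above wraps it across lines). The name count_tracking_errors and the pattern are illustrative assumptions, not part of OCFS2.

```python
import re
from collections import Counter

# Matches the one-line syslog form of the message quoted in this thread:
# "<Mon> <day> <time> <host> kernel: ... dlm_lockres_release:<line> ERROR:
#  Resource <name> not on the Tracking list"
PATTERN = re.compile(
    r"^(?P<day>\w{3}\s+\d{1,2}) \d{2}:\d{2}:\d{2} (?P<host>\S+) kernel:.*"
    r"dlm_lockres_release:\d+ ERROR: Resource \S+ not on the Tracking list"
)


def count_tracking_errors(lines):
    """Count 'not on the Tracking list' errors per (host, day)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("host"), match.group("day"))] += 1
    return counts


# The first log line from this thread, rejoined onto one line.
SAMPLE = (
    "Sep 19 08:07:15 server-1 kernel: [3892420.40] "
    "(10387,12):dlm_lockres_release:507 ERROR: Resource "
    "W0001b027d69b591f15 not on the Tracking list"
)

if __name__ == "__main__":
    for (host, day), n in sorted(count_tracking_errors([SAMPLE]).items()):
        print(host, day, n)
```

To use it on a real system, pass an open log file instead of the sample list, for example count_tracking_errors(open(path)), where path is wherever your distribution writes kernel messages. A count that stays at zero does not prove the race is gone, as noted above, but a non-zero count after patching is worth reporting.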
