Re: [ewg] possible bug in rds?

2010-03-10 Thread Eli Cohen
On Wed, Mar 10, 2010 at 03:51:36PM -0800, Andy Grover wrote:
> 
> I've opened a bug:
> 
> https://bugs.openfabrics.org/show_bug.cgi?id=1983
> 
> Did this just start happening?  What is the test doing when this
> occurred? Please add to the bug if possible, and I'll try to diagnose
> further.
> 

I have posted a follow-up response in Bugzilla.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] possible bug in rds?

2010-03-10 Thread Andy Grover
Eli Cohen wrote:
> Hi Andy,
> 
> In our regression tests we've encountered a kernel oops with the
> following stack dump:



> Examining the dump, I see that the failure results from calling
> hlist_del() twice on the same node: RCX holds the poisoned pointer
> 00200200, which hlist_del() leaves behind in a node it has unlinked.
> Could rds be calling rdma_destroy_id() in a way that results in the
> described behaviour?

I've opened a bug:

https://bugs.openfabrics.org/show_bug.cgi?id=1983

Did this just start happening?  What is the test doing when this
occurred? Please add to the bug if possible, and I'll try to diagnose
further.

Thanks -- Regards -- Andy



[ewg] possible bug in rds?

2010-03-10 Thread Eli Cohen
Hi Andy,

In our regression tests we've encountered a kernel oops with the
following stack dump:


Call trace:
Mar  1 05:45:50 sw134 kernel: mlx4_en: eth2: Link Down
Mar  1 05:46:00 sw134 kernel: mlx4_en: eth2: Link Up
Mar  1 05:46:00 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar  1 05:46:01 sw134 /usr/sbin/cron[16940]: (root) CMD (/mswg/projects/test_suite2/etc/check_daemon.csh >/dev/null)
Mar  1 05:46:01 sw134 /usr/sbin/cron[16941]: (root) CMD (/usr/check_mswg.csh >/dev/null)
Mar  1 05:46:01 sw134 /usr/sbin/cron[16942]: (root) CMD (/.autodirect/LIT/CRONTABS/do_it_now.sh > /dev/null)
Mar  1 05:46:03 sw134 kernel: Unable to handle kernel paging request at 00200200 RIP:
Mar  1 05:46:03 sw134 kernel: {:rdma_cm:rdma_destroy_id+399}
Mar  1 05:46:03 sw134 kernel: PGD 0
Mar  1 05:46:03 sw134 kernel: Oops: 0002 [1] SMP
Mar  1 05:46:03 sw134 kernel: last sysfs file: /class/infiniband/mlx4_0/ports/1/gids/127
Mar  1 05:46:03 sw134 kernel: CPU 0
Mar  1 05:46:03 sw134 kernel: Modules linked in: 8021q mst_pciconf mst_pci rdma_ucm rds_tcp rds_rdma rds ib_ucm ib_sdp rdma_cm iw_cm ib_addr ib_cm ib_sa ib_uverbs ib_umad mlx4_en mlx4_core ib_mad ib_core memtrack autofs4 cpufreq_ondemand cpufreq_userspace cpufreq_powersave powernow_k8 freq_table nfs lockd nfs_acl sunrpc ipv6 af_packet dock button battery ac apparmor nls_iso8859_1 nls_cp437 vfat fat loop dm_mod ohci_hcd ide_cd cdrom generic ehci_hcd shpchp pci_hotplug i2c_piix4 i2c_core usbcore mptctl tg3 floppy ext3 jbd edd fan thermal processor mptsas mptscsih sg mptbase scsi_transport_sas sata_svw libata serverworks sd_mod scsi_mod ide_disk ide_core
Mar  1 05:46:03 sw134 kernel: Pid: 15000, comm: krdsd Tainted: GU 2.6.16.60-0.54.5-smp #1
Mar  1 05:46:03 sw134 kernel: RIP: 0010:[] {:rdma_cm:rdma_destroy_id+399}
Mar  1 05:46:03 sw134 kernel: RSP: 0018:81000dad7dd8  EFLAGS: 00010206
Mar  1 05:46:03 sw134 kernel: RAX: 00100100 RBX: 81012d2ba740 RCX: 00200200
Mar  1 05:46:03 sw134 kernel: RDX: 81010ee445b8 RSI: 8101248c0048 RDI: 81012bdaf800
Mar  1 05:46:03 sw134 kernel: RBP: 81010ee44400 R08:  R09:
Mar  1 05:46:03 sw134 kernel: R10:  R11:  R12: 8101248c0048
Mar  1 05:46:03 sw134 kernel: R13: 8101248c0290 R14: 8846ca40 R15:
Mar  1 05:46:03 sw134 kernel: FS:  2b2f96622ae0() GS:803dc000() knlGS:
Mar  1 05:46:03 sw134 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
Mar  1 05:46:03 sw134 kernel: CR2: 00200200 CR3: 00101000 CR4: 06e0
Mar  1 05:46:03 sw134 kernel: Process krdsd (pid: 15000, threadinfo 81000dad6000, task 810126bf5040)
Mar  1 05:46:03 sw134 kernel: Stack: 81012bdaf800 81012bdaf800 81010e7fa000 884891f4
Mar  1 05:46:03 sw134 kernel:        81000100f700 230b363c2cf0 0002220f 0f5eb200
Mar  1 05:46:03 sw134 kernel:        8101248c0048 802f0652
Mar  1 05:46:03 sw134 kernel: Call Trace: {:rds_rdma:rds_ib_conn_shutdown+477}
Mar  1 05:46:03 sw134 kernel:        {mutex_lock+13} {:rds:rds_shutdown_worker+163}
Mar  1 05:46:04 sw134 kernel:        {run_workqueue+139} {worker_thread+0}
Mar  1 05:46:04 sw134 kernel:        {keventd_create_kthread+0} {worker_thread+244}
Mar  1 05:46:04 sw134 kernel:        {default_wake_function+0} {kthread+236}
Mar  1 05:46:04 sw134 kernel:        {child_rip+8} {keventd_create_kthread+0}
Mar  1 05:46:04 sw134 kernel:        {kthread+0} {child_rip+0}
Mar  1 05:46:04 sw134 kernel:
Mar  1 05:46:04 sw134 kernel: Code: 48 89 01 74 04 48 89 48 08 48 c7 85 b8 01 00 00 00 01 10 00
Mar  1 05:46:04 sw134 kernel: RIP {:rdma_cm:rdma_destroy_id+399} RSP
Mar  1 05:46:04 sw134 kernel: CR2: 00200200
Mar  1 05:46:09 sw134 kernel:  <6>mlx4_en: eth2: Link Down
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar  1 05:46:20 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Down
Mar  1 05:46:20 sw134 kernel: mlx4_en: eth2: Link Up
Mar  1 05:46:21 sw134 kernel: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready


Examining the dump, I see that the failure results from calling
hlist_del() twice on the same node: RCX holds the poisoned pointer
00200200, which hlist_del() leaves behind in a node it has unlinked.
Could rds be calling rdma_destroy_id() in a way that results in the
described behaviour?
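
To illustrate the reasoning, here is a minimal user-space sketch of how
hlist_del() poisons a node. It is modelled on the list helpers in
include/linux/list.h of that kernel era and is not the rdma_cm code
itself. The poison constants match the RAX (00100100) and RCX (00200200)
values in the dump. node_already_deleted() and demo_double_delete() are
hypothetical helpers added only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poison values the kernel of that era stores into an unlinked node. */
#define LIST_POISON1 ((void *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((void *)(uintptr_t)0x00200200)

struct hlist_node {
	struct hlist_node *next;
	struct hlist_node **pprev;
};

struct hlist_head {
	struct hlist_node *first;
};

void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

void hlist_del(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	/*
	 * On a second call, pprev == LIST_POISON2 and next == LIST_POISON1,
	 * so this store writes to address 0x00200200 and faults.
	 */
	*pprev = next;
	if (next)
		next->pprev = pprev;
	n->next = LIST_POISON1;
	n->pprev = LIST_POISON2;
}

/* Hypothetical helper: has this node already been through hlist_del()? */
int node_already_deleted(const struct hlist_node *n)
{
	return n->pprev == (struct hlist_node **)LIST_POISON2;
}

/* Hypothetical demo: delete once, then detect instead of deleting again. */
int demo_double_delete(void)
{
	struct hlist_head head = { NULL };
	struct hlist_node a, b;

	hlist_add_head(&a, &head);
	hlist_add_head(&b, &head);	/* list is now: b -> a */
	hlist_del(&a);

	assert(head.first == &b && b.next == NULL);
	assert(a.next == (struct hlist_node *)LIST_POISON1);

	/* A real second hlist_del(&a) here would fault as in the oops. */
	return node_already_deleted(&a);
}
```

The sketch deliberately stops short of the second hlist_del() call, since
that store is the one that would fault. This is consistent with the oops
showing the fault address (CR2) equal to the poisoned pprev value.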


