Hi Tom,

I'm running a vanilla 2.6.27.10 kernel, but I'll try with 2.6.29.

Thanks,

Diego

Sysctl config on server:

[r...@twing ~]# cat /etc/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled.  See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding
net.ipv4.ip_forward = 0

# Controls source route verification
net.ipv4.conf.default.rp_filter = 1

# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0

# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0

# Controls whether core dumps will append the PID to the core filename
# Useful for debugging multi-threaded applications
kernel.core_uses_pid = 1

# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1

# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536

# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 65536

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 68719476736

# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 4294967296
## MLX4_EN tuning parameters ##
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
## END MLX4_EN ##
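
Since the question was whether the server has any special sysctl settings, a quick way to compare the two ends is to parse each sysctl.conf and print only the keys that differ. This is a minimal sketch in Python, not something used in this thread: the function names parse_sysctl and diff_settings are hypothetical, and the client values below are made up for illustration.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into a {key: value} dict.

    Blank lines and lines starting with '#' or ';' are skipped,
    and runs of whitespace inside a value are collapsed.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a "key = value" line
        settings[key.strip()] = " ".join(value.split())
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# A few lines taken from the server config above.
SERVER_CONF = """\
## MLX4_EN tuning parameters ##
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.core.rmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
## END MLX4_EN ##
"""

# Hypothetical client values, for illustration only.
CLIENT_CONF = """\
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 0
"""

server = parse_sysctl(SERVER_CONF)
client = parse_sysctl(CLIENT_CONF)
for key, (srv, cli) in diff_settings(server, client).items():
    print(f"{key}: server={srv} client={cli}")
```

A key missing on one side shows up as None, which makes settings present on only one host easy to spot.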



[email protected] wrote:
In both cases the connection is being lost under load. This usually indicates a 
credit (slot count) mismatch, or an IRD/ORD one. What kernel version are you 
running on each end? Any special sysctl settings on the server?

The oops on the client is troubling, but it's happening in the error upcall and 
resembles a problem I fixed a while back. I'll check it when I get back to a 
source repo. It's not the cause of the issue though.

Tom.


-----Original Message-----

From:  Diego Moreno <[email protected]>
Subj:  Re: [Fwd: Re: [ofa-general][NFS/RDMA] Can't mount NFS/RDMA partition]]
Date:  Tue Apr 28, 2009 8:44 am
Size:  3K
To:  Vu Pham <[email protected]>
cc:  OpenIB <[email protected]>

Hi,

I'm working with Celine to get NFS/RDMA working. We installed new firmware (2.6.636). We still have the problem, but we now have more information from the client side.

- With the workaround (memreg 6) we can mount without any problem. We can read a file, but if we try to create a file with dd, the application hangs and we have to run 'umount -f'. There is no message on the server. Message on the client:

rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)


- With fast registration:

There is no message on the server. Client dmesg output with fast registration:


rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
------------[ cut here ]------------
WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3c/0x92()
Modules linked in: xprtrdma autofs4 hidp nfs lockd nfs_acl rfcomm l2cap bluetooth sunrpc iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad iw_nes ib_ipath ib_mthca dm_multipath scsi_dh raid0 sbs sbshc battery acpi_memhotplug ac parport_pc lp parport mlx4_ib ib_mad ib_core e1000e sr_mod joydev cdrom mlx4_core i5000_edac edac_core shpchp rtc_cmos sg pcspkr rtc_core rtc_lib i2c_i801 i2c_core serio_raw button dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 0, comm: swapper Not tainted 2.6.27_ofa_compil #2

Call Trace:
  <IRQ>  [<ffffffff80235b8d>] warn_on_slowpath+0x51/0x77
  [<ffffffff80229b79>] __wake_up+0x38/0x4f
  [<ffffffff80246d57>] __wake_up_bit+0x28/0x2d
  [<ffffffffa05485af>] rpc_wake_up_task_queue_locked+0x223/0x24b [sunrpc]
  [<ffffffffa054861e>] rpc_wake_up_status+0x47/0x82 [sunrpc]
  [<ffffffff80239c49>] local_bh_enable_ip+0x3c/0x92
  [<ffffffffa0638fd1>] rpcrdma_conn_func+0x6d/0x7c [xprtrdma]
  [<ffffffffa063b316>] rpcrdma_qp_async_error_upcall+0x45/0x5a [xprtrdma]
  [<ffffffffa0294bb3>] mlx4_ib_qp_event+0xf9/0x100 [mlx4_ib]
  [<ffffffff802443da>] __queue_work+0x22/0x32
  [<ffffffffa01fc5d4>] mlx4_qp_event+0x8a/0xad [mlx4_core]
  [<ffffffffa01f50a5>] mlx4_eq_int+0x55/0x291 [mlx4_core]
  [<ffffffffa01f52f0>] mlx4_msi_x_interrupt+0xf/0x16 [mlx4_core]
  [<ffffffff802624f4>] handle_IRQ_event+0x25/0x53
  [<ffffffff80263c0a>] handle_edge_irq+0xe3/0x123
  [<ffffffff8020e907>] do_IRQ+0xf1/0x15e
  [<ffffffff8020c381>] ret_from_intr+0x0/0xa
  <EOI>  [<ffffffffa0549c3e>] nul_marshal+0x0/0x20 [sunrpc]
  [<ffffffff80212474>] mwait_idle+0x41/0x45
  [<ffffffff8020abdf>] cpu_idle+0x7e/0x9c

---[ end trace 5cc994fbe7e141af ]---
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
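
The connect and close lines in the dmesg output above carry the parameters relevant to a credit or IRD/ORD mismatch (memreg, slots, ird) plus the close status. A minimal sketch in Python for pulling them out of a saved dmesg dump follows; the regular expressions are written against the exact line format shown in this mail only, and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Patterns match the two rpcrdma message shapes seen in this thread.
CONNECT_RE = re.compile(
    r"rpcrdma: connection to (?P<peer>\S+) on (?P<dev>\S+), "
    r"memreg (?P<memreg>\d+) slots (?P<slots>\d+) ird (?P<ird>\d+)"
)
CLOSE_RE = re.compile(
    r"rpcrdma: connection to (?P<peer>\S+) closed \((?P<code>-?\d+)\)"
)


def summarize(log_text):
    """Return (list of connect parameter dicts, Counter of close codes)."""
    connects = []
    closes = Counter()
    for line in log_text.splitlines():
        m = CONNECT_RE.search(line)
        if m:
            connects.append({
                "peer": m.group("peer"),
                "dev": m.group("dev"),
                "memreg": int(m.group("memreg")),
                "slots": int(m.group("slots")),
                "ird": int(m.group("ird")),
            })
            continue
        m = CLOSE_RE.search(line)
        if m:
            closes[int(m.group("code"))] += 1
    return connects, closes


# Lines copied from the client dmesg output above.
SAMPLE = """\
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 5 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
"""

connects, closes = summarize(SAMPLE)
for c in connects:
    print(c)
# On Linux, errno 103 is ECONNABORTED, so -103 means the connection was aborted.
print(dict(closes))
```

Running the same script on the client with each memreg setting gives values to compare against whatever the server side is configured with.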


Thanks,

Diego

Vu Pham wrote:
Celine Bourde wrote:
We still have the same problem, even after changing the registration method.

mount doesn't reply and this is the output of dmesg on client:

rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16
rpcrdma: connection to 192.168.0.215:2050 closed (-103)
rpcrdma: connection to 192.168.0.215:2050 on mlx4_0, memreg 6 slots 32 ird 16

--- message truncated ---



