You're getting multipathing errors, which suggests it's most likely not a filesystem-level issue. See if you can get the logs from the storage array as well; there may be some detail there about what is happening.
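When you go through the OSS logs, a quick tally of which path the failures land on will show whether it is always the same connection. A rough sketch (run here on a few sample lines copied from the log below; against the real log you would grep /var/log/messages instead):

```shell
# Count device-mapper path failures per major:minor device number.
# On the OSS the input would come from:
#   grep 'multipath: Failing path' /var/log/messages
printf '%s\n' \
  'Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.' \
  'Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.' \
  'Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.' |
awk '{sub(/\.$/, "", $NF); print $NF}' | sort | uniq -c | sort -rn
```

If one major:minor number dominates the tally, that path (and its cable and SAS ports) is the first suspect.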
Can you check your logs and determine whether it is a single connection that is always failing? If so, can you try replacing the cable and see if that clears it up? The next step would be checking that the source and destination SAS ports are good.

-Ben Evans

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Angelo Cavalcanti <acrribe...@gmail.com>
Date: Monday, March 28, 2016 at 10:01 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Error Lustre/multipath/storage

Dear all,

We're having trouble with a Lustre 2.5.3 implementation. This is our setup:

* One server for MGS/MDS/MDT. The MDT is served from a RAID-6-backed partition of 2 TB (what type of disk?).
* Two OSS/OST nodes in an active/active HA configuration with Pacemaker. Both are connected to the storage via SAS.
* One SGI InfiniteStorage IS5600 with two RAID-6-backed volume groups. Each group has two volumes, and each volume has 15 TB of capacity. The volumes are recognized by the OSSs as multipath devices, each with four paths. The volumes were created with a GPT partition table and a single partition.
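Since each volume has four paths, one way to rule out a single bad cable or SAS port, independently of Lustre, is to push a direct read through each path device on its own. A dry-run sketch, using the sd names that show up in the kernel log below (the echo just prints each command; drop it to actually run the reads on the OSS):

```shell
# Print a read-only, direct-I/O read test for each SCSI path device.
# 'multipath -ll' on the OSS shows which sdX nodes back which map;
# the four names here are taken from the kernel log in this message.
for dev in sda sde sdi sdm; do
  echo "dd if=/dev/$dev of=/dev/null bs=1M count=1024 iflag=direct"
done
```

A path whose raw reads fail while the others succeed points at that cable or port rather than at the filesystem.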
The volume partitions were then formatted as OSTs with the following command:

# mkfs.lustre --replace --reformat --ost \
    --mkfsoptions=" -E stride=128,stripe_width=1024" \
    --mountfsoptions="errors=remount-ro,extents,mballoc" \
    --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
    --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
    /dev/mapper/360080e500029eaec0000012656951fcap1

We tested with bonnie++ on a client using the command below:

$ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f -b -u vhpc

There is no problem creating files inside the Lustre mount point, but *rewriting* the same files results in the errors below:

Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2
Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1
Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up

The multipath configuration ( /etc/multipath.conf ) is below, and it is correct according to the vendor (SGI).
defaults {
        user_friendly_names no
}

blacklist {
        wwid "*"
}

blacklist_exceptions {
        wwid "360080e500029eaec0000012656951fca"
        wwid "360080e500029eaec0000012956951fcb"
        wwid "360080e500029eaec0000012c56951fcb"
        wwid "360080e500029eaec0000012f56951fcb"
}

devices {
        device {
                vendor "SGI"
                product "IS.*"
                product_blacklist "Universal Xport"
                getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
                prio "rdac"
                features "2 pg_init_retries 50"
                hardware_handler "1 rdac"
                path_grouping_policy "group_by_prio"
                failback "immediate"
                rr_weight "uniform"
                no_path_retry 30
                retain_attached_hw_handler "yes"
                detect_prio "yes"
                #rr_min_io 1000
                path_checker "rdac"
                #selector "round-robin 0"
                #polling_interval 10
        }
}

multipaths {
        multipath {
                wwid "360080e500029eaec0000012656951fca"
        }
        multipath {
                wwid "360080e500029eaec0000012956951fcb"
        }
        multipath {
                wwid "360080e500029eaec0000012c56951fcb"
        }
        multipath {
                wwid "360080e500029eaec0000012f56951fcb"
        }
}

Many combinations of OST formatting options were tried, with both internal and external journaling, but the same errors persist. The same bonnie++ tests were repeated on all volumes of the storage using only ext4, all successful.

Regards,
Angelo
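One thing worth double-checking in the mkfs.lustre command further up is whether stride=128 and stripe_width=1024 actually match the IS5600 geometry. A sketch of the arithmetic, assuming 4 KiB ext4 blocks, a 512 KiB per-disk segment size, and 8 data disks per RAID-6 group (all three numbers are assumptions to verify against the real array configuration):

```shell
# Recompute the ext4 RAID hints from assumed array geometry.
# stride       = per-disk segment size in filesystem blocks
# stripe_width = stride * number of data disks (one full stripe)
block_kib=4        # assumed ext4 block size
segment_kib=512    # assumed IS5600 segment size per disk
data_disks=8       # assumed data disks in the RAID-6 group
stride=$((segment_kib / block_kib))
stripe_width=$((stride * data_disks))
echo "stride=$stride stripe_width=$stripe_width"
```

Under these assumptions the result matches the options used above; note, though, that a geometry mismatch would only hurt performance, while the DID_SOFT_ERROR failures in the log point more toward the SAS transport.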
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org