You're getting multipathing errors, which means it's most likely not a 
filesystem-level issue. See if you can get the logs from the storage array as 
well; there may be detail there about what is happening.

Can you check your logs and determine whether it's a single connection that is 
always failing? If so, try replacing that cable and see if that clears it up. 
Next, check that the source and destination SAS ports are good.
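As a sketch of that first check, something like this would tally how often each device-mapper path fails, so one consistently bad connection stands out (the syslog location is an assumption; adjust for your distro):

```shell
# Count device-mapper "Failing path" events per path (major:minor device
# number), reading syslog lines on stdin. One path dominating the tally points
# at a single cable/port; an even spread suggests a shared cause upstream.
tally_failed_paths() {
  grep 'multipath: Failing path' | awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Example usage (hypothetical path):
# tally_failed_paths < /var/log/messages
```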

-Ben Evans

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Angelo Cavalcanti <acrribe...@gmail.com>
Date: Monday, March 28, 2016 at 10:01 AM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Error Lustre/multipath/storage


Dear all,

We're having trouble with a Lustre 2.5.3 installation. This is our setup:


  *   One server for MGS/MDS/MDT. The MDT is served from a 2 TB RAID-6-backed 
partition.


  *   Two OSS/OST servers in an active/active HA pair managed by Pacemaker. 
Both are connected to the storage via SAS.


  *   One SGI InfiniteStorage IS5600 with two RAID-6-backed volume groups. Each 
group has two volumes, and each volume has 15 TB of capacity.


The OSSs recognize the volumes as multipath devices; each volume has 4 paths. 
Volumes were created with a GPT partition table and a single partition.


Volume partitions were then formatted as OSTs with the following command:


# mkfs.lustre --replace --reformat --ost \
    --mkfsoptions=" -E stride=128,stripe_width=1024" \
    --mountfsoptions="errors=remount-ro,extents,mballoc" \
    --fsname=lustre1 --mgsnode=10.149.0.153@o2ib1 --index=0 \
    --servicenode=10.149.0.151@o2ib1 --servicenode=10.149.0.152@o2ib1 \
    /dev/mapper/360080e500029eaec0000012656951fcap1
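For reference, those -E options encode a particular RAID geometry. Assuming ldiskfs's default 4 KiB block size (an assumption, as is the data-disk count it implies), the numbers work out as:

```python
# Sanity-check the mkfs -E stride/stripe_width values against the assumed
# RAID-6 geometry. Both options are expressed in filesystem blocks.
block_size = 4096                        # ldiskfs/ext4 default block size (assumed)
stride = 128                             # from -E stride=128
stripe_width = 1024                      # from -E stripe_width=1024

chunk = stride * block_size              # per-disk chunk size, in bytes
full_stripe = stripe_width * block_size  # full data stripe, in bytes
data_disks = stripe_width // stride      # implied number of data disks

print(chunk // 1024, "KiB chunk,", full_stripe // 2**20, "MiB stripe,", data_disks, "data disks")
# -> 512 KiB chunk, 4 MiB stripe, 8 data disks
```

If that inferred geometry (512 KiB segments, 8 data disks, i.e. 8+2 drives per RAID-6 group) doesn't match the IS5600 volume-group settings, the stride/stripe_width values would be worth revisiting, though that alone wouldn't explain path failures.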


Testing with bonnie++ from a client using the command below:

$ ./bonnie++-1.03e/bonnie++ -m lustre1 -d /mnt/lustre -s 128G:1024k -n 0 -f -b -u vhpc


There is no problem creating files inside the Lustre mount point, but 
*rewriting* the same files results in the errors below:


Mar 18 17:46:13 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:13 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: __ratelimit: 109 callbacks suppressed
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:192.
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 1:0:1:0: [sdm] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:1:0: [sde] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:64.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:13 oss01 kernel: device-mapper: multipath: Failing path 8:0.
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Unhandled error code
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:13 oss01 kernel: sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
Mar 18 17:46:14 oss01 multipathd: 8:128: reinstated
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 4
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 07 18 22 00 18 00 00
Mar 18 17:46:14 oss01 kernel: device-mapper: multipath: Failing path 8:128.
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Unhandled error code
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Mar 18 17:46:14 oss01 kernel: sd 1:0:0:0: [sdi] CDB: Read(10): 28 00 00 06 d8 22 00 20 00 00
Mar 18 17:46:14 oss01 multipathd: 8:128: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 3
Mar 18 17:46:14 oss01 multipathd: 8:192: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 2
Mar 18 17:46:14 oss01 multipathd: 8:0: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 1
Mar 18 17:46:14 oss01 multipathd: 8:64: mark as failed
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: remaining active paths: 0
Mar 18 17:46:14 oss01 multipathd: 360080e500029eaec0000012656951fca: Entering recovery mode: max_retries=30
Mar 18 17:46:19 oss01 multipathd: 360080e500029eaec0000012656951fca: sdi - rdac checker reports path is up
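Since the errors above hit devices on both SCSI hosts (sd 0:* and sd 1:*), a quick sketch of how one might check whether they cluster on a single SAS HBA (log location is an assumption):

```shell
# Count "Unhandled error code" events per SCSI host number (the first field
# of the H:C:T:L address, i.e. the HBA). Reads kernel log lines on stdin.
errors_per_host() {
  grep 'Unhandled error code' \
    | sed -n 's/.*sd \([0-9]\{1,\}\):[0-9]*:[0-9]*:[0-9]*.*/\1/p' \
    | sort | uniq -c
}

# Example usage (hypothetical path):
# errors_per_host < /var/log/messages
```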


The multipath configuration ( /etc/multipath.conf ) is below; the vendor (SGI) 
has confirmed it is correct.


defaults {
       user_friendly_names no
}

blacklist {
       wwid "*"
}

blacklist_exceptions {
       wwid "360080e500029eaec0000012656951fca"
       wwid "360080e500029eaec0000012956951fcb"
       wwid "360080e500029eaec0000012c56951fcb"
       wwid "360080e500029eaec0000012f56951fcb"
}

devices {
      device {
        vendor                       "SGI"
        product                      "IS.*"
        product_blacklist            "Universal Xport"
        getuid_callout               "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        prio                         "rdac"
        features                     "2 pg_init_retries 50"
        hardware_handler             "1 rdac"
        path_grouping_policy         "group_by_prio"
        failback                     "immediate"
        rr_weight                    "uniform"
        no_path_retry                30
        retain_attached_hw_handler   "yes"
        detect_prio                  "yes"
        #rr_min_io                   1000
        path_checker                 "rdac"
        #selector                    "round-robin 0"
        #polling_interval            10
      }
}

multipaths {
       multipath {
               wwid "360080e500029eaec0000012656951fca"
       }
       multipath {
               wwid "360080e500029eaec0000012956951fcb"
       }
       multipath {
               wwid "360080e500029eaec0000012c56951fcb"
       }
       multipath {
               wwid "360080e500029eaec0000012f56951fcb"
       }
}


Many combinations of OST formatting options were tried, with both internal and 
external journaling, but the same errors persist.


The same bonnie++ tests were repeated on all volumes of the storage using plain 
ext4, and all succeeded.


Regards,

Angelo
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
