Your message dated Sun, 30 Nov 2014 02:02:11 +0100
with message-id <[email protected]>
and subject line Re: [Netapp-Linux-Community] Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"
has caused the Debian Bug report #740701,
regarding multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"
to be marked as done.
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
740701: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: multipath-tools
Version: 0.4.9+git0.9c3c5172-1
Severity: important

Dear Maintainer,

* What led up to the situation?

We use multipath-connected devices on our OpenAFS servers.  We are in the
process of attempting to upgrade the systems to wheezy.  Our normal
procedure is to completely evacuate a server, rebuild it from scratch,
and reconfigure the storage.
     
* What exactly did you do (or not do) that was effective (or
  ineffective)?

On the systems we have tested:

  - mkfs on squeeze with a 2.6.32-5-amd64 kernel SUCCEEDS
  - mkfs on squeeze with a 3.2 backport kernel FAILS
  - mkfs on wheezy FAILS
  - mkfs on jessie FAILS

There is one exception to the above list.  We do have one wheezy system
where we are able to mkfs file systems on multipath-served devices.  These
systems are built using FAI, with the configuration being supplied by puppet.
We have not been able to figure out what is unique about this system.

The current system we have available to test with has a qlogic HBA card.
We have also tested with Emulex cards with the same results.

When mkfs fails, multipathd starts spewing errors on the console:

afssvr20:/etc/rsyslog.d# mkfs -t ext4 /dev/mapper/mpathb-part1
mke2fs 1.42.9 (4-Feb-2014)
[11712.731701] sd 3:0:0:20: [sdd] Unhandled sense code
[11712.800335] sd 3:0:0:20: [sdd]
[11712.838877] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11712.905537] sd 3:0:0:20: [sdd]
[11712.944077] Sense Key : Medium Error [current]
[11712.998246] sd 3:0:0:20: [sdd]
[11713.036788] Add. Sense: Incompatible medium installed
[11713.097195] sd 3:0:0:20: [sdd] CDB:
[11713.139903] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
[11713.212987] end_request: I/O error, dev sdd, sector 128
[11713.276032] device-mapper: multipath: Failing path 8:48.
[11713.340113] sd 4:0:1:20: [sdc] Unhandled sense code
[11713.398455] sd 4:0:1:20: [sdc]
[11713.437003] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11713.503669] sd 4:0:1:20: [sdc]
[11713.542218] Sense Key : Medium Error [current]
[11713.596408] sd 4:0:1:20: [sdc]
[11713.634948] Add. Sense: Incompatible medium installed
[11713.695357] sd 4:0:1:20: [sdc] CDB:
[11713.738077] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
[11713.811160] end_request: I/O error, dev sdc, sector 128
[11713.874213] device-mapper: multipath: Failing path 8:32.
[11713.938320] sd 3:0:1:20: [sde] Unhandled sense code
[11713.996662] sd 3:0:1:20: [sde]
[11714.035206] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11714.101868] sd 3:0:1:20: [sde]
[11714.140413] Sense Key : Medium Error [current]
[11714.194590] sd 3:0:1:20: [sde]
[11714.233136] Add. Sense: Incompatible medium installed
[11714.293553] sd 3:0:1:20: [sde] CDB:
[11714.336268] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
[11714.409350] end_request: I/O error, dev sde, sector 128
[11714.472357] device-mapper: multipath: Failing path 8:64.
[11714.536430] sd 4:0:0:20: [sdb] Unhandled sense code
[11714.594768] sd 4:0:0:20: [sdb]
[11714.633306] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11714.699970] sd 4:0:0:20: [sdb]
[11714.738506] Sense Key : Medium Error [current]
[11714.792681] sd 4:0:0:20: [sdb]
[11714.831221] Add. Sense: Incompatible medium installed
[11714.891625] sd 4:0:0:20: [sdb] CDB:
[11714.934333] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
[11715.007395] end_request: I/O error, dev sdb, sector 128
[11715.070374] device-mapper: multipath: Failing path 8:16.
[11715.134475] device-mapper: multipath: Failing path 8:48.
[11715.134690] sd 4:0:1:20: [sdc] Unhandled sense code
[11715.134691] sd 4:0:1:20: [sdc]
[11715.134692] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
.....
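For reference, the failing command in the trace above is UNMAP. Its 10-byte CDB can be decoded per SBC-3: opcode 0x42 is UNMAP, which happens to share its opcode value with MMC's READ SUB-CHANNEL command, hence the kernel's combined "Unmap/Read sub-channel" label. A minimal decoding sketch (illustrative only, not kernel code):

```python
# Decode the UNMAP CDB from the kernel log above (layout per SBC-3).
cdb = bytes.fromhex("42000000000000001800")

opcode = cdb[0]                                   # 0x42 = UNMAP (SBC-3); same
                                                  # value as MMC's READ SUB-CHANNEL
param_list_len = int.from_bytes(cdb[7:9], "big")  # bytes 7-8, big-endian

print(f"opcode=0x{opcode:02x}, parameter list length={param_list_len}")
# 24 bytes = 8-byte UNMAP parameter list header + one 16-byte block descriptor
```

So each failing request here carried a single unmap block descriptor.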

Here is the output from multipath -ll -v3

Mar 03 23:16:44 | loading /lib/multipath/libcheckdirectio.so checker
Mar 03 23:16:44 | loading /lib/multipath/libprioconst.so prioritizer
Mar 03 23:16:44 | sda: udev property ID_WWN whitelisted
Mar 03 23:16:44 | sda: not found in pathvec
Mar 03 23:16:44 | sda: mask = 0x25
Mar 03 23:16:44 | sda: dev_t = 8:0
Mar 03 23:16:44 | sda: size = 142082048
Mar 03 23:16:44 | sda: vendor = DELL    
Mar 03 23:16:44 | sda: product = PERC 6/i        
Mar 03 23:16:44 | sda: rev = 1.21
Mar 03 23:16:44 | sda: h:b:t:l = 0:2:0:0
Mar 03 23:16:44 | sda: path state = running

Mar 03 23:16:44 | sda: get_state
Mar 03 23:16:44 | sda: path checker = directio (internal default)
Mar 03 23:16:44 | sda: checker timeout = 90 ms (sysfs setting)
Mar 03 23:16:44 | directio: starting new request
Mar 03 23:16:44 | directio: io finished 4096/0
Mar 03 23:16:44 | sda: state = up
Mar 03 23:16:44 | sdc: udev property ID_WWN whitelisted
Mar 03 23:16:44 | sdc: not found in pathvec
Mar 03 23:16:44 | sdc: mask = 0x25
Mar 03 23:16:44 | sdc: dev_t = 8:32
Mar 03 23:16:44 | sdc: size = 11488198656
Mar 03 23:16:44 | sdc: vendor = NETAPP  
Mar 03 23:16:44 | sdc: product = LUN             
Mar 03 23:16:44 | sdc: rev = 811a
Mar 03 23:16:44 | sdc: h:b:t:l = 3:0:0:20
Mar 03 23:16:44 | SCSI target 3:0:0 -> FC rport 3:0-0
Mar 03 23:16:44 | sdc: tgt_node_name = 0x500a09808d44a0ea
Mar 03 23:16:44 | sdc: path state = running

Mar 03 23:16:44 | sdc: get_state
Mar 03 23:16:44 | loading /lib/multipath/libchecktur.so checker
Mar 03 23:16:44 | sdc: path checker = tur (controller setting)
Mar 03 23:16:44 | sdc: checker timeout = 30 ms (sysfs setting)
Mar 03 23:16:44 | sdc: state = up
Mar 03 23:16:44 | sde: udev property ID_WWN whitelisted
Mar 03 23:16:44 | sde: not found in pathvec
Mar 03 23:16:44 | sde: mask = 0x25
Mar 03 23:16:44 | sde: dev_t = 8:64
Mar 03 23:16:44 | sde: size = 11488198656
Mar 03 23:16:44 | sde: vendor = NETAPP  
Mar 03 23:16:44 | sde: product = LUN             
Mar 03 23:16:44 | sde: rev = 811a
Mar 03 23:16:44 | sde: h:b:t:l = 3:0:1:20
Mar 03 23:16:44 | SCSI target 3:0:1 -> FC rport 3:0-1
Mar 03 23:16:44 | sde: tgt_node_name = 0x500a09808d44a0ea
Mar 03 23:16:44 | sde: path state = running
Mar 03 23:16:44 | sde: get_state
Mar 03 23:16:44 | sde: path checker = tur (controller setting)
Mar 03 23:16:44 | sde: checker timeout = 30 ms (sysfs setting)
Mar 03 23:16:44 | sde: state = up
Mar 03 23:16:44 | sdb: udev property ID_WWN whitelisted
Mar 03 23:16:44 | sdb: not found in pathvec
Mar 03 23:16:44 | sdb: mask = 0x25
Mar 03 23:16:44 | sdb: dev_t = 8:16
Mar 03 23:16:44 | sdb: size = 11488198656
Mar 03 23:16:44 | sdb: vendor = NETAPP  
Mar 03 23:16:44 | sdb: product = LUN             
Mar 03 23:16:44 | sdb: rev = 811a
Mar 03 23:16:44 | sdb: h:b:t:l = 4:0:0:20
Mar 03 23:16:44 | SCSI target 4:0:0 -> FC rport 4:0-0
Mar 03 23:16:44 | sdb: tgt_node_name = 0x500a09808d44a0ea
Mar 03 23:16:44 | sdb: path state = running

Mar 03 23:16:44 | sdb: get_state
Mar 03 23:16:44 | sdb: path checker = tur (controller setting)
Mar 03 23:16:44 | sdb: checker timeout = 30 ms (sysfs setting)
Mar 03 23:16:44 | sdb: state = up
Mar 03 23:16:44 | sdd: udev property ID_WWN whitelisted
Mar 03 23:16:44 | sdd: not found in pathvec
Mar 03 23:16:44 | sdd: mask = 0x25
Mar 03 23:16:44 | sdd: dev_t = 8:48
Mar 03 23:16:44 | sdd: size = 11488198656
Mar 03 23:16:44 | sdd: vendor = NETAPP  
Mar 03 23:16:44 | sdd: product = LUN             
Mar 03 23:16:44 | sdd: rev = 811a
Mar 03 23:16:44 | sdd: h:b:t:l = 4:0:1:20
Mar 03 23:16:44 | SCSI target 4:0:1 -> FC rport 4:0-1
Mar 03 23:16:44 | sdd: tgt_node_name = 0x500a09808d44a0ea
Mar 03 23:16:44 | sdd: path state = running

Mar 03 23:16:44 | sdd: get_state
Mar 03 23:16:44 | sdd: path checker = tur (controller setting)
Mar 03 23:16:44 | sdd: checker timeout = 30 ms (sysfs setting)
Mar 03 23:16:44 | sdd: state = up
Mar 03 23:16:44 | sr0: blacklisted, udev property missing
Mar 03 23:16:44 | dm-0: blacklisted, udev property missing
Mar 03 23:16:44 | dm-1: blacklisted, udev property missing
Mar 03 23:16:44 | dm-2: blacklisted, udev property missing
===== paths list =====
uuid hcil     dev dev_t pri dm_st chk_st vend/prod/rev             dev_st 
     0:2:0:0  sda 8:0   -1  undef ready  DELL    ,PERC 6/i         running
     3:0:0:20 sdc 8:32  -1  undef ready  NETAPP  ,LUN              running
     3:0:1:20 sde 8:64  -1  undef ready  NETAPP  ,LUN              running
     4:0:0:20 sdb 8:16  -1  undef ready  NETAPP  ,LUN              running
     4:0:1:20 sdd 8:48  -1  undef ready  NETAPP  ,LUN              running
Mar 03 23:16:44 | params = 4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handler 0 2 1 service-time 0 2 2 8:64 1 1 8:16 1 1 service-time 0 2 2 8:32 1 1 8:48 1 1 
Mar 03 23:16:44 | status = 2 0 0 0 2 1 A 0 2 2 8:64 A 0 0 1 8:16 A 0 0 1 E 0 2 2 8:32 A 0 0 1 8:48 A 0 0 1 
Mar 03 23:16:44 | mpathb: disassemble map [4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handler 0 2 1 service-time 0 2 2 8:64 1 1 8:16 1 1 service-time 0 2 2 8:32 1 1 8:48 1 1 ]
Mar 03 23:16:44 | sde: mask = 0x8
Mar 03 23:16:44 | sde: path state = running

Mar 03 23:16:44 | sde: detect_prio = 2 (controller default)
Mar 03 23:16:44 | loading /lib/multipath/libprioalua.so prioritizer
Mar 03 23:16:44 | sde: prio = alua (detected setting)
Mar 03 23:16:44 | reported target port group is 0
Mar 03 23:16:44 | aas = 00 [active/optimized]
Mar 03 23:16:44 | sde: alua prio = 50
Mar 03 23:16:44 | sdb: mask = 0x8
Mar 03 23:16:44 | sdb: path state = running

Mar 03 23:16:44 | sdb: detect_prio = 2 (controller default)
Mar 03 23:16:44 | sdb: prio = alua (detected setting)
Mar 03 23:16:44 | reported target port group is 0
Mar 03 23:16:44 | aas = 00 [active/optimized]
Mar 03 23:16:44 | sdb: alua prio = 50
Mar 03 23:16:44 | sdc: mask = 0x8
Mar 03 23:16:44 | sdc: path state = running

Mar 03 23:16:44 | sdc: detect_prio = 2 (controller default)
Mar 03 23:16:44 | sdc: prio = alua (detected setting)
Mar 03 23:16:44 | reported target port group is 1
Mar 03 23:16:44 | aas = 01 [active/non-optimized]
Mar 03 23:16:44 | sdc: alua prio = 10
Mar 03 23:16:44 | sdd: mask = 0x8
Mar 03 23:16:44 | sdd: path state = running

Mar 03 23:16:44 | sdd: detect_prio = 2 (controller default)
Mar 03 23:16:44 | sdd: prio = alua (detected setting)
Mar 03 23:16:44 | reported target port group is 1
Mar 03 23:16:44 | aas = 01 [active/non-optimized]
Mar 03 23:16:44 | sdd: alua prio = 10
Mar 03 23:16:44 | mpathb: disassemble status [2 0 0 0 2 1 A 0 2 2 8:64 A 0 0 1 8:16 A 0 0 1 E 0 2 2 8:32 A 0 0 1 8:48 A 0 0 1 ]
mpathb (360a9800064704842534a6c5063513577) dm-0 NETAPP  ,LUN             
size=5.3T features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 3:0:1:20 sde 8:64 active ready running
| `- 4:0:0:20 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 3:0:0:20 sdc 8:32 active ready running
  `- 4:0:1:20 sdd 8:48 active ready running
Mar 03 23:16:44 | directio checker refcount 1
Mar 03 23:16:44 | tur checker refcount 4
Mar 03 23:16:44 | alua prioritizer refcount 4
Mar 03 23:16:44 | tur checker refcount 3
Mar 03 23:16:44 | alua prioritizer refcount 3
Mar 03 23:16:44 | tur checker refcount 2
Mar 03 23:16:44 | alua prioritizer refcount 2
Mar 03 23:16:44 | tur checker refcount 1
Mar 03 23:16:44 | alua prioritizer refcount 1
Mar 03 23:16:44 | unloading alua prioritizer
Mar 03 23:16:44 | unloading const prioritizer
Mar 03 23:16:44 | unloading tur checker
Mar 03 23:16:44 | unloading directio checker

Thanks for your help,

Bill

-- Package-specific info:
Contents of /etc/multipath.conf:
defaults {
   user_friendly_names yes
}
blacklist {
   devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
   devnode "^hd[a-z]"
   devnode "^cciss.*"
   wwid "36001e4f031414a000f8035050492d102"
}


-- System Information:
Debian Release: jessie/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.12-1-amd64 (SMP w/8 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash

Versions of packages multipath-tools depends on:
ii  initscripts         2.88dsf-51
ii  kpartx              0.4.9+git0.9c3c5172-1
ii  libaio1             0.3.109-4
ii  libc6               2.17-97
ii  libdevmapper1.02.1  2:1.02.83-2
ii  libgcc1             1:4.8.2-16
ii  libreadline6        6.2+dfsg-0.1
ii  libudev1            204-7
ii  lsb-base            4.1+Debian12
ii  udev                204-7

multipath-tools recommends no packages.

Versions of packages multipath-tools suggests:
pn  multipath-tools-boot  <none>

-- no debconf information

--- End Message ---
--- Begin Message ---
Hi all,

It's been a few months since I posted the last update here, and this thread deserves to know how the story ended...

On 09/01/2014 10:30 PM, Hans van Kranenburg wrote:
> On 07/31/2014 10:13 PM, Hans van Kranenburg wrote:
>> - Now I'm left with my broken NetApp, and I'd like to start using
>> UNMAP on it... Any comments from netapp people reading this? There must
>> be some reason why this is happening, and only on this specific lun, and
>> not on the test lun, or on several of the other NetApp filers we use.
> 
> Yes, the NetApp controllers are returning this MEDIUM ERROR check
> condition for some reason. I'd suggest you open a NetApp support ticket
> for tracking this.

The reason for the MEDIUM ERROR response was a shortage of disk space in the volume that holds the lun on which the UNMAP requests were being done.

Since we use (as shown earlier in this bug thread) only a single huge volume with a single huge lun, and the lun was created as big as possible inside the volume, there was no space left for ONTAP to use outside of the already existing lun to complete the operation. Increasing the volume size by a moderate number of GBs solved the issue.

Filesystem          kbytes       used      avail capacity  Mounted on
[...]
/vol/iscsi0/  10999562240 10738868780  260693460      98%  /vol/iscsi0/

Yes, in the end it can be a very simple thing that was overlooked all along. :-|

And so I did. The answer from NetApp support is:

"Since the UNMAP command is a host OS related command, it is not
something that we implement explicitly in Data ONTAP. The
functionality to support the command is something we do have to test
however, and any changes or additions to support it are done in new
releases. This is why you are able to get some functionality out of the
command in 8.1.2. However, it's not fully certified in Data ONTAP until
8.1.3."

So, using UNMAP might work in ONTAP versions before 8.1.3, but it does
not have an official approval stamp by NetApp.

Also see:
- https://kb.netapp.com/support/index?page=content&id=3013806
- https://kb.netapp.com/support/index?page=content&id=3013991

So, although this functionality is reported as 'working' in many cases,
it seems that NetApp itself has fixed some bugs in 8.1.3 and only then
officially started supporting using it.

This is an interesting one. The way UNMAP was made to perform quite well in the ONTAP 8.1.3 release seems to have been achieved by simply ignoring most of the requests. That fixes the terrible performance impact that was seen with iSCSI UNMAP in previous versions.

During my conversation with NetApp, I was assured that this functionality should not be used as a means of 'securely getting rid of old data', because the actual discarding of the data is only a voluntary / 'advisory' operation, as per the SBC-3 spec.

iSCSI UNMAP, fstrim, lvm issue_discards, discard mount options, etc. can all be used to hopefully reclaim some space in a lun, but without any promises made.

So, even in the NetApp case with lbprz=1 (as seen in sg_readcap -l /dev/sgX), this means that you will only get zeroes back for blocks that have *actually* been unmapped from the lun, not for every block that was ACKed as successfully handled in the response when asking the filer to unmap them.
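The zeroes-after-unmap flag in question is the LBPRZ bit that sg_readcap -l reports. Per SBC-3 it lives in byte 14 of the READ CAPACITY(16) parameter data, next to LBPME ("logical block provisioning management enabled"). A parsing sketch with a made-up response buffer (illustrative only, not captured from a filer and not sg3_utils source):

```python
def parse_readcap16_lbp(data: bytes) -> tuple[bool, bool]:
    """Extract the thin-provisioning bits from READ CAPACITY(16) data.

    Per SBC-3, byte 14 carries LBPME (bit 7, unmapping supported) and
    LBPRZ (bit 6, unmapped blocks read back as zeroes).
    """
    lbpme = bool(data[14] & 0x80)
    lbprz = bool(data[14] & 0x40)
    return lbpme, lbprz

# Hypothetical 32-byte response with both bits set (byte 14 = 0xC0),
# matching what sg_readcap -l shows as lbpme=1, lbprz=1:
buf = bytearray(32)
buf[14] = 0xC0
print(parse_readcap16_lbp(bytes(buf)))  # (True, True)
```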

I've been doing tests on space reclamation efficiency with our current 8.1.4P1, and I see a very interesting difference between sending unmaps for blocks that have previously been overwritten with zeroes, as opposed to blocks that have been overwritten with random data. In the case of random data, when I do an lvremove (with issue_discards enabled), I see almost no space reclamation at all. When issuing it on space that I just wrote all-zeroes to, I see a space reclamation of almost 100%, if not the full 100%. Both operations take the same amount of time to complete, as seen from the iscsi initiator.

So, apparently, ONTAP keeps track of data blocks that have been overwritten with all-zeroes, and it can easily drop those when asked to.

> Right now, our idea is to do some upgrades from random 8.1 versions we
> run to the latest maintenance version of ONTAP 8.1 (8.1.4P1), which is
> not a bad idea in any case.

Well, actually it was...

It turns out that, since the upgrade to ONTAP 8.1.4P1, we have this
issue with UNMAP on all our NetApp filers, instead of only one.

This is also an interesting situation. At first sight, it seemed we had the same problem everywhere... When looking into it in more detail, it turned out that the MEDIUM ERROR, or no error at all, had given way to an "Invalid field in parameter list" error.

The reason for this new error was that the "Maximum unmap LBA count" (as seen in sg_inq -p 0xb0 /dev/sgX) had been changed from 16k to 8k. During the ONTAP upgrade, which was done non-disruptively using controller failover, the linux boxes attached to the storage system reconnected their multipath iscsi sessions. While doing so, the new value of "Maximum unmap LBA count" was not read again, so the linux kernel kept sending UNMAP requests covering >8k LBAs. The result is "Invalid field in parameter list", which is the correct error as per the spec, although it would have saved me hours if it had been called "Invalid value" instead of "Invalid field", since the LBA count as such is a valid field... :o)
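The limit that went stale is the "Maximum unmap LBA count" field of the Block Limits VPD page (0xb0). An initiator is supposed to split a large discard into UNMAP descriptors no bigger than that limit. A sketch of the idea (illustrative, not the kernel's actual code), showing how a cached 16k limit produces requests a target now limited to 8k will reject:

```python
def split_unmap(lba, nblocks, max_unmap_lba_count):
    """Split one discard into (lba, nblocks) UNMAP chunks, each honouring
    the target's Maximum unmap LBA count (Block Limits VPD page 0xb0)."""
    chunks = []
    while nblocks > 0:
        n = min(nblocks, max_unmap_lba_count)
        chunks.append((lba, n))
        lba += n
        nblocks -= n
    return chunks

# With the stale cached limit, a 16384-block discard goes out as one
# oversized descriptor, which the target (now limited to 8192 per
# descriptor) rejects with "Invalid field in parameter list":
print(split_unmap(0, 16384, 16384))  # [(0, 16384)]
# With the re-read limit it would have been split and accepted:
print(split_unmap(0, 16384, 8192))   # [(0, 8192), (8192, 8192)]
```

Forcing the initiator to re-read the Block Limits VPD page restores the correct limit, which is effectively what the reboot cycle below did.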

So, it took a while to find that... A reboot cycle of all connected servers (yay, xen with live migration, no impact on customer systems!) fixed this issue, by reconnecting iscsi.

It would probably be a nice thing to fix the linux kernel to re-read some settings on iscsi-reconnect.

So, the major takeaway here is... If you're using NetApp iSCSI, don't rely on UNMAP as a means to get rid of the data in a lun, i.e. to be sure you won't read it back later from the same location.

Overwriting with zeroes before doing UNMAP seems to help a lot. And if you're using encryption on top of it... have fun finding data to write that translates to all-zeroes after encrypting it. :o)

Closing this one now. It's not really a Debian bug, as discussed earlier, but whoever runs into this might find this bug report thread, and it might be of use.

--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | [email protected] | www.mendix.com

--- End Message ---
