Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"

Hans van Kranenburg Sun, 22 Jun 2014 16:33:26 -0700

Hi,

On 06/22/2014 10:19 AM, Martin George wrote:


> So firstly, the question arises why your kernel marked all paths as
> failed when you hit this error. This actually resembles the old Linux
> behavior where a device error such as a MEDIUM ERROR gets retried on
> all paths available to the LUN, all of which result in the same error,
> and hence all paths get marked as failed. This was addressed with the
> upstream patch at
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=63583cca745f440167bf27877182dc13e19d4bcf,
> where more fine-grained error handling is now available.

Yes, it retries on all paths. The kernel version used in my case (3.2.57) already includes the changes mentioned above.

> With this, device errors such as MEDIUM ERROR are no longer retried,
> since it treats such errors as permanent errors. That makes me suspect
> your kernel is missing some of the key patches from the upstream kernel
> related to this error handling. And given that UNMAP is also a
> relatively new feature that underwent several upstream revisions to
> reach its current stable state, it would be prudent for you to check
> whether your kernel is up to date with its SCSI and UNMAP handling.

Currently I'm not able to reproduce the error I see in production (getting this iSCSI response), even after re-creating a very similar test setup using the same hardware and software that is failing on me, which is a bit confusing. :|

So, even worse, I'm not yet convinced that the actual problem is a Linux kernel problem. Why is my NetApp filer sending me a MEDIUM ERROR "Incompatible medium installed" in the production case anyway?

The latest kernel code only prevents (afaics) the retry in a small subset of cases, which does not include ASC 0x30, INCOMPATIBLE MEDIUM INSTALLED:


  case MEDIUM_ERROR:
      if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
          sshdr.asc == 0x13 || /* AMNF DATA FIELD */
          sshdr.asc == 0x14) { /* RECORD NOT FOUND */
          set_host_byte(scmd, DID_MEDIUM_ERROR);
          return SUCCESS;
      }
      return NEEDS_RETRY;
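To make the point about ASC 0x30 concrete, here is a small user-space sketch (not kernel code) that pulls the sense key, ASC and ASCQ out of fixed-format (0x70/0x71) or descriptor-format (0x72/0x73) sense data and then applies only the MEDIUM_ERROR branch quoted above. The names sense_hdr, parse_sense, classify_medium_error and enum disposition are hypothetical, loosely modeled on the kernel's scsi_normalize_sense(); only the ASC values 0x11, 0x13 and 0x14 come from the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sense key value from SPC; 0x3 is MEDIUM ERROR. */
#define SENSE_KEY_MEDIUM_ERROR 0x03

/* Hypothetical result type mirroring the outcomes of the quoted branch. */
enum disposition {
    DISP_NOT_MEDIUM_ERROR, /* other sense key, not handled by this sketch */
    DISP_NOT_RETRIED,      /* kernel: set DID_MEDIUM_ERROR, return SUCCESS */
    DISP_NEEDS_RETRY       /* kernel: return NEEDS_RETRY */
};

struct sense_hdr {
    uint8_t sense_key;
    uint8_t asc;
    uint8_t ascq;
};

/* Extract sense key, ASC and ASCQ from fixed or descriptor format sense
 * data. Returns 0 on success, -1 if too short or of unknown format. */
int parse_sense(const uint8_t *buf, size_t len, struct sense_hdr *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 1)
        return -1;

    uint8_t code = buf[0] & 0x7f; /* bit 7 is the VALID bit in fixed format */
    if (code == 0x70 || code == 0x71) {
        if (len < 14)
            return -1;
        out->sense_key = buf[2] & 0x0f;
        out->asc = buf[12];
        out->ascq = buf[13];
        return 0;
    }
    if (code == 0x72 || code == 0x73) {
        if (len < 4)
            return -1;
        out->sense_key = buf[1] & 0x0f;
        out->asc = buf[2];
        out->ascq = buf[3];
        return 0;
    }
    return -1; /* unknown response code */
}

/* Apply only the MEDIUM_ERROR branch quoted above. */
enum disposition classify_medium_error(const struct sense_hdr *s)
{
    if (s->sense_key != SENSE_KEY_MEDIUM_ERROR)
        return DISP_NOT_MEDIUM_ERROR;
    if (s->asc == 0x11 || /* UNRECOVERED READ ERR */
        s->asc == 0x13 || /* AMNF DATA FIELD */
        s->asc == 0x14)   /* RECORD NOT FOUND */
        return DISP_NOT_RETRIED;
    return DISP_NEEDS_RETRY; /* includes ASC 0x30, INCOMPATIBLE MEDIUM INSTALLED */
}
```

Fed a fixed-format buffer with sense key 0x3 and ASC/ASCQ 0x30/0x00, classify_medium_error() returns DISP_NEEDS_RETRY, which fits the retry-on-every-path behaviour described above.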

> That said, it is indeed strange that you hit a MEDIUM ERROR in the first
> place when using UNMAP. As described above, that's a device error. So
> does this fail even for other commands, such as a regular write (you
> could try this with dd) or even a simple TUR command (say, using
> sg_turs -v /dev/mpathX)?


# sg_turs -v /dev/mapper/mpath_scylla0
    test unit ready cdb: 00 00 00 00 00 00

UNMAP is the only command that causes the failure. As long as I do not cause an UNMAP to be sent (which happens when running mkfs.ext4 without -E nodiscard, running mkfs.btrfs without disabling discard, or issuing an fstrim command), this multipathed LVM on iSCSI handles millions of iSCSI write and read ops every day in production just fine. If an UNMAP is sent, it makes all iSCSI storage on a physical server hang, as seen before.
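Since UNMAP is the one command that triggers the failure, it may help when reading captures to know what the initiator actually puts on the wire. Below is a minimal sketch, based on my reading of the SBC-3 layout, that builds the 10-byte UNMAP CDB (opcode 0x42) and its parameter list: an 8-byte header followed by one 16-byte block descriptor per range. struct unmap_range and build_unmap are hypothetical names; the kernel's sd driver builds this request itself and nothing here calls into it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNMAP_OPCODE 0x42
#define UNMAP_CDB_LEN 10
#define UNMAP_HDR_LEN 8
#define UNMAP_DESC_LEN 16

/* Hypothetical description of one range to deallocate. */
struct unmap_range {
    uint64_t lba;     /* first logical block */
    uint32_t nblocks; /* number of logical blocks */
};

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void put_be32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (24 - 8 * i));
}

static void put_be64(uint8_t *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Fill cdb and param (capacity param_cap) for n ranges.
 * Returns the parameter list length, or 0 if it does not fit. */
size_t build_unmap(uint8_t cdb[UNMAP_CDB_LEN], uint8_t *param, size_t param_cap,
                   const struct unmap_range *r, size_t n)
{
    /* PARAMETER LIST LENGTH is 16 bits, so at most 4095 descriptors. */
    if (n > (0xffff - UNMAP_HDR_LEN) / UNMAP_DESC_LEN)
        return 0;
    size_t plen = UNMAP_HDR_LEN + UNMAP_DESC_LEN * n;
    if (plen > param_cap)
        return 0;

    memset(cdb, 0, UNMAP_CDB_LEN);
    cdb[0] = UNMAP_OPCODE;             /* ANCHOR and GROUP NUMBER left 0 */
    put_be16(&cdb[7], (uint16_t)plen); /* PARAMETER LIST LENGTH */

    memset(param, 0, plen);
    put_be16(&param[0], (uint16_t)(plen - 2));           /* UNMAP DATA LENGTH */
    put_be16(&param[2], (uint16_t)(UNMAP_DESC_LEN * n)); /* BLOCK DESCRIPTOR DATA LENGTH */
    for (size_t i = 0; i < n; i++) {
        uint8_t *d = &param[UNMAP_HDR_LEN + UNMAP_DESC_LEN * i];
        put_be64(&d[0], r[i].lba);
        put_be32(&d[8], r[i].nblocks); /* bytes 12-15 stay reserved (0) */
    }
    return plen;
}
```

For a single range this gives a 24-byte parameter list (8-byte header plus one descriptor), so CDB bytes 7-8 read 00 18.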

Today I played around a bit in my test environment (where the failure does not occur yet), capturing the iSCSI traffic with tcpdump, viewing it afterwards in Wireshark, and reading about the SCSI specs. That's a very interesting way to learn more about what I'm talking about here. :-)

If no obvious way can be found to trigger the same error in the test environment, I think I'm going to propose triggering it again while having the test physical server attached to the production LUNs. From the past occurrence, I know that the only thing that breaks is the storage connection on the physical server that executes the UNMAP. It's still not the most reassuring choice, but it is a calculated risk.

If that's possible, I can take a couple of tcpdumps of the iSCSI traffic and blktrace dumps to capture what's going on, and post them here. Doing so will prove whether the SCSI error was actually sent by the NetApp device or not.
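For the tcpdump part, the thing to look for is an iSCSI SCSI Response PDU (opcode 0x21) with Response "Command Completed at Target" and status CHECK CONDITION; per RFC 3720/7143 its data segment starts with a 2-byte SenseLength followed by the sense data itself. The sketch below (iscsi_sense_data is a hypothetical name) extracts those sense bytes from one already-reassembled PDU, assuming header digests are disabled, so they can then be checked for MEDIUM ERROR with ASC 0x30.

```c
#include <stddef.h>
#include <stdint.h>

#define ISCSI_BHS_LEN 48
#define ISCSI_OP_SCSI_RESPONSE 0x21
#define ISCSI_RESP_COMPLETED 0x00 /* Response: Command Completed at Target */
#define SCSI_STATUS_CHECK_CONDITION 0x02

/* Return a pointer to the sense data inside one complete iSCSI PDU and store
 * its length in *sense_len, or NULL if the PDU is not a SCSI Response carrying
 * CHECK CONDITION sense data. Assumes header digests are disabled. */
const uint8_t *iscsi_sense_data(const uint8_t *pdu, size_t len, size_t *sense_len)
{
    if (len < ISCSI_BHS_LEN)
        return NULL;
    if ((pdu[0] & 0x3f) != ISCSI_OP_SCSI_RESPONSE)
        return NULL;
    /* Status is only meaningful when the command completed at the target. */
    if (pdu[2] != ISCSI_RESP_COMPLETED || pdu[3] != SCSI_STATUS_CHECK_CONDITION)
        return NULL;

    size_t ahs_len = (size_t)pdu[4] * 4; /* TotalAHSLength is in 4-byte words */
    size_t data_len = ((size_t)pdu[5] << 16) | ((size_t)pdu[6] << 8) | pdu[7];
    size_t off = ISCSI_BHS_LEN + ahs_len;
    if (data_len < 2 || off > len || data_len > len - off)
        return NULL;

    size_t slen = ((size_t)pdu[off] << 8) | pdu[off + 1]; /* SenseLength */
    if (slen == 0 || slen > data_len - 2)
        return NULL;

    *sense_len = slen;
    return pdu + off + 2;
}
```

The returned bytes are plain SPC sense data, so for fixed format the low nibble of byte 2 is the sense key and bytes 12-13 are ASC/ASCQ.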


--
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | www.mendix.com

