Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Bernd Schubert

On 07/30/2013 11:20 PM, Nix wrote:

On 30 Jul 2013, Bernd Schubert told this:


On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md layer.


I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.



I haven't checked other cases yet, but mkfs.ext4 does issue WRITE SAME, and
with lazy init it will also happen after mounting the file system, while lazy
init is running (inode zeroing).
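
Whether the block layer currently considers WRITE SAME usable on a disk can be checked from sysfs. The sketch below is a minimal Python illustration, assuming the kernel exposes queue/write_same_max_bytes for the disk (with 0 meaning WRITE SAME is not in use). The helper names and the sysfs_root parameter are made up for this example and do not belong to any tool mentioned in this thread.

```python
from pathlib import Path
from typing import Optional


def write_same_limit(disk: str, sysfs_root: str = "/sys") -> Optional[int]:
    """Return the WRITE SAME limit in bytes for a disk such as 'sdd'.

    Returns None when the attribute is missing or unreadable as an integer.
    The sysfs_root parameter exists only so the function can be tried
    against a fake directory tree.
    """
    attr = Path(sysfs_root) / "block" / disk / "queue" / "write_same_max_bytes"
    try:
        return int(attr.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def write_same_enabled(disk: str, sysfs_root: str = "/sys") -> bool:
    """True only when a non-zero WRITE SAME limit is advertised."""
    return bool(write_same_limit(disk, sysfs_root))


if __name__ == "__main__":
    import sys

    # Usage: python3 ws_check.py sda sdb ...
    for name in sys.argv[1:]:
        print(name, write_same_limit(name))
```

A limit of 0 on every disk behind the controller would suggest that file systems cannot be sending WRITE SAME to it, whatever mkfs.ext4 or lazy init would like to do.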



Cheers,
Bernd
--
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Nix
On 1 Aug 2013, Bernd Schubert verbalised:

 On 07/30/2013 11:20 PM, Nix wrote:
 On 30 Jul 2013, Bernd Schubert told this:

 On 07/30/2013 02:56 AM, Nix wrote:
 On 30 Jul 2013, Douglas Gilbert outgrape:

 Please supply the information that Martin Petersen asked
 for.

 Did it in private IRC (the advantage of working for the same division of
 the same company!)

 I didn't realise the original fix was actually implemented to allow
 Bernd, with a different Areca controller, to boot... obviously, in that
 situation, reversion is wrong, since that would just replace one won't-
 boot situation with another.

 Unless there is a very simple fix, the commit should be reverted, imho. It
 would then be better to remove write-same support from the md layer.

 I'm not using md on that machine, just LVM. Our suspicion is that ext4
 is doing a WRITE SAME for some reason.

 I haven't checked other cases yet, but mkfs.ext4 does issue WRITE SAME, and
 with lazy init it will also happen after mounting the file system, while
 lazy init is running (inode zeroing).

Well, it'll happen the first few times you mount the fs. If your fs is
years old (as mine are) the inode tables will probably have been
initialized by now!

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-08-01 Thread Bernd Schubert

On 08/01/2013 06:04 PM, Nix wrote:

On 1 Aug 2013, Bernd Schubert verbalised:


On 07/30/2013 11:20 PM, Nix wrote:

On 30 Jul 2013, Bernd Schubert told this:


On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md layer.


I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.


I haven't checked other cases yet, but mkfs.ext4 does issue WRITE SAME, and
with lazy init it will also happen after mounting the file system, while
lazy init is running (inode zeroing).


Well, it'll happen the first few times you mount the fs. If your fs is
years old (as mine are) the inode tables will probably have been
initialized by now!



I frequently run tests with millions of files, and reformatting is
way faster than deleting all of these files.



Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-31 Thread Bernd Schubert
On 07/31/2013 05:15 AM, Martin K. Petersen wrote:
 Bernd == Bernd Schubert bernd.schub...@fastmail.fm writes:
 
 Bernd,
 
 Product revision level: R001 
 
 It's clearly not verbatim passthrough...
 
 Bernd Besides the firmware, the difference might be that I'm exporting
 Bernd single disks without any areca-raidset in between.  I can try to
 Bernd confirm that tomorrow, I just need the system as it is till
 Bernd tomorrow noon.
 
 That would be a great data point. I don't have any Areca boards.
 

Just tested it, areca-raidset does not make a difference, but the
firmware version does. After downgrading to 1.46 I have the same issue.

It is getting a bit late for me, but as this is a pure development system,
which is also booted over NFS, I can investigate it tomorrow.


Cheers,
Bernd


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Bernd Schubert

On 07/30/2013 01:34 AM, Martin K. Petersen wrote:

Nix == Nix  n...@esperi.org.uk writes:


Bernd,

Nix I can now confirm that reverting this commit causes this problem to
Nix go away, and my machine boots fine again.

Can you please send me the output of sg_inq with your 1.49 firmware?

I made a tweak that allowed Nix to boot but we're trying to find a good
blacklist trigger. And that's tricky given that Areca allows you to
manually specify the SCSI model string for each volume...



Sorry it got a bit late today.

Here it is.


(wheezy)fslab1:~# sg_inq -v /dev/sdc
inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
inquiry cdb: 12 00 00 00 60 00
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
  [RelAdr=0]  WBus16=1  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x3  QAS=0  IUS=0]
length=96 (0x60)   Peripheral device type: disk
 Vendor identification: Hitachi
 Product identification: HDS724040KLSA80
 Product revision level: R001
inquiry cdb: 12 01 00 00 fc 00
inquiry cdb: 12 01 80 00 fc 00
 Unit serial number: KRFS2CRAHXJZVD
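
The four "inquiry cdb" lines in the sg_inq -v output above are 6-byte INQUIRY commands. As a minimal sketch (assuming the SPC-3 layout: opcode 0x12, the EVPD bit, a page code, a 16-bit allocation length and a control byte), they can be reproduced in Python as follows. The function names are made up for illustration.

```python
from typing import Optional


def inquiry_cdb(alloc_len: int, vpd_page: Optional[int] = None) -> bytes:
    """Build a 6-byte INQUIRY CDB.

    With vpd_page=None a standard INQUIRY is requested; otherwise the
    EVPD bit is set and vpd_page selects the Vital Product Data page.
    """
    if not 0 <= alloc_len <= 0xFFFF:
        raise ValueError("allocation length must fit in 16 bits")
    if vpd_page is not None and not 0 <= vpd_page <= 0xFF:
        raise ValueError("VPD page code must fit in 8 bits")
    evpd = 0 if vpd_page is None else 1
    page = 0 if vpd_page is None else vpd_page
    return bytes([0x12, evpd, page, alloc_len >> 8, alloc_len & 0xFF, 0x00])


def hexdump(cdb: bytes) -> str:
    """Format a CDB the way sg3_utils prints it in verbose mode."""
    return " ".join(f"{b:02x}" for b in cdb)


if __name__ == "__main__":
    print(hexdump(inquiry_cdb(0x24)))        # short standard INQUIRY
    print(hexdump(inquiry_cdb(0x60)))        # standard INQUIRY, 96 bytes
    print(hexdump(inquiry_cdb(0xFC, 0x00)))  # supported VPD pages
    print(hexdump(inquiry_cdb(0xFC, 0x80)))  # unit serial number page
```

Read this way, the trace shows a 36-byte probe, a 96-byte re-read of the standard data, the list of supported VPD pages (0x00), and the unit serial number page (0x80).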


Besides the firmware, the difference might be that I'm exporting single 
disks without any areca-raidset in between.
I can try to confirm that tomorrow, I just need the system as it is till 
tomorrow noon.



Cheers,
Bernd


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Bernd Schubert

On 07/30/2013 02:56 AM, Nix wrote:

On 30 Jul 2013, Douglas Gilbert outgrape:


Please supply the information that Martin Petersen asked
for.


Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.


Unless there is a very simple fix, the commit should be reverted, imho. It
would then be better to remove write-same support from the md layer.



Cheers,
Bernd


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Nix
On 30 Jul 2013, Bernd Schubert told this:

 On 07/30/2013 02:56 AM, Nix wrote:
 On 30 Jul 2013, Douglas Gilbert outgrape:

 Please supply the information that Martin Petersen asked
 for.

 Did it in private IRC (the advantage of working for the same division of
 the same company!)

 I didn't realise the original fix was actually implemented to allow
 Bernd, with a different Areca controller, to boot... obviously, in that
 situation, reversion is wrong, since that would just replace one won't-
 boot situation with another.

 Unless there is a very simple fix, the commit should be reverted, imho. It
 would then be better to remove write-same support from the md layer.

I'm not using md on that machine, just LVM. Our suspicion is that ext4
is doing a WRITE SAME for some reason.

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Nick Alcock
On 30 Jul 2013, Bernd Schubert told this:

 On 07/30/2013 01:34 AM, Martin K. Petersen wrote:
 (wheezy)fslab1:~# sg_inq -v /dev/sdc
 inquiry cdb: 12 00 00 00 24 00
 standard INQUIRY:
 inquiry cdb: 12 00 00 00 60 00
   PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
   [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=2
   SCCS=0  ACC=0  TPGS=0  3PC=0  Protect=0  BQue=0
   EncServ=0  MultiP=0  [MChngr=0]  [ACKREQQ=0]  Addr16=1
   [RelAdr=0]  WBus16=1  Sync=0  Linked=0  [TranDis=0]  CmdQue=1
   [SPI: Clocking=0x3  QAS=0  IUS=0]
 length=96 (0x60)   Peripheral device type: disk
  Vendor identification: Hitachi
  Product identification: HDS724040KLSA80
  Product revision level: R001
 inquiry cdb: 12 01 00 00 fc 00
 inquiry cdb: 12 01 80 00 fc 00
  Unit serial number: KRFS2CRAHXJZVD

 Besides the firmware, the difference might be that I'm exporting single disks 
 without any areca-raidset in between.
 I can try to confirm that tomorrow, I just need the system as it is till 
 tomorrow noon.

Aaah. Yeah, it looks like in JBOD mode it's just passing things straight
on to the disk: that vendor ID is a dead giveaway. For all I know my
earlier firmware does the same, but for obvious reasons I can't really
test that! Quite possibly it's passing *everything* on to the disk,
including all SCSI commands, in which case we don't actually know that
your Areca controller supports the VPD page we thought it did: quite
possibly only this underlying disk does.
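
The vendor, product and revision strings that sg_inq prints come from fixed offsets in the standard INQUIRY response (bytes 8-15, 16-31 and 32-35). A minimal Python sketch of that decoding follows. The function names are invented for this example, and the sample buffer is synthetic: it is rebuilt from the strings shown in the output above, not captured from the device.

```python
def parse_standard_inquiry(data: bytes) -> dict:
    """Decode the identification fields of a standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")

    def text(lo: int, hi: int) -> str:
        # Fields are space-padded ASCII.
        return data[lo:hi].decode("ascii", "replace").strip()

    return {
        "device_type": data[0] & 0x1F,  # 0 = disk
        "version": data[2],             # 0x05 = SPC-3
        "vendor": text(8, 16),
        "product": text(16, 32),
        "revision": text(32, 36),
    }


def sample_inquiry() -> bytes:
    """Synthetic 36-byte response using the strings from the thread."""
    header = bytes([0x00, 0x00, 0x05, 0x02, 31, 0x00, 0x00, 0x00])
    return (header
            + b"Hitachi".ljust(8)
            + b"HDS724040KLSA80".ljust(16)
            + b"R001")


if __name__ == "__main__":
    print(parse_standard_inquiry(sample_inquiry()))
```

Because these strings are whatever the responding layer chooses to put there, they identify the disk or the controller only as far as that layer is honest about it.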

You can get a degree of info on the underlying disks in the array even
if it's in RAID mode -- smartctl does it, for instance -- but it takes
Areca-specific code and chattering to the sg devices directly. I bet
that in JBOD mode, the sg device is the only exposure the controller has
to the world, and *all* the /dev/sd* devices are just passthroughs.

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Martin K. Petersen
 Bernd == Bernd Schubert bernd.schub...@fastmail.fm writes:

Bernd,

 Product revision level: R001 

It's clearly not verbatim passthrough...

Bernd Besides the firmware, the difference might be that I'm exporting
Bernd single disks without any areca-raidset in between.  I can try to
Bernd confirm that tomorrow, I just need the system as it is till
Bernd tomorrow noon.

That would be a great data point. I don't have any Areca boards.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-30 Thread Martin K. Petersen
 Nick == Nick Alcock nick.alc...@esperi.org.uk writes:

Nick in which case we don't actually know that your Areca controller
Nick supports the VPD page we thought it did: quite possibly only this
Nick underlying disk does.

The ATA Information VPD page is created by the SCSI-ATA Translation
layer. The controller firmware in this case.
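
To illustrate the point: the ATA Information VPD page (0x89) carries identification strings for the translation layer itself, separate from the disk's own IDENTIFY DEVICE data. The Python sketch below assumes the SAT layout (SAT vendor at bytes 8-15, SAT product at 16-31, SAT revision at 32-35, ATA command code at byte 56, IDENTIFY data from byte 60). The function names and all sample values are hypothetical and not taken from any Areca controller.

```python
def parse_ata_info_vpd(page: bytes) -> dict:
    """Decode the header of an ATA Information VPD page (page code 0x89)."""
    if len(page) < 60:
        raise ValueError("page too short for an ATA Information header")
    if page[1] != 0x89:
        raise ValueError("not an ATA Information VPD page")

    def text(lo: int, hi: int) -> str:
        return page[lo:hi].decode("ascii", "replace").strip()

    return {
        "page_length": int.from_bytes(page[2:4], "big"),
        "sat_vendor": text(8, 16),     # identifies the translation layer
        "sat_product": text(16, 32),
        "sat_revision": text(32, 36),
        "command_code": page[56],      # 0xEC = IDENTIFY DEVICE
    }


def sample_page() -> bytes:
    """Synthetic page with made-up SAT strings, for illustration only."""
    page = bytearray(572)
    page[1] = 0x89
    page[2:4] = (0x238).to_bytes(2, "big")
    page[8:16] = b"EXAMPLE".ljust(8)
    page[16:32] = b"SAT LAYER".ljust(16)
    page[32:36] = b"0001"
    page[56] = 0xEC
    return bytes(page)


if __name__ == "__main__":
    print(parse_ata_info_vpd(sample_page()))
```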

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Bernd Schubert

Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:

My server's ARC-1210 has been working fine for years, but when I
upgraded from 3.10.1, it started failing:

Instead of

[0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
[0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
  Driver Version 1.20.00.15 2010/08/05
[...]

[4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
[4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.118081]  sdd: sdd1
[4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
[4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
[4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk

I now see (timestamps and some of the right edge chopped off because not
captured on my camera, no netconsole as this machine has all my storage
and is my loghost, and with this bug it can't get at any of that
storage).

sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
  sdd: sdd1
sd 7:0:0:1: [sdd] No Caching mode page present
sd 7:0:0:1: [sdd] Assuming drive cache: write through
sd 7:0:0:1: [sdd] Attached SCSI removable disk
arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset 
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
arcmsr: scsi  bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
  not that the machine would be much use without its RAID array even
  if this loop terminated at some point, so I only gave it a couple
  of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).
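
When a console capture like the one above is available as text, the abort/reset loop can be summarised mechanically. This is a minimal Python sketch that assumes the message wording is exactly as quoted in this report; the function name and the returned keys are made up for illustration.

```python
import re

# Patterns follow the arcmsr messages quoted above; leading blanks are
# tolerated so that indented (quoted) captures also match.
RESET_RE = re.compile(r"^[ \t]*arcmsr\d*: executing hw bus reset", re.MULTILINE)
ABORT_RE = re.compile(
    r"^[ \t]*arcmsr\d*: abort device command of scsi id = (\d+) lun = (\d+)",
    re.MULTILINE,
)


def summarize_reset_loop(log_text: str) -> dict:
    """Count hardware bus resets and collect the (id, lun) pairs aborted."""
    aborted = {(int(i), int(l)) for i, l in ABORT_RE.findall(log_text)}
    return {
        "hw_bus_resets": len(RESET_RE.findall(log_text)),
        "aborted_luns": sorted(aborted),
    }


if __name__ == "__main__":
    import sys

    # Usage: python3 reset_loop.py < console-capture.txt
    print(summarize_reset_loop(sys.stdin.read()))
```

More than one hardware bus reset with the same LUNs aborted each time matches the "back to the top of the error messages again" behaviour described above.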

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen martin.peter...@oracle.com
Date:   Thu Jun 6 22:15:55 2013 -0400

 SCSI: sd: Update WRITE SAME heuristics

so my, admittedly largely baseless, suspicions currently fall there.


Obviously, at this point, this machine has no modules loaded (it has
almost none loaded even when fully operational)


I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
patch is only in 3.10.3, but not yet in 3.10.1. And I don't think this
commit can cause your issue at all: a failing heuristic would enable
WRITE SAME and would cause issues with linux-md, but nothing should
happen directly in the SCSI layer.

Which was your last working kernel version?


Thanks,
Bernd



Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix
On 29 Jul 2013, Bernd Schubert said:

 Hi Nick,

 On 07/29/2013 12:10 PM, Nick Alcock wrote:
 arcmsr0: abort device command of scsi id = 0 lun = 1
 arcmsr0: abort device command of scsi id = 0 lun = 0
 arcmsr: executing bus reset eh.num_resets=0, num_[...]

 arcmsr0: wait 'abort all outstanding command' timeout
 arcmsr0: executing hw bus reset 
 arcmsr0: waiting for hw bus reset return, retry=0
 arcmsr0: waiting for hw bus reset return, retry=1
 Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
 arcmsr: scsi  bus reset eh returns with success
 [and back to the top of the error messages again, apparently forever,
   not that the machine would be much use without its RAID array even
   if this loop terminated at some point, so I only gave it a couple
   of minutes]

 The failure happens precisely at the moment we transition to early
 userspace, so presumably userspace I/O is failing (or something related
 to raw device access, perhaps, since the first thing it does is a
 vgscan).

 I haven't bisected yet (sorry, I have work to do which means this
 machine must be running right now), but nothing has changed in the
 arcmsr controller, nor in SCSI-land excepting

 commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
 Author: Martin K. Petersen martin.peter...@oracle.com
 Date:   Thu Jun 6 22:15:55 2013 -0400
[...]
 Obviously, at this point, this machine has no modules loaded (it has
 almost none loaded even when fully operational)

 I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
 patch is only in 3.10.3, but not yet in 3.10.1.

... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
3.10.2.)

 And I don't think this
 commit can cause your issue at all: a failing heuristic would enable
 WRITE SAME and would cause issues with linux-md, but nothing should
 happen directly in the SCSI layer. Which was your last
 working kernel version?

3.10.1. :)

No changes to arcmsr between those versions... I suspect I'll have to
bisect, which will be a complete pig because every failure means a hard
powerdown of this box. Always-on servers rarely appreciate hard
powerdowns :(

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Bernd Schubert

On 07/29/2013 03:05 PM, Nix wrote:

On 29 Jul 2013, Bernd Schubert said:


Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:

arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset 
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
arcmsr: scsi  bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
   not that the machine would be much use without its RAID array even
   if this loop terminated at some point, so I only gave it a couple
   of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen martin.peter...@oracle.com
Date:   Thu Jun 6 22:15:55 2013 -0400

[...]

Obviously, at this point, this machine has no modules loaded (it has
almost none loaded even when fully operational)


I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
patch is only in 3.10.3, but not yet in 3.10.1.


... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
3.10.2.)


Hmm, indeed that points to this commit. I just don't see what could fail 
there.


Could you try to run these commands with 3.10.1?

# check if reporting opcodes works
# sg_opcodes -v -n /dev/sdX

# check ATA information page
# sg_vpd --page=0x89 /dev/sdX
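
The two probes above can be read as inputs to a simple decision. The Python sketch below models it under an assumption that is not confirmed anywhere in this thread: that the updated heuristic turns WRITE SAME off when REPORT SUPPORTED OPERATION CODES is unsupported and the device also exposes an ATA Information VPD page (i.e. it sits behind a SCSI-ATA translation layer). The function and parameter names are invented for illustration; this is not the kernel code.

```python
def heuristic_disables_write_same(report_opcodes_supported: bool,
                                  has_ata_info_vpd: bool) -> bool:
    """Model of the assumed WRITE SAME heuristic (hypothetical).

    report_opcodes_supported: result of the sg_opcodes probe.
    has_ata_info_vpd: result of the sg_vpd --page=0x89 probe.
    """
    if report_opcodes_supported:
        # The device can be asked directly which WRITE SAME variants
        # it implements, so no guessing is needed.
        return False
    # No opcode reporting: an ATA Information page marks the device as
    # SAT-translated, where WRITE SAME is assumed to be unsafe.
    return has_ata_info_vpd


if __name__ == "__main__":
    for rsoc in (True, False):
        for ata in (True, False):
            print(rsoc, ata, heuristic_disables_write_same(rsoc, ata))
```

Under this model, only devices that fail the first probe ever get the second one, which is why both outputs are worth collecting from the same kernel.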




 And I don't think this
commit can cause your issue at all: a failing heuristic would enable
WRITE SAME and would cause issues with linux-md, but nothing should
happen directly in the SCSI layer. Which was your last
working kernel version?


3.10.1. :)


Whoops, sorry, I missed that in your first sentence.



No changes to arcmsr between those versions... I suspect I'll have to
bisect, which will be a complete pig because every failure means a hard
powerdown of this box. Always-on servers rarely appreciate hard
powerdowns :(



Maybe just revert this commit? Some SCSI logging would be helpful to see
which command actually fails. I guess you don't have a serial console?



Thanks,
Bernd


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Martin K. Petersen
 Nick == Nick Alcock n...@esperi.org.uk writes:

Nick My server's ARC-1210 has been working fine for years, but when I
Nick upgraded from 3.10.1, it started failing:

Nick [ 0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
Nick [ 0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
Nick  Driver Version 1.20.00.15 2010/08/05
Nick [...]

Interesting. Please provide the output of:

# sg_inq /dev/sdd
# sg_vpd /dev/sdd
# sg_vpd -p ai /dev/sdd

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Martin K. Petersen
 Bernd == Bernd Schubert bernd.schub...@fastmail.fm writes:

Bernd I tested this patch with ARC-1260 and F/W V1.49, no issues. 

It could be due to the firmware version discrepancy.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix
On 29 Jul 2013, Bernd Schubert spake thusly:

 On 07/29/2013 03:05 PM, Nix wrote:
 On 29 Jul 2013, Bernd Schubert said:
 I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this
 patch is only in 3.10.3, but not yet in 3.10.1.

 ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried
 3.10.2.)

 Hmm, indeed that points to this commit. I just don't see what could fail 
 there.

 Could you try to run these commands with 3.10.1?

 # check if reporting opcodes works
 # sg_opcodes -v -n /dev/sdX

 # check ata information page
 # sg_vpd --page=0x89 /dev/sdX

If this might cause the same problem I think I'd better wait until work
is done for the day and the machine is no longer loaded, and can be
rebooted without harm...

 No changes to arcmsr between those versions... I suspect I'll have to
 bisect, which will be a complete pig because every failure means a hard
 powerdown of this box. Always-on servers rarely appreciate hard
 powerdowns :(


 Maybe just revert this commit? Some SCSI logging would be helpful to
 see which command actually fails. I guess you don't have a serial
 console?

Not at that stage, no! And, yes, a test revert of this one commit will
be the first thing I try this evening / tomorrow morning (depending on
system load).

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix
On 29 Jul 2013, Bernd Schubert spake thusly:
 Could you try to run these commands with 3.10.1?

 # check if reporting opcodes works
 # sg_opcodes -v -n /dev/sdX

spindle:/boot# sg_opcodes -v -n /dev/sda
inquiry cdb: 12 00 00 00 24 00
Report Supported Operation Codes cmd: a3 0c 00 00 00 00 00 00 20 00 00 00
Report Supported Operation Codes:  Fixed format, current;  Sense key: Illegal Request
 Additional sense: Invalid command operation code
  Info fld=0x0 [0]
  Sense Key Specific: Error in Command byte 3840
Report supported operation codes: operation not supported

(sdb is the same, obviously, since they are both separate RAID volumes
controlled by the same controller.)
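
The CDB and the sense data in the output above follow fixed layouts. Here is a minimal Python sketch, assuming the SPC layouts for REPORT SUPPORTED OPERATION CODES (opcode 0xa3, service action 0x0c, 32-bit allocation length) and for fixed-format sense data (sense key in byte 2, ASC/ASCQ in bytes 12 and 13). The function names and the sample sense buffer are made up for illustration; the buffer was not captured from the controller.

```python
SENSE_KEYS = {
    0x0: "No Sense",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
}


def report_opcodes_cdb(alloc_len: int, reporting_options: int = 0,
                       opcode: int = 0, service_action: int = 0) -> bytes:
    """Build the 12-byte REPORT SUPPORTED OPERATION CODES CDB."""
    if not 0 <= alloc_len <= 0xFFFFFFFF:
        raise ValueError("allocation length must fit in 32 bits")
    return (bytes([0xA3, 0x0C, reporting_options & 0x07, opcode & 0xFF])
            + service_action.to_bytes(2, "big")
            + alloc_len.to_bytes(4, "big")
            + bytes([0x00, 0x00]))  # reserved byte, control byte


def decode_fixed_sense(sense: bytes) -> dict:
    """Extract sense key, ASC and ASCQ from fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("fixed-format sense needs at least 14 bytes here")
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "other"),
        "asc": sense[12],
        "ascq": sense[13],
    }


if __name__ == "__main__":
    cdb = report_opcodes_cdb(0x2000)
    print(" ".join(f"{b:02x}" for b in cdb))
    # Synthetic buffer: Illegal Request, ASC/ASCQ 0x20/0x00
    # (invalid command operation code).
    sample = bytes([0x70, 0, 0x05, 0, 0, 0, 0, 0x0A,
                    0, 0, 0, 0, 0x20, 0x00, 0, 0, 0, 0])
    print(decode_fixed_sense(sample))
```

Illegal Request with ASC/ASCQ 0x20/0x00 is the standard way for a target to say it does not implement the opcode at all, which is what the last line of the sg_opcodes output reports.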

 # check ata information page
 # sg_vpd --page=0x89 /dev/sdX

spindle:/boot# sg_vpd --page=0x89 /dev/sda
ATA information VPD page:
fetching VPD page failed

Not very helpful, I know :(

I'll try rebooting into a kernel with that commit reverted next.

Areca controllers appear to be a bit weird: e.g. they needed special
support in smartctl...

 No changes to arcmsr between those versions... I suspect I'll have to
 bisect, which will be a complete pig because every failure means a hard
 powerdown of this box. Always-on servers rarely appreciate hard
 powerdowns :(

 Maybe just revert this commit? Some SCSI logging would be helpful to
 see which command actually fails. I guess you don't have a serial
 console?

I could set one up, in theory, but the problem is that all my machines
are rather dependent on my NFS-mounted $HOME. Guess where it's mounted
from... in any case, the machine has no serial port, so it would have to
be a usb-serial console, and we know exactly how reliable those are :/

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Martin K. Petersen
 Nix == Nix  n...@esperi.org.uk writes:

Nix spindle:/boot# sg_vpd --page=0x89 /dev/sda
Nix ATA information VPD page:
Nix fetching VPD page failed

Please add -v

I'll also need the output of:

# sg_vpd -vl


Nix I'll try rebooting into a kernel with that commit reverted next.

Doesn't matter as far as the sg commands are concerned...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix
On 29 Jul 2013, Bernd Schubert uttered the following:

 On 07/29/2013 03:05 PM, Nix wrote:
 On 29 Jul 2013, Bernd Schubert said:

 Hi Nick,

 On 07/29/2013 12:10 PM, Nick Alcock wrote:
 arcmsr0: abort device command of scsi id = 0 lun = 1
 arcmsr0: abort device command of scsi id = 0 lun = 0
 arcmsr: executing bus reset eh.num_resets=0, num_[...]

 arcmsr0: wait 'abort all outstanding command' timeout
 arcmsr0: executing hw bus reset 
 arcmsr0: waiting for hw bus reset return, retry=0
 arcmsr0: waiting for hw bus reset return, retry=1
 Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
 arcmsr: scsi  bus reset eh returns with success
 [and back to the top of the error messages again, apparently forever,
not that the machine would be much use without its RAID array even
if this loop terminated at some point, so I only gave it a couple
of minutes]

 The failure happens precisely at the moment we transition to early
 userspace, so presumably userspace I/O is failing (or something related
 to raw device access, perhaps, since the first thing it does is a
 vgscan).

 I haven't bisected yet (sorry, I have work to do which means this
 machine must be running right now), but nothing has changed in the
 arcmsr controller, nor in SCSI-land excepting

 commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
 Author: Martin K. Petersen martin.peter...@oracle.com
 Date:   Thu Jun 6 22:15:55 2013 -0400

I can now confirm that reverting this commit causes this problem to go
away, and my machine boots fine again.

Please revert (and figure out what is wrong so that 3.11 doesn't
implode in the same way? I'm happy to assist...)

(My apologies if a 'please revert' from someone bitten by a stable
regression isn't adequate reason to revert the thing: I've never been
quite sure who should report regressions in stable patches to Greg. It
should at least be *evidence*. So here's my "it crashed and now it
doesn't" evidence. :} )

-- 
NULL  (void)


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Martin K. Petersen
 Nix == Nix  n...@esperi.org.uk writes:

Bernd,

Nix I can now confirm that reverting this commit causes this problem to
Nix go away, and my machine boots fine again.

Can you please send me the output of sg_inq with your 1.49 firmware?

I made a tweak that allowed Nix to boot but we're trying to find a good
blacklist trigger. And that's tricky given that Areca allows you to
manually specify the SCSI model string for each volume...

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Douglas Gilbert

On 13-07-29 05:09 PM, Nix wrote:

On 29 Jul 2013, Bernd Schubert uttered the following:


On 07/29/2013 03:05 PM, Nix wrote:

On 29 Jul 2013, Bernd Schubert said:


Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:

arcmsr0: abort device command of scsi id = 0 lun = 1
arcmsr0: abort device command of scsi id = 0 lun = 0
arcmsr: executing bus reset eh.num_resets=0, num_[...]

arcmsr0: wait 'abort all outstanding command' timeout
arcmsr0: executing hw bus reset 
arcmsr0: waiting for hw bus reset return, retry=0
arcmsr0: waiting for hw bus reset return, retry=1
Areca RAID Controller0: F/W V1.46 2009-01-06  Model ARC-1210
arcmsr: scsi  bus reset eh returns with success
[and back to the top of the error messages again, apparently forever,
not that the machine would be much use without its RAID array even
if this loop terminated at some point, so I only gave it a couple
of minutes]

The failure happens precisely at the moment we transition to early
userspace, so presumably userspace I/O is failing (or something related
to raw device access, perhaps, since the first thing it does is a
vgscan).

I haven't bisected yet (sorry, I have work to do which means this
machine must be running right now), but nothing has changed in the
arcmsr controller, nor in SCSI-land excepting

commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
Author: Martin K. Petersen martin.peter...@oracle.com
Date:   Thu Jun 6 22:15:55 2013 -0400


I can now confirm that reverting this commit causes this problem to go
away, and my machine boots fine again.

Please revert (and figure out what is wrong so that 3.11 doesn't
implode in the same way? I'm happy to assist...)


Hi,
Please supply the information that Martin Petersen asked
for.

I just examined a more recent Areca SAS RAID controller
and would describe it as the SCSI device from hell. One solution
to this problem is to modify the arcmsr driver so it returns
a more consistent set of lies to the management SCSI commands that
Martin is asking about.

Doug Gilbert



Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition

2013-07-29 Thread Nix
On 30 Jul 2013, Douglas Gilbert outgrape:

 Please supply the information that Martin Petersen asked
 for.

Did it in private IRC (the advantage of working for the same division of
the same company!)

I didn't realise the original fix was actually implemented to allow
Bernd, with a different Areca controller, to boot... obviously, in that
situation, reversion is wrong, since that would just replace one won't-
boot situation with another.

It looks like a solution is possible that will let us boot *both* my
controller (with its old 2009-era firmware) *and* his. We just have
to let Martin implement it. Give him time, I only got a successful
boot out of it an hour ago :)

 I just examined a more recent Areca SAS RAID controller
 and would describe it as the SCSI device from hell. One solution
 to this problem is to modify the arcmsr driver so it returns
 a more consistent set of lies to the management SCSI commands that
 Martin is asking about.

I can't help but notice that something is skewy in its error handling, too.
When the controller errors, even resetting the bus doesn't seem to be
enough to bring it back :/ I've seen errors from it before which did
*not* lead to it imploding forever, but this is apparently not one such.

Certainly Areca-the-company has... issues with communication with the
community (i.e., they don't). A shame I didn't know that before I bought
the controller and made all my data completely dependent on it, really.
Shame, the controller otherwise works very well (fast, and has coped
with a disk failure with aplomb).

-- 
NULL  (void)