Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 11:20 PM, Nix wrote:
> On 30 Jul 2013, Bernd Schubert told this:
>> On 07/30/2013 02:56 AM, Nix wrote:
>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>> Please supply the information that Martin Petersen asked for.
>>>
>>> Did it in private IRC (the advantage of working for the same division of the same company!)
>>>
>>> I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.
>>
>> Unless there is a very simple fix, the commit should be reverted, imho. It would be better then to remove write-same support from the md-layer.
>
> I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason.

I didn't check other cases yet, but mkfs.ext4 does WRITE SAME, and with lazy init it will also happen after mounting the file system, while lazy init is running (inode zeroing).

Cheers,
Bernd

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 1 Aug 2013, Bernd Schubert verbalised:
> On 07/30/2013 11:20 PM, Nix wrote:
>> On 30 Jul 2013, Bernd Schubert told this:
>>> On 07/30/2013 02:56 AM, Nix wrote:
>>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>>> Please supply the information that Martin Petersen asked for.
>>>>
>>>> Did it in private IRC (the advantage of working for the same division of the same company!)
>>>>
>>>> I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.
>>>
>>> Unless there is a very simple fix, the commit should be reverted, imho. It would be better then to remove write-same support from the md-layer.
>>
>> I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason.
>
> I didn't check other cases yet, but mkfs.ext4 does WRITE SAME, and with lazy init it will also happen after mounting the file system, while lazy init is running (inode zeroing).

Well, it'll happen the first few times you mount the fs. If your fs is years old (as mine are) the inode tables will probably have been initialized by now!

--
NULL (void)
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 08/01/2013 06:04 PM, Nix wrote:
> On 1 Aug 2013, Bernd Schubert verbalised:
>> On 07/30/2013 11:20 PM, Nix wrote:
>>> On 30 Jul 2013, Bernd Schubert told this:
>>>> On 07/30/2013 02:56 AM, Nix wrote:
>>>>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>>>>> Please supply the information that Martin Petersen asked for.
>>>>>
>>>>> Did it in private IRC (the advantage of working for the same division of the same company!)
>>>>>
>>>>> I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.
>>>>
>>>> Unless there is a very simple fix, the commit should be reverted, imho. It would be better then to remove write-same support from the md-layer.
>>>
>>> I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason.
>>
>> I didn't check other cases yet, but mkfs.ext4 does WRITE SAME, and with lazy init it will also happen after mounting the file system, while lazy init is running (inode zeroing).
>
> Well, it'll happen the first few times you mount the fs. If your fs is years old (as mine are) the inode tables will probably have been initialized by now!

I'm frequently doing tests with millions of files, and reformatting is way faster than deleting all these files.
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/31/2013 05:15 AM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <bernd.schub...@fastmail.fm> writes:
>
> Bernd,
>
>>> Product revision level: R001
>
> It's clearly not verbatim passthrough...
>
> Bernd> Besides the firmware, the difference might be that I'm exporting
> Bernd> single disks without any areca-raidset in between. I can try to
> Bernd> confirm that tomorrow, I just need the system as it is till
> Bernd> tomorrow noon.
>
> That would be a great data point. I don't have any Areca boards.

Just tested it: the areca-raidset does not make a difference, but the firmware version does. After downgrading to 1.46 I have the same issue. It is getting a bit late for me, but as this is a pure development system, which is also booted over nfs, I can investigate it tomorrow.

Cheers,
Bernd
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 01:34 AM, Martin K. Petersen wrote:
>>>>>> "Nix" == Nix <n...@esperi.org.uk> writes:
>
> Bernd,
>
> Nix> I can now confirm that reverting this commit causes this problem to
> Nix> go away, and my machine boots fine again.
>
> Can you please send me the output of sg_inq with your 1.49 firmware?
>
> I made a tweak that allowed Nix to boot but we're trying to find a good blacklist trigger. And that's tricky given that Areca allows you to manually specify the SCSI model string for each volume...

Sorry it got a bit late today. Here it is.

(wheezy)fslab1:~# sg_inq -v /dev/sdc
    inquiry cdb: 12 00 00 00 24 00
standard INQUIRY:
    inquiry cdb: 12 00 00 00 60 00
  PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
  [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2
  SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 BQue=0
  EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
  [RelAdr=0] WBus16=1 Sync=0 Linked=0 [TranDis=0] CmdQue=1
  [SPI: Clocking=0x3 QAS=0 IUS=0]
    length=96 (0x60) Peripheral device type: disk
 Vendor identification: Hitachi
 Product identification: HDS724040KLSA80
 Product revision level: R001
    inquiry cdb: 12 01 00 00 fc 00
    inquiry cdb: 12 01 80 00 fc 00
 Unit serial number: KRFS2CRAHXJZVD

Besides the firmware, the difference might be that I'm exporting single disks without any areca-raidset in between. I can try to confirm that tomorrow, I just need the system as it is till tomorrow noon.

Cheers,
Bernd
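Editorial sketch: the `inquiry cdb:` lines printed by `sg_inq -v` above can be decoded by hand. The following Python is a minimal illustration, assuming the standard 6-byte INQUIRY layout (opcode 0x12, EVPD bit in byte 1, page code in byte 2, allocation length in bytes 3-4). The function names are hypothetical and not part of sg3_utils.

```python
def decode_inquiry_cdb(hex_bytes: str) -> dict:
    """Decode a 6-byte INQUIRY CDB given as a hex string such as '12 01 80 00 fc 00'."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 6 or cdb[0] != 0x12:
        raise ValueError("not a 6-byte INQUIRY CDB")
    return {
        "evpd": bool(cdb[1] & 0x01),           # 1 = a VPD page is requested
        "page_code": cdb[2],                   # only meaningful when evpd is set
        "allocation_length": (cdb[3] << 8) | cdb[4],
        "control": cdb[5],
    }


def build_inquiry_cdb(evpd: bool, page_code: int, allocation_length: int) -> str:
    """Build the hex form of an INQUIRY CDB (hypothetical helper for illustration)."""
    cdb = bytes([
        0x12,
        0x01 if evpd else 0x00,
        page_code & 0xFF,
        (allocation_length >> 8) & 0xFF,
        allocation_length & 0xFF,
        0x00,
    ])
    return " ".join(f"{b:02x}" for b in cdb)


if __name__ == "__main__":
    # The four CDBs from the sg_inq -v output quoted in this message.
    for text in ("12 00 00 00 24 00", "12 00 00 00 60 00",
                 "12 01 00 00 fc 00", "12 01 80 00 fc 00"):
        print(text, "->", decode_inquiry_cdb(text))
    # The ATA Information VPD page discussed in the thread is page 0x89.
    print(build_inquiry_cdb(True, 0x89, 0xFC))
```

Read this way, the first two CDBs are standard INQUIRY requests for 36 and 96 bytes, the third asks for VPD page 0x00 (the list of supported VPD pages) and the fourth for page 0x80 (unit serial number).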
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/30/2013 02:56 AM, Nix wrote:
> On 30 Jul 2013, Douglas Gilbert outgrape:
>> Please supply the information that Martin Petersen asked for.
>
> Did it in private IRC (the advantage of working for the same division of the same company!)
>
> I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.

Unless there is a very simple fix, the commit should be reverted, imho. It would be better then to remove write-same support from the md-layer.

Cheers,
Bernd
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 30 Jul 2013, Bernd Schubert told this:
> On 07/30/2013 02:56 AM, Nix wrote:
>> On 30 Jul 2013, Douglas Gilbert outgrape:
>>> Please supply the information that Martin Petersen asked for.
>>
>> Did it in private IRC (the advantage of working for the same division of the same company!)
>>
>> I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.
>
> Unless there is a very simple fix, the commit should be reverted, imho. It would be better then to remove write-same support from the md-layer.

I'm not using md on that machine, just LVM. Our suspicion is that ext4 is doing a WRITE SAME for some reason.

--
NULL (void)
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 30 Jul 2013, Bernd Schubert told this:
> On 07/30/2013 01:34 AM, Martin K. Petersen wrote:
> (wheezy)fslab1:~# sg_inq -v /dev/sdc
>     inquiry cdb: 12 00 00 00 24 00
> standard INQUIRY:
>     inquiry cdb: 12 00 00 00 60 00
>   PQual=0 Device_type=0 RMB=0 version=0x05 [SPC-3]
>   [AERC=0] [TrmTsk=0] NormACA=0 HiSUP=0 Resp_data_format=2
>   SCCS=0 ACC=0 TPGS=0 3PC=0 Protect=0 BQue=0
>   EncServ=0 MultiP=0 [MChngr=0] [ACKREQQ=0] Addr16=1
>   [RelAdr=0] WBus16=1 Sync=0 Linked=0 [TranDis=0] CmdQue=1
>   [SPI: Clocking=0x3 QAS=0 IUS=0]
>     length=96 (0x60) Peripheral device type: disk
>  Vendor identification: Hitachi
>  Product identification: HDS724040KLSA80
>  Product revision level: R001
>     inquiry cdb: 12 01 00 00 fc 00
>     inquiry cdb: 12 01 80 00 fc 00
>  Unit serial number: KRFS2CRAHXJZVD
>
> Besides the firmware, the difference might be that I'm exporting single disks without any areca-raidset in between. I can try to confirm that tomorrow, I just need the system as it is till tomorrow noon.

Aaah. Yeah, it looks like in JBOD mode it's just passing things straight on to the disk: that vendor ID is a dead giveaway. For all I know my earlier firmware does the same, but for obvious reasons I can't really test that!

Quite possibly it's passing *everything* on to the disk, including all SCSI commands, in which case we don't actually know that your Areca controller supports the VPD page we thought it did: quite possibly only this underlying disk does.

You can get a degree of info on the underlying disks in the array even if it's in RAID mode -- smartctl does it, for instance -- but it takes Areca-specific code and chattering to the sg devices directly. I bet that in JBOD mode, the sg device is the only exposure the controller has to the world, and *all* the /dev/sd* devices are just passthroughs.

--
NULL (void)
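Editorial sketch: the vendor, product and revision strings that the discussion above treats as a giveaway live at fixed offsets in the standard INQUIRY response (bytes 8-15, 16-31 and 32-35). The Python below is a minimal illustration of extracting them. The sample buffer is rebuilt from the fields printed by sg_inq in this thread and is an assumption, not a captured response; the function name is hypothetical.

```python
def parse_standard_inquiry(data: bytes) -> dict:
    """Extract the identification fields from a standard INQUIRY response."""
    if len(data) < 36:
        raise ValueError("standard INQUIRY data is at least 36 bytes")
    return {
        "peripheral_device_type": data[0] & 0x1F,   # 0 = direct-access (disk)
        "version": data[2],                         # 0x05 corresponds to SPC-3
        "vendor": data[8:16].decode("ascii", "replace").strip(),
        "product": data[16:32].decode("ascii", "replace").strip(),
        "revision": data[32:36].decode("ascii", "replace").strip(),
    }


def make_sample_response() -> bytes:
    """Rebuild a 36-byte response from the values shown in the thread (illustrative only)."""
    buf = bytearray(36)
    buf[0] = 0x00                              # PQual=0, Device_type=0
    buf[2] = 0x05                              # version=0x05 [SPC-3]
    buf[3] = 0x02                              # Resp_data_format=2
    buf[4] = 96 - 5                            # additional length for a 96-byte response
    buf[8:16] = b"Hitachi".ljust(8)            # space-padded, as the fields are on the wire
    buf[16:32] = b"HDS724040KLSA80".ljust(16)
    buf[32:36] = b"R001"
    return bytes(buf)


if __name__ == "__main__":
    print(parse_standard_inquiry(make_sample_response()))
```

A controller that synthesizes its own identity would normally place its own strings in these fields, which is why a disk vendor name here suggests the response is derived from the underlying drive.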
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Bernd" == Bernd Schubert <bernd.schub...@fastmail.fm> writes:

Bernd,

>> Product revision level: R001

It's clearly not verbatim passthrough...

Bernd> Besides the firmware, the difference might be that I'm exporting
Bernd> single disks without any areca-raidset in between. I can try to
Bernd> confirm that tomorrow, I just need the system as it is till
Bernd> tomorrow noon.

That would be a great data point. I don't have any Areca boards.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Nick" == Nick Alcock <nick.alc...@esperi.org.uk> writes:

Nick> in which case we don't actually know that your Areca controller
Nick> supports the VPD page we thought it did: quite possibly only this
Nick> underlying disk does.

The ATA Information VPD page is created by the SCSI-ATA Translation layer. The controller firmware in this case.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
Hi Nick,

On 07/29/2013 12:10 PM, Nick Alcock wrote:
> My server's ARC-1210 has been working fine for years, but when I upgraded from 3.10.1, it started failing:
>
> Instead of
>
> [0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
> [0.804028] scsi0 : Areca SATA Host Adapter RAID Controller Driver Version 1.20.00.15 2010/08/05
> [...]
> [4.111770] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [4.115399] sd 7:0:0:1: [sdd] No Caching mode page present
> [4.115401] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [4.118081] sdd: sdd1
> [4.124363] sd 7:0:0:1: [sdd] No Caching mode page present
> [4.124601] sd 7:0:0:1: [sdd] Assuming drive cache: write through
> [4.124867] sd 7:0:0:1: [sdd] Attached SCSI removable disk
>
> I now see (timestamps and some of the right edge chopped off because not captured on my camera, no netconsole as this machine has all my storage and is my loghost, and with this bug it can't get at any of that storage).
>
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sdd: sdd1
> sd 7:0:0:1: [sdd] No Caching mode page present
> sd 7:0:0:1: [sdd] Assuming drive cache: write through
> sd 7:0:0:1: [sdd] Attached SCSI removable disk
> arcmsr0: abort device command of scsi id = 0 lun = 1
> arcmsr0: abort device command of scsi id = 0 lun = 0
> arcmsr: executing bus reset eh.num_resets=0, num_[...]
> arcmsr0: wait 'abort all outstanding command' timeout
> arcmsr0: executing hw bus reset
> arcmsr0: waiting for hw bus reset return, retry=0
> arcmsr0: waiting for hw bus reset return, retry=1
> Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
> arcmsr: scsi bus reset eh returns with success
>
> [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes]
>
> The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan).
>
> I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting
>
> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
> Author: Martin K. Petersen <martin.peter...@oracle.com>
> Date: Thu Jun 6 22:15:55 2013 -0400
>
>     SCSI: sd: Update WRITE SAME heuristics
>
> so my, admittedly largely baseless, suspicions currently fall there.
>
> Obviously, at this point, this machine has no modules loaded (it has almost none loaded even when fully operational)

I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1. And I don't think this commit can cause your issue at all: a failing heuristic would enable WRITE SAME and would cause issues with linux-md, but nothing should happen directly in the scsi-layer. Which was your last working kernel version?

Thanks,
Bernd
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 29 Jul 2013, Bernd Schubert said:
> Hi Nick,
>
> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>> arcmsr0: abort device command of scsi id = 0 lun = 1
>> arcmsr0: abort device command of scsi id = 0 lun = 0
>> arcmsr: executing bus reset eh.num_resets=0, num_[...]
>> arcmsr0: wait 'abort all outstanding command' timeout
>> arcmsr0: executing hw bus reset
>> arcmsr0: waiting for hw bus reset return, retry=0
>> arcmsr0: waiting for hw bus reset return, retry=1
>> Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
>> arcmsr: scsi bus reset eh returns with success
>>
>> [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes]
>>
>> The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan).
>>
>> I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting
>>
>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>> Author: Martin K. Petersen <martin.peter...@oracle.com>
>> Date: Thu Jun 6 22:15:55 2013 -0400
[...]
>> Obviously, at this point, this machine has no modules loaded (it has almost none loaded even when fully operational)
>
> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1.

... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried 3.10.2.)

> And I don't think this commit can cause your issue at all: a failing heuristic would enable WRITE SAME and would cause issues with linux-md, but nothing should happen directly in the scsi-layer. Which was your last working kernel version?

3.10.1. :)

No changes to arcmsr between those versions... I suspect I'll have to bisect, which will be a complete pig because every failure means a hard powerdown of this box. Always-on servers rarely appreciate hard powerdowns :(

--
NULL (void)
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 07/29/2013 03:05 PM, Nix wrote:
> On 29 Jul 2013, Bernd Schubert said:
>> Hi Nick,
>>
>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>> arcmsr: executing bus reset eh.num_resets=0, num_[...]
>>> arcmsr0: wait 'abort all outstanding command' timeout
>>> arcmsr0: executing hw bus reset
>>> arcmsr0: waiting for hw bus reset return, retry=0
>>> arcmsr0: waiting for hw bus reset return, retry=1
>>> Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
>>> arcmsr: scsi bus reset eh returns with success
>>>
>>> [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes]
>>>
>>> The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan).
>>>
>>> I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting
>>>
>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>> Author: Martin K. Petersen <martin.peter...@oracle.com>
>>> Date: Thu Jun 6 22:15:55 2013 -0400
> [...]
>>> Obviously, at this point, this machine has no modules loaded (it has almost none loaded even when fully operational)
>>
>> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1.
>
> ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried 3.10.2.)

Hmm, indeed that points to this commit. I just don't see what could fail there. Could you try to run these commands with 3.10.1?

# check if reporting opcodes works
# sg_opcodes -v -n /dev/sdX

# check ata information page
# sg_vpd --page=0x89 /dev/sdX

>> And I don't think this commit can cause your issue at all: a failing heuristic would enable WRITE SAME and would cause issues with linux-md, but nothing should happen directly in the scsi-layer. Which was your last working kernel version?
>
> 3.10.1. :)

Whoops, sorry, I missed that in your first sentence.

> No changes to arcmsr between those versions... I suspect I'll have to bisect, which will be a complete pig because every failure means a hard powerdown of this box. Always-on servers rarely appreciate hard powerdowns :(

Maybe just revert this commit? Some scsi logging would be helpful to see which command actually fails. I guess you don't have a serial console?

Thanks,
Bernd
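Editorial sketch: the "scsi logging" suggested above is usually controlled by a single integer, set through the `scsi_mod.scsi_logging_level=` boot parameter or `/proc/sys/dev/scsi/logging_level`, on kernels built with CONFIG_SCSI_LOGGING. The Python below shows how such a value can be composed from 3-bit per-facility levels. The shift table is written from memory of `drivers/scsi/scsi_logging.h` and should be treated as an assumption to verify against the kernel tree in use; the function name is hypothetical.

```python
# Bit offsets of each 3-bit logging facility (assumed; verify against
# drivers/scsi/scsi_logging.h for the kernel being debugged).
SCSI_LOG_SHIFTS = {
    "error": 0,
    "timeout": 3,
    "scan": 6,
    "mlqueue": 9,
    "mlcomplete": 12,
    "llqueue": 15,
    "llcomplete": 18,
    "hlqueue": 21,
    "hlcomplete": 24,
    "ioctl": 27,
}


def scsi_logging_level(**levels: int) -> int:
    """Combine per-facility levels (0-7) into one logging_level integer."""
    value = 0
    for name, level in levels.items():
        if name not in SCSI_LOG_SHIFTS:
            raise ValueError(f"unknown logging facility: {name}")
        if not 0 <= level <= 7:
            raise ValueError(f"level for {name} must be in 0..7")
        value |= level << SCSI_LOG_SHIFTS[name]
    return value


if __name__ == "__main__":
    # Log error handling plus mid-layer command queueing and completion,
    # which is typically enough to see which command precedes a timeout.
    level = scsi_logging_level(error=3, mlqueue=3, mlcomplete=3)
    print(f"scsi_mod.scsi_logging_level={level}  (0x{level:x})")
```

Because the failure in this thread happens during early boot, the boot-parameter form would be the relevant one; the output would still need a console that survives the hang to be captured.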
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Nick" == Nick Alcock <n...@esperi.org.uk> writes:

Nick> My server's ARC-1210 has been working fine for years, but when I
Nick> upgraded from 3.10.1, it started failing:

Nick> [ 0.784044] Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
Nick> [ 0.804028] scsi0 : Areca SATA Host Adapter RAID Controller
Nick> Driver Version 1.20.00.15 2010/08/05
Nick> [...]

Interesting. Please provide the output of:

# sg_inq /dev/sdd
# sg_vpd /dev/sdd
# sg_vpd -p ai /dev/sdd

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Bernd" == Bernd Schubert <bernd.schub...@fastmail.fm> writes:

Bernd> I tested this patch with ARC-1260 and F/W V1.49, no issues.

It could be due to the firmware version discrepancy.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 29 Jul 2013, Bernd Schubert spake thusly:
> On 07/29/2013 03:05 PM, Nix wrote:
>> On 29 Jul 2013, Bernd Schubert said:
>>> I tested this patch with ARC-1260 and F/W V1.49, no issues. Also, this patch is only in 3.10.3, but not yet in 3.10.1.
>>
>> ... and I see this problem with 3.10.3 but not 3.10.1. (Haven't tried 3.10.2.)
>
> Hmm, indeed that points to this commit. I just don't see what could fail there. Could you try to run these commands with 3.10.1?
>
> # check if reporting opcodes works
> # sg_opcodes -v -n /dev/sdX
>
> # check ata information page
> # sg_vpd --page=0x89 /dev/sdX

If this might cause the same problem I think I'd better wait until work is done for the day and the machine is no longer loaded, and can be rebooted without harm...

>> No changes to arcmsr between those versions... I suspect I'll have to bisect, which will be a complete pig because every failure means a hard powerdown of this box. Always-on servers rarely appreciate hard powerdowns :(
>
> Maybe just revert this commit? Some scsi logging would be helpful to see which command actually fails. I guess you don't have a serial console?

Not at that stage, no! And, yes, a test revert of this one commit will be the first thing I try this evening / tomorrow morning (depending on system load).

--
NULL (void)
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 29 Jul 2013, Bernd Schubert spake thusly:
> Could you try to run these commands with 3.10.1?
>
> # check if reporting opcodes works
> # sg_opcodes -v -n /dev/sdX

spindle:/boot# sg_opcodes -v -n /dev/sda
    inquiry cdb: 12 00 00 00 24 00
    Report Supported Operation Codes cmd: a3 0c 00 00 00 00 00 00 20 00 00 00
Report Supported Operation Codes: Fixed format, current; Sense key: Illegal Request
 Additional sense: Invalid command operation code
  Info fld=0x0 [0]
  Sense Key Specific: Error in Command byte 3840
Report supported operation codes: operation not supported

(sdb is the same, obviously, since they are both separate RAID volumes controlled by the same controller.)

> # check ata information page
> # sg_vpd --page=0x89 /dev/sdX

spindle:/boot# sg_vpd --page=0x89 /dev/sda
ATA information VPD page:
fetching VPD page failed

Not very helpful, I know :( I'll try rebooting into a kernel with that commit reverted next. Areca controllers appear to be a bit weird: e.g. they needed special support in smartctl...

>> No changes to arcmsr between those versions... I suspect I'll have to bisect, which will be a complete pig because every failure means a hard powerdown of this box. Always-on servers rarely appreciate hard powerdowns :(
>
> Maybe just revert this commit? Some scsi logging would be helpful to see which command actually fails. I guess you don't have a serial console?

I could set one up, in theory, but the problem is that all my machines are rather dependent on my NFS-mounted $HOME. Guess where it's mounted from... in any case, the machine has no serial port, so it would have to be a usb-serial console, and we know exactly how reliable those are :/

--
NULL (void)
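Editorial sketch: the `Report Supported Operation Codes cmd:` bytes printed by `sg_opcodes -v` above can be unpacked to see exactly what was asked of the controller. The Python below is a minimal illustration, assuming the SPC layout of this 12-byte command (opcode 0xA3 MAINTENANCE IN, service action 0x0C, reporting options in byte 2, allocation length in bytes 6-9). The function name is hypothetical.

```python
def decode_rsoc_cdb(hex_bytes: str) -> dict:
    """Decode a REPORT SUPPORTED OPERATION CODES CDB given as a hex string."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 12:
        raise ValueError("expected a 12-byte CDB")
    if cdb[0] != 0xA3 or (cdb[1] & 0x1F) != 0x0C:
        raise ValueError("not MAINTENANCE IN / REPORT SUPPORTED OPERATION CODES")
    return {
        "rctd": bool(cdb[2] & 0x80),                 # return command timeouts descriptor
        "reporting_options": cdb[2] & 0x07,          # 0 = list all supported commands
        "requested_opcode": cdb[3],
        "requested_service_action": int.from_bytes(cdb[4:6], "big"),
        "allocation_length": int.from_bytes(cdb[6:10], "big"),
        "control": cdb[11],
    }


if __name__ == "__main__":
    # The CDB from the sg_opcodes -v output quoted in this message.
    print(decode_rsoc_cdb("a3 0c 00 00 00 00 00 00 20 00 00 00"))
```

Decoded this way, the request asks for the full list of supported commands with an 8192-byte buffer, and the ARC-1210 rejects the opcode itself with ILLEGAL REQUEST rather than returning a list.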
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Nix" == Nix <n...@esperi.org.uk> writes:

Nix> spindle:/boot# sg_vpd --page=0x89 /dev/sda
Nix> ATA information VPD page: fetching VPD page failed

Please add -v

I'll also need the output of:

# sg_vpd -vl

Nix> I'll try rebooting into a kernel with that commit reverted next.

Doesn't matter as far as the sg commands are concerned...

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 29 Jul 2013, Bernd Schubert uttered the following:
> On 07/29/2013 03:05 PM, Nix wrote:
>> On 29 Jul 2013, Bernd Schubert said:
>>> Hi Nick,
>>>
>>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>>> arcmsr: executing bus reset eh.num_resets=0, num_[...]
>>>> arcmsr0: wait 'abort all outstanding command' timeout
>>>> arcmsr0: executing hw bus reset
>>>> arcmsr0: waiting for hw bus reset return, retry=0
>>>> arcmsr0: waiting for hw bus reset return, retry=1
>>>> Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
>>>> arcmsr: scsi bus reset eh returns with success
>>>>
>>>> [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes]
>>>>
>>>> The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan).
>>>>
>>>> I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting
>>>>
>>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>>> Author: Martin K. Petersen <martin.peter...@oracle.com>
>>>> Date: Thu Jun 6 22:15:55 2013 -0400

I can now confirm that reverting this commit causes this problem to go away, and my machine boots fine again. Please revert (and figure out what is wrong so that 3.11 doesn't implode in the same way? I'm happy to assist...)

(My apologies if a 'please revert' from someone bitten by a stable regression isn't adequate reason to revert the thing: I've never been quite sure who should report regressions in stable patches to Greg. It should at least be *evidence*. So here's my "it crashed and now it doesn't" evidence. :} )

--
NULL (void)
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
>>>>> "Nix" == Nix <n...@esperi.org.uk> writes:

Bernd,

Nix> I can now confirm that reverting this commit causes this problem to
Nix> go away, and my machine boots fine again.

Can you please send me the output of sg_inq with your 1.49 firmware?

I made a tweak that allowed Nix to boot but we're trying to find a good blacklist trigger. And that's tricky given that Areca allows you to manually specify the SCSI model string for each volume...

--
Martin K. Petersen	Oracle Linux Engineering
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 13-07-29 05:09 PM, Nix wrote:
> On 29 Jul 2013, Bernd Schubert uttered the following:
>> On 07/29/2013 03:05 PM, Nix wrote:
>>> On 29 Jul 2013, Bernd Schubert said:
>>>> Hi Nick,
>>>>
>>>> On 07/29/2013 12:10 PM, Nick Alcock wrote:
>>>>> arcmsr0: abort device command of scsi id = 0 lun = 1
>>>>> arcmsr0: abort device command of scsi id = 0 lun = 0
>>>>> arcmsr: executing bus reset eh.num_resets=0, num_[...]
>>>>> arcmsr0: wait 'abort all outstanding command' timeout
>>>>> arcmsr0: executing hw bus reset
>>>>> arcmsr0: waiting for hw bus reset return, retry=0
>>>>> arcmsr0: waiting for hw bus reset return, retry=1
>>>>> Areca RAID Controller0: F/W V1.46 2009-01-06 Model ARC-1210
>>>>> arcmsr: scsi bus reset eh returns with success
>>>>>
>>>>> [and back to the top of the error messages again, apparently forever, not that the machine would be much use without its RAID array even if this loop terminated at some point, so I only gave it a couple of minutes]
>>>>>
>>>>> The failure happens precisely at the moment we transition to early userspace, so presumably userspace I/O is failing (or something related to raw device access, perhaps, since the first thing it does is a vgscan).
>>>>>
>>>>> I haven't bisected yet (sorry, I have work to do which means this machine must be running right now), but nothing has changed in the arcmsr controller, nor in SCSI-land excepting
>>>>>
>>>>> commit 98dcc2946adbe4349ef1ef9b99873b912831edd4
>>>>> Author: Martin K. Petersen <martin.peter...@oracle.com>
>>>>> Date: Thu Jun 6 22:15:55 2013 -0400
>
> I can now confirm that reverting this commit causes this problem to go away, and my machine boots fine again. Please revert (and figure out what is wrong so that 3.11 doesn't implode in the same way? I'm happy to assist...)

Hi,

Please supply the information that Martin Petersen asked for.

I just examined a more recent Areca SAS RAID controller and would describe it as the SCSI device from hell. One solution to this problem is to modify the arcmsr driver so it returns a more consistent set of lies to the management SCSI commands that Martin is asking about.

Doug Gilbert
Re: [SCSI REGRESSION] 3.10.2 or 3.10.3: arcmsr failure at bootup / early userspace transition
On 30 Jul 2013, Douglas Gilbert outgrape:
> Please supply the information that Martin Petersen asked for.

Did it in private IRC (the advantage of working for the same division of the same company!)

I didn't realise the original fix was actually implemented to allow Bernd, with a different Areca controller, to boot... obviously, in that situation, reversion is wrong, since that would just replace one won't-boot situation with another.

It looks like a solution is possible that will let us boot *both* my controller (with its old 2009-era firmware) *and* his. We just have to let Martin implement it. Give him time, I only got a successful boot out of it an hour ago :)

> I just examined a more recent Areca SAS RAID controller and would describe it as the SCSI device from hell. One solution to this problem is to modify the arcmsr driver so it returns a more consistent set of lies to the management SCSI commands that Martin is asking about.

I can't help notice that something is skewy in its error handling, too. When the controller errors, even resetting the bus doesn't seem to be enough to bring it back :/ I've seen errors from it before which did *not* lead to it imploding forever, but this is apparently not one such.

Certainly Areca-the-company has... issues with communication with the community (i.e., they don't). A shame I didn't know that before I bought the controller and made all my data completely dependent on it, really. Shame, the controller otherwise works very well (fast, and has coped with a disk failure with aplomb).

--
NULL (void)