We have been trying to format all minidisks from Linux only, and this has turned out to be problematic. I am looking for a solution that would let us stay in Linux without having to involve CMS format for every new minidisk.

Let me first describe the problem. When a record on a DASD has an incorrect cylinder number in its count area, bringing the DASD online leads to "record not found" errors. Since the DASD needs to be online before the problem can be fixed (by formatting), the only workaround I can see is to preformat in CMS. If new minidisks are regularly formatted and destroyed, it is possible to run into a situation where part of the disk has the correct format and part has the wrong cylinder number in the count area.
Here is a way to reproduce:

1) Create a minidisk and format it with CDL, e.g.
   MDISK FBAF 3390 4819 1000 VMBL2H WR
2) Delete it and create a minidisk starting at the next cylinder, but half the size of the first one:
   MDISK FBAF 3390 4820 500 VMBL2H WR
3) Format it with CDL.
4) Delete the disk and create a new one, spanning the first disk except for the first cylinder:
   MDISK FBAF 3390 4820 999 VMBL2H WR
   This will create a disk that has the first half correct, but the rest of the disk has the cylinders off by one in the count area.
5) Link it from Linux and put it online.

When the disk is put online, a large number of "record not found" errors appear in the syslog. On some of our real devices, the errors appear in less than a second and the device can then be formatted. On other real devices, the errors keep appearing over the course of several minutes (the longest I have observed was about 25 minutes). While the errors appear, the device is not usable and cannot be put offline.

Why I think this is a problem (beyond a cluttered syslog):

- The device cannot be put offline until the errors stop appearing. Sometimes dasdfmt with --force stops this, but only as long as the device is present in /dev, which is not always the case.
- While the errors appear, there is contention on the real device where the minidisk is located. Any other Linuxes running from the real device become next to unusable.

There is a fix in the newer kernels that deals with a similar problem:
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3bc9fef9cc1e4047c3a3c51d84cc1c5d2ef03cea

I have tested it, and it seems that the initial check is made on the first few cylinders only; if the count errors are further towards the end of the disk, the problem is still present.
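To make the off-by-one in the reproduction steps concrete, here is a minimal Python sketch of the layout. It is only a model and rests on an assumption: that formatting a minidisk writes the cylinder number relative to the minidisk start into the count area of each real cylinder. The function names and the dictionary standing in for the real device are hypothetical and exist only for illustration.

```python
# Toy model: real cylinder number -> cylinder number stored in the count area.
# Assumption: formatting writes minidisk-relative cylinder numbers.

def format_minidisk(count_area, start, size):
    """Simulate formatting a minidisk that starts at real cylinder `start`."""
    for rel in range(size):
        count_area[start + rel] = rel

def mismatched_cylinders(count_area, start, size):
    """Return relative cylinders whose stored number differs from the expected one."""
    return [rel for rel in range(size) if count_area.get(start + rel) != rel]

count_area = {}
format_minidisk(count_area, 4819, 1000)  # step 1: MDISK ... 4819 1000, formatted
format_minidisk(count_area, 4820, 500)   # steps 2-3: MDISK ... 4820 500, formatted
bad = mismatched_cylinders(count_area, 4820, 999)  # step 4: MDISK ... 4820 999

print(len(bad), bad[0], bad[-1])  # 499 500 998
```

In this model, relative cylinders 0 to 499 of the final disk carry the right number, while cylinders 500 to 998 still carry the value left by the first format, which is one higher than expected.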
Here is an example of the "record not found" error:

Mar 12 05:52:10 kernelts kernel: dasd-eckd 0.0.fbaf: The specified record was not found
Mar 12 05:52:10 kernelts kernel: dasd(eckd): I/O status report for device 0.0.fbaf:
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): in req: 000000001ba4dcf0 CC:00 FC:04 AC:00 SC:17 DS:02 CS:20 fcxs:01 schxs:02 RC:0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): device 0.0.fbaf: Failing TCW: 000000001ba4de40
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->length 64
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->flags d1
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->dcw_offset 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->count 4096
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): residual 4068
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_time 81
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.def_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.queue_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_busy_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_act_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 0- 7: 00 08 00 00 45 e6 3e 00
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 8-15: 00 00 00 00 00 00 00 04
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 16-23: e5 11 6a 27 85 00 0f 00
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 24-31: 00 00 40 e2 00 03 e6 0e
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): 24 Byte: 0 MSG 0, no MSGb to SYSOP
Mar 12 05:52:10 kernelts kernel: Buffer I/O error on device dasdd, logical block 179819

Let me know if I need to supply any more information. Also, can anyone think of a reason why on some real devices the errors appear in seconds and on others it takes such a long time?
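As a rough cross-check, the logical block number in the last log line can be mapped back to a cylinder on the minidisk. The Python sketch below assumes that the logical block is counted in 4096-byte units (consistent with tsb->count 4096, but still an assumption) and uses the 3390 geometry of 15 tracks per cylinder with 12 such blocks per track. The function name is hypothetical.

```python
# 3390 geometry with 4096-byte blocks (the block size is an assumption here).
BLOCKS_PER_TRACK = 12
TRACKS_PER_CYLINDER = 15

def locate_block(logical_block):
    """Map a 0-based logical block to (cylinder, head, record), record 1-based."""
    track, index_in_track = divmod(logical_block, BLOCKS_PER_TRACK)
    cylinder, head = divmod(track, TRACKS_PER_CYLINDER)
    return cylinder, head, index_in_track + 1

print(locate_block(179819))  # (998, 14, 12)
```

Under these assumptions, block 179819 is the very last block of a 999-cylinder disk (999 * 15 * 12 = 179820 blocks), so this read would land in the second half of the disk, where the count area is expected to be off by one.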
Thanks,
Tomas

Tomas Pavelka
CA Technologies
Sr Software Engineer
Tel: +420226207796
[email protected]

----------------------------------------------------------------------
For LINUX-390 subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO LINUX-390 or visit
http://www.marist.edu/htbin/wlvindex?LINUX-390
----------------------------------------------------------------------
For more information on Linux on System z, visit
http://wiki.linuxvm.org/
