We have been trying to format all minidisks from Linux only, and this has turned 
out to be problematic. I am looking for a solution that would let us stay in 
Linux without having to run CMS FORMAT for every new minidisk. Let me first 
describe the problem:
When a record on a dasd has an incorrect cylinder number in its count area, 
bringing the dasd online leads to "record not found" errors. Since the dasd 
needs to be online before the problem can be fixed (by formatting), the only 
way around this that I can see is to preformat in CMS.
If new minidisks are regularly formatted and destroyed, it is possible to run 
into a situation where part of the disk has the correct format and part has the 
wrong cylinder number in the count area.

Here is a way to reproduce:


1) Create a minidisk and format it with CDL, e.g.

MDISK FBAF 3390 4819 1000 VMBL2H WR

2) Delete it and create a minidisk starting at the next cylinder, but half the 
size of the first one:

MDISK FBAF 3390 4820 500 VMBL2H WR

3) Format it with CDL

4) Delete the disk and create a new one, spanning the first disk except for the 
first cylinder:

MDISK FBAF 3390 4820 999 VMBL2H WR

This will create a disk that has the first half correct, but the rest of the 
disk has the cylinders off by one in the count area.
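
The off-by-one can be worked through with a small sketch. It assumes (this is 
only implied by the symptoms, not confirmed) that a format from Linux writes 
into each count area the cylinder number relative to the start of the minidisk 
being formatted, and that deleting a minidisk leaves the old count areas on the 
real volume untouched. The function names are made up for illustration.

```python
def format_minidisk(volume, start, size):
    """Model a format: store the minidisk-relative cylinder for each real cylinder."""
    for real_cyl in range(start, start + size):
        volume[real_cyl] = real_cyl - start


def find_mismatches(volume, start, size):
    """Return {virtual_cyl: stored_cyl} for cylinders whose stored number is wrong."""
    bad = {}
    for real_cyl in range(start, start + size):
        virtual_cyl = real_cyl - start
        stored = volume.get(real_cyl)
        if stored != virtual_cyl:
            bad[virtual_cyl] = stored
    return bad


volume = {}                          # real cylinder -> cylinder number in count area
format_minidisk(volume, 4819, 1000)  # step 1
format_minidisk(volume, 4820, 500)   # steps 2 and 3
bad = find_mismatches(volume, 4820, 999)  # step 4: new disk, not yet formatted

print(len(bad), min(bad), max(bad))  # 499 500 998
print(bad[min(bad)])                 # 501, i.e. one higher than expected
```

Under these assumptions, cylinders 0 to 499 of the new disk carry the right 
number and cylinders 500 to 998 carry a number that is one too high.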

5) Link it from Linux, and put it online



When the disk is put online, a large number of "record not found" errors appear 
in the syslog. On some of our real devices, the errors are over in less than a 
second and the device can then be formatted. On other real devices, the errors 
keep appearing for several minutes (the longest I have observed was about 25 
minutes). While the errors appear, the device is not usable and cannot be put 
offline.
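
To compare devices, the length of such an error burst can be measured from the 
syslog. This is only a sketch: it assumes classic syslog timestamps without a 
year (so a placeholder year is supplied), one message per line, and the 
dasd-eckd message text "The specified record was not found". The function name 
and the second timestamp in the sample are invented for illustration.

```python
from datetime import datetime

MARKER = "The specified record was not found"


def error_window(lines, device, year=2000):
    """Return (count, seconds between first and last matching message)."""
    stamps = []
    for line in lines:
        if MARKER in line and f"dasd-eckd {device}:" in line:
            # Syslog timestamps carry no year; 'year' is a placeholder assumption.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
            stamps.append(stamp)
    if not stamps:
        return 0, 0.0
    return len(stamps), (max(stamps) - min(stamps)).total_seconds()


sample = [
    "Mar 12 05:52:10 kernelts kernel: dasd-eckd 0.0.fbaf: The specified record was not found",
    "Mar 12 05:52:10 kernelts kernel: dasd(eckd): I/O status report for device 0.0.fbaf:",
    # Invented line, 30 seconds later, only to exercise the function:
    "Mar 12 05:52:40 kernelts kernel: dasd-eckd 0.0.fbaf: The specified record was not found",
]
print(error_window(sample, "0.0.fbaf"))  # (2, 30.0)
```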



Why I think this is a problem (beyond cluttered syslog):

- The device cannot be put offline until the errors stop appearing. Sometimes 
dasdfmt with --force stops this, but only as long as the device is present in 
/dev, which is not always the case.

- While the errors appear, there is contention on the real device where the 
minidisk is located. Any other Linux guests running from that real device become 
next to unusable.



There is a fix in the newer kernels that deals with a similar problem:

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=3bc9fef9cc1e4047c3a3c51d84cc1c5d2ef03cea



I have tested it, and it seems that the initial check is made on the first few 
cylinders only. If the count errors are further towards the end of the disk, 
the problem is still present.

Here is an example of the "record not found" error:

Mar 12 05:52:10 kernelts kernel: dasd-eckd 0.0.fbaf: The specified record was not found
Mar 12 05:52:10 kernelts kernel: dasd(eckd): I/O status report for device 0.0.fbaf:
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): in req: 000000001ba4dcf0 CC:00 FC:04 AC:00 SC:17 DS:02 CS:20 fcxs:01 schxs:02 RC:0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): device 0.0.fbaf: Failing TCW: 000000001ba4de40
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->length 64
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->flags d1
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->dcw_offset 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->count 4096
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): residual 4068
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_time 81
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.def_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.queue_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_busy_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): tsb->tsa.iostat.dev_act_time 0
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex)  0- 7: 00 08 00 00 45 e6 3e 00
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex)  8-15: 00 00 00 00 00 00 00 04
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 16-23: e5 11 6a 27 85 00 0f 00
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): Sense(hex) 24-31: 00 00 40 e2 00 03 e6 0e
Mar 12 05:52:10 kernelts kernel: <3>dasd(eckd): 24 Byte: 0 MSG 0, no MSGb to SYSOP
Mar 12 05:52:10 kernelts kernel: Buffer I/O error on device dasdd, logical block 179819
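
As a rough cross-check, the logical block in the last message can be mapped 
back to a cylinder. The sketch assumes 4096-byte blocks (suggested by 
tsb->count 4096, but not confirmed), which gives 12 blocks per track on a 3390, 
together with the 3390 geometry of 15 tracks per cylinder. It also assumes the 
block number counts from the start of the whole device dasdd. The function name 
is hypothetical.

```python
BLOCKS_PER_TRACK = 12     # 3390 track formatted with 4096-byte blocks (assumed)
TRACKS_PER_CYLINDER = 15  # 3390 geometry


def block_to_cht(block):
    """Map a 0-based logical block to (cylinder, head, record on track)."""
    track, index = divmod(block, BLOCKS_PER_TRACK)
    cylinder, head = divmod(track, TRACKS_PER_CYLINDER)
    return cylinder, head, index + 1  # records on a track are numbered from 1


print(block_to_cht(179819))  # (998, 14, 12)
```

If the log comes from the 999-cylinder disk of step 4, this is the very last 
block of the disk, which lies in the part that still has the old count areas.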

Let me know if I need to supply any more information. Also, can anyone think of 
a reason why on some real devices the errors appear in seconds and on others it 
takes such a long time?

Thanks,
Tomas

Tomas Pavelka
CA Technologies
Sr Software Engineer
Tel:  +420226207796
[email protected]

