Hi,

One of our customers had a problem in their production batch jobs. One COBOL
program writes variable-length records to a VB PS data set, and after
running smoothly for ten years the program began to generate VB files with
invalid RDWs, as reported by DFSORT, which is used to copy the files.  We are
at z/OS 1.9.

I first wrote a program to read the physical blocks and parse each block
using the BDW/RDW mechanism. It showed that 97 blocks contain an invalid
RDW (RDW<4 or RDW>LRECL). No problem was found in the BDWs. They all match
the physical length of the corresponding block.
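
For anyone who wants to repeat the check, here is a minimal sketch of that
kind of block walk. It is in Python for illustration, not my actual program.
It assumes the standard 4-byte BDW and RDW (a 2-byte big-endian length that
includes the descriptor itself, followed by 2 control bytes) and ignores the
extended BDW format and spanned records. The name walk_block is made up.

```python
import struct

def walk_block(block: bytes, lrecl: int):
    """Walk one VB block via BDW/RDW and return (records, errors)."""
    errors = []
    # BDW: 2-byte block length (including the BDW itself) + 2 control bytes.
    bdw_len = struct.unpack(">H", block[0:2])[0]
    if bdw_len != len(block):
        errors.append(f"BDW {bdw_len} != physical length {len(block)}")

    records = []
    pos = 4  # first RDW follows the BDW
    while pos < len(block):
        if pos + 4 > len(block):
            errors.append(f"truncated RDW at offset {pos}")
            break
        # RDW: 2-byte record length (including the RDW itself) + 2 bytes.
        rdw_len = struct.unpack(">H", block[pos:pos + 2])[0]
        if rdw_len < 4 or rdw_len > lrecl or pos + rdw_len > len(block):
            errors.append(f"invalid RDW {rdw_len} at offset {pos}")
            break  # cannot locate the next record once the chain is broken
        records.append(block[pos + 4:pos + rdw_len])
        pos += rdw_len
    return records, errors
```

The walk stops at the first invalid RDW because every later record position
depends on the RDWs before it.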

To find the real problem I decided to parse the blocks in a different way.
The COBOL program builds the records as follows:

1. After the RDW, each record contains an 8-byte identifier string
'BGLGLGP ';
2. 120 bytes from the identifier string is a two-byte length field which
contains the length of the variable data of the record;
3. Then comes the variable-length part.

So I used the string 'BGLGLGP ' as a delimiter to separate the records and
then found each record's 'real' RDW value in the 4-byte area just before the
identifier string.  I also used the two-byte length field set by the user
program to calculate the expected RDW value.
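
A rough sketch of this second pass, again in Python and not my real
program. Several things here are assumptions for illustration: that the
length field sits at offset 120 counted from the start of the identifier,
that the variable part follows it immediately (so the fixed part is 122
bytes), that the data is EBCDIC (cp037), and that the identifier never
appears inside record data. The name check_by_identifier is made up.

```python
import struct

IDENT = "BGLGLGP ".encode("cp037")  # identifier in EBCDIC (assumed code page)
LEN_OFFSET = 120                    # assumption: counted from identifier start
FIXED_LEN = LEN_OFFSET + 2          # assumption: variable part follows at once

def check_by_identifier(block: bytes):
    """Return (rdw_offset, rdw_value, expected_rdw, actual_length) tuples."""
    # Locate every identifier; the earliest possible one is after BDW + RDW.
    starts = []
    pos = block.find(IDENT, 8)
    while pos != -1:
        starts.append(pos)
        pos = block.find(IDENT, pos + 1)

    results = []
    for i, start in enumerate(starts):
        rdw_pos = start - 4  # the 4 bytes just before the identifier
        rdw = struct.unpack(">H", block[rdw_pos:rdw_pos + 2])[0]

        # Expected RDW from the length field set by the application.
        len_pos = start + LEN_OFFSET
        if len_pos + 2 <= len(block):
            var_len = struct.unpack(">H", block[len_pos:len_pos + 2])[0]
            expected = 4 + FIXED_LEN + var_len
        else:
            expected = None  # record cut off before the length field

        # Actual length: up to the next record's RDW, or to end of block.
        end = starts[i + 1] - 4 if i + 1 < len(starts) else len(block)
        results.append((rdw_pos, rdw, expected, end - rdw_pos))
    return results
```

A record is suspect when the last value in its tuple differs from the RDW
value, even though the RDW itself agrees with the expected value.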

This time the result was more enlightening. All 97 'bad' blocks follow the
same pattern (the following is simplified, just to illustrate):

 -------------------------------------
| BDW  | RDW1 |   REC1 | RDW2 | REC2  |
 -------------------------------------
1. The BDW matches the actual length of the block;
2. RDW1 and RDW2 both match the 'should-be' RDW value calculated from the
length field set by the application program.

The problem lies in the actual record length.  For REC1, its actual length
is shorter or longer than RDW1 indicates!

When DFSORT or another application like ISPF parses the block via the RDWs,
it finds no problem with record 1 because the value of RDW1 is valid. But it
gets the wrong start address for record 2 from RDW1, because the actual
length of REC1 is not what RDW1 indicates.  That means it picks the wrong
area as RDW2, and the value found there is unpredictable.  Thus the
so-called 'invalid RDW'.
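
To make the effect concrete, here is a tiny made-up example in Python. The
block contents and the function names are invented for illustration; only
the RDW chaining rule is real. RDW1 says 12, but REC1 holds 6 data bytes
instead of 8, so a reader following RDW1 looks for RDW2 two bytes too far.

```python
import struct

def build_demo_block() -> bytes:
    """Build a block whose REC1 is 2 bytes shorter than RDW1 claims."""
    rdw1 = struct.pack(">HH", 12, 0)   # claims 4 + 8 data bytes
    rec1 = b"\xC1" * 6                 # but only 6 data bytes are present
    rdw2 = struct.pack(">HH", 10, 0)   # correct for 4 + 6 data bytes
    rec2 = b"\xC2" * 6
    payload = rdw1 + rec1 + rdw2 + rec2
    bdw = struct.pack(">HH", len(payload) + 4, 0)  # BDW matches the block
    return bdw + payload

def next_rdw_via_chain(block: bytes, first_rdw_pos: int = 4):
    """Follow RDW1 the way a normal reader would and return what it finds."""
    rdw1 = struct.unpack(">H", block[first_rdw_pos:first_rdw_pos + 2])[0]
    pos2 = first_rdw_pos + rdw1        # where the reader expects RDW2
    rdw2 = struct.unpack(">H", block[pos2:pos2 + 2])[0]
    return pos2, rdw2
```

In this example the real RDW2 sits at offset 14 with value 10, but the
reader lands on offset 16 and picks up the two zero bytes that end the real
RDW2, giving a length of 0, which is less than 4 and therefore invalid.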

How does this happen? I can only think of one situation, that is, when QSAM
is building the block.  After moving record 1 into the block, record 2
should be placed right after record 1. If for some reason QSAM fails to do
so, record 2 will overlay part of record 1 or be placed some distance past
the end of record 1.  That would generate a 'bad' block exactly like what we
saw in this case.

To make things more complicated, the customer found that one of their
channel paths could not be varied online when they re-IPLed the system. So
they temporarily stopped using that path, and the problem suddenly
disappeared.

Now the customer believes the problem is caused by that channel path, and
they have reported the problem to IBM.

For me it's hard to accept, because I cannot see the relationship. A write
operation does involve interaction with the channel subsystem, but from my
limited knowledge, once QSAM has finished building the block, all the
remaining operations are at block level and do not touch the inner structure
of the block.  It's nearly impossible for them to generate a 'bad' block
exactly like the ones we saw in this case, even if truncation or overlay
occurs.  So I still believe the cause of the problem is that QSAM somehow
failed to build the block correctly.

Can anyone give me some hints?


-- 
Best Regards,
Johnny Luo

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html
