Hi,

One of our customers had a problem in their production batch jobs. A COBOL program writes variable-length records to a VB PS data set. After running smoothly for ten years, the program began to generate VB files with invalid RDWs, as reported by DFSORT, which is used to copy the files. We are on z/OS 1.9.
I first wrote a program to read the physical blocks and parse each block using the BDW/RDW mechanism. It showed that 97 blocks contain an invalid RDW (RDW < 4 or RDW > LRECL). No problem was found in the BDWs; each one matches the physical length of its block.

To find the real problem I decided to parse the blocks in a different way. The COBOL program builds the records as follows:

1. After the RDW, each record contains an 8-byte identifier string 'BGLGLGP ';
2. 120 bytes from the identifier string is a two-byte length field which contains the length of the variable data of the record;
3. Then comes the variable-length part.

So I used the string 'BGLGLGP ' as a delimiter to separate the records and took each record's 'real' RDW value from the 4-byte area just before the identifier string. I also used the two-byte length field set by the user program to calculate the expected RDW value.

This time the result was more enlightening. All 97 'bad' blocks follow the same pattern (the following is simplified just to illustrate):

-------------------------------------
| BDW | RDW1 | REC1 | RDW2 | REC2 |
-------------------------------------

1. The BDW matches the actual length of the block;
2. RDW1 and RDW2 both match the 'should-be' RDW value calculated from the length field set by the application program.

The problem lies in the actual record length. For REC1, the actual length is shorter or longer than RDW1 indicates! When DFSORT or another application such as ISPF parses the block via the RDWs, it finds no problem with record1 because the value of RDW1 is valid. But it then derives the wrong start address for record2 from RDW1, because the actual length of REC1 is not what RDW1 indicates. That means it picks up the wrong area as RDW2, and the value there is unpredictable. Hence the so-called 'invalid RDW'.

How does this happen? I can think of only one situation: when QSAM is building the block. After moving record1 into the block, record2 should be placed right after record1.
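To make the two parsing approaches concrete, here is a rough sketch in Python (not the program I actually ran). It assumes a non-extended 4-byte BDW and unspanned records whose RDW holds the record length in its first two bytes, and it assumes the identifier never occurs inside record data. The function names, the test data (ASCII bytes rather than EBCDIC) and the extra "record runs past the end of the block" check are my own illustration only. The expected-RDW calculation from the two-byte length field is left out because it depends on the exact record layout.

```python
import struct

ID = b"BGLGLGP "  # 8-byte identifier; real data would be EBCDIC, ASCII used here


def walk_by_rdw(block, lrecl):
    """Parse a VB block the usual way: follow each RDW to the next record."""
    bdw_len = struct.unpack(">H", block[:2])[0]  # block length, BDW included
    pos, out = 4, []
    while pos < bdw_len:
        if pos + 4 > bdw_len:  # not even room for an RDW
            out.append((pos, None, "INVALID"))
            break
        rdw = struct.unpack(">H", block[pos:pos + 2])[0]  # length, RDW included
        # RDW < 4 or RDW > LRECL are the checks from the post;
        # "runs past the block end" is an additional assumption of this sketch.
        if rdw < 4 or rdw > lrecl or pos + rdw > bdw_len:
            out.append((pos, rdw, "INVALID"))
            break
        out.append((pos, rdw, "ok"))
        pos += rdw
    return out


def scan_by_identifier(block):
    """Locate records by the identifier string instead of trusting the RDWs.

    Returns (record offset, RDW value, actual distance to next record or block end).
    """
    bdw_len = struct.unpack(">H", block[:2])[0]
    starts = []
    i = block.find(ID, 8)  # earliest possible identifier: after BDW + RDW
    while i != -1:
        starts.append(i - 4)  # the RDW is the 4 bytes just before the identifier
        i = block.find(ID, i + len(ID))
    out = []
    for n, s in enumerate(starts):
        rdw = struct.unpack(">H", block[s:s + 2])[0]
        end = starts[n + 1] if n + 1 < len(starts) else bdw_len
        out.append((s, rdw, end - s))
    return out


def make_record(payload):
    """Hypothetical helper: RDW + identifier + payload."""
    data = ID + payload
    return struct.pack(">HH", len(data) + 4, 0) + data


def make_block(body):
    """Hypothetical helper: BDW that matches the physical length of the block."""
    return struct.pack(">HH", len(body) + 4, 0) + body


rec1 = make_record(b"A" * 20)  # RDW1 = 32
rec2 = make_record(b"B" * 30)  # RDW2 = 42

good = make_block(rec1 + rec2)
# Simulate the suspected fault: record2 placed 6 bytes too early,
# overlaying the tail of record1. The BDW still matches the block length.
bad = make_block(rec1[:-6] + rec2)

print(walk_by_rdw(good, 100))
print(walk_by_rdw(bad, 100))
print(scan_by_identifier(bad))
```

In the simulated bad block, the RDW walk accepts record1 and then reads part of the identifier string as RDW2, which fails the LRECL check. The identifier scan shows both RDWs intact but record1 occupying fewer bytes than RDW1 claims, which is the pattern seen in the 97 blocks.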
If for some reason QSAM failed to place record2 right after record1, record2 would overlay part of record1 or be placed too far behind it. That would generate a 'bad' block exactly like the ones we saw in this case.

To make things more complicated, the customer found that one of their channel paths could not be varied online when they tried to re-IPL the system. So they temporarily stopped using that path, and the problem suddenly disappeared. Now the customer believes the problem was caused by that channel path, and they have reported it to IBM.

For me this is hard to accept because I cannot see the relationship. A write operation does involve interaction with the channel subsystem, but to my limited knowledge, once QSAM has finished building the block, all remaining operations are at block level and do not touch the inner structure of the block. It is nearly impossible for them to generate a 'bad' block exactly like the ones we saw in this case, even if truncation or overlay occurs. So I still believe the cause of the problem is that QSAM somehow failed to build the block correctly.

Can anyone give me some hints?

--
Best Regards,
Johnny Luo

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@bama.ua.edu with the message: GET IBM-MAIN INFO
Search the archives at http://bama.ua.edu/archives/ibm-main.html