This is an issue with the same code as in my previous question, but now I
have got something really weird :-
Sorry for the long background explanation, but the process that I have
inherited (and hacked around with) is as follows :-
a) Server 1 (5 similar servers looking in different places) wakes up
   every 2 minutes and does the following :-
   - Looks for logs to upload to the SFS repository. Each log will be
     complete since midnight (not just new data).
   - Creates a user lock file for each log it finds
   - Uploads the log to a temporary name (in the SFS pool)
   - Erases the fn 1LOG file (if it exists)
   - Renames the fn LOG file to fn 1LOG (if it exists)
   - Renames the temporary file (just uploaded) to fn LOG
   - Removes the user lock file
   This complex procedure is because Server 2 might be examining the
   file.
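To make the sequence in (a) concrete, here is a minimal sketch in Python. It is not the real server code: an ordinary directory stands in for the SFS pool, and the names publish_log, fn.LOCK and fn.TEMP are invented purely for illustration.

```python
import os


def publish_log(pool_dir, fn, data):
    """Upload one log using the lock / temporary name / rename sequence.

    Assumptions: pool_dir is a plain directory standing in for the SFS
    pool, and the .LOCK and .TEMP names are hypothetical.
    """
    lock = os.path.join(pool_dir, fn + ".LOCK")
    temp = os.path.join(pool_dir, fn + ".TEMP")
    current = os.path.join(pool_dir, fn + ".LOG")
    previous = os.path.join(pool_dir, fn + ".1LOG")

    # Create the user lock file; mode "x" fails if it already exists.
    with open(lock, "x"):
        pass
    try:
        # Upload the complete log to a temporary name.
        with open(temp, "w") as f:
            f.write(data)
        # Erase fn 1LOG (if it exists).
        if os.path.exists(previous):
            os.remove(previous)
        # Rename fn LOG to fn 1LOG (if it exists).
        if os.path.exists(current):
            os.rename(current, previous)
        # Rename the temporary file to fn LOG. This is the step that
        # fails in the real system with "Object already exists".
        os.rename(temp, current)
    finally:
        # Remove the user lock file.
        os.remove(lock)
```

Note that a local rename behaves differently from an SFS RENAME, so this only shows the order of the steps, not the failure itself.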
b) Server 2 (10 servers, each dealing with different logs) wakes up
   every 50 seconds and does the following :-
   - Looks in the SFS repository for logs that have more lines than
     currently processed
   - If there is a user lock file then it ignores the file until next
     time
   - It scans each line (from the last one processed to the end) and
     will take various actions (such as cataloguing tapes used).
   - It then records a line pointer for how far it got.
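The reader side in (b) can be sketched the same way. Again this is only an illustration under assumptions, not the actual Server 2 code: scan_new_lines, the handle callback and the fn.LOCK name are hypothetical, and a plain directory stands in for the SFS repository.

```python
import os


def scan_new_lines(pool_dir, fn, last_line, handle):
    """Process lines added since last_line; return the new line pointer.

    Assumptions: pool_dir is a plain directory, the lock is a file
    called fn.LOCK, and handle is whatever action is taken per line.
    """
    lock = os.path.join(pool_dir, fn + ".LOCK")
    log = os.path.join(pool_dir, fn + ".LOG")

    # If there is a user lock file, ignore the log until next time.
    if os.path.exists(lock):
        return last_line

    with open(log) as f:
        lines = f.read().splitlines()

    # Nothing to do unless the log has more lines than already processed.
    if len(lines) <= last_line:
        return last_line

    # Scan from the last line processed to the end.
    for line in lines[last_line:]:
        handle(line)

    # Record how far we got.
    return len(lines)
```

The caller would keep the returned pointer per log and pass it back in on the next wake-up.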
Originally, Server 1 did not mess around with renames but just overwrote
the log in the SFS repository. We were getting a number of lock-ups, so
I changed it, and this has significantly improved the situation - but we
still get the occasional lock-up.
The code in Server 2 is very messy, delicate & sensitive so I don't want
to mess with it if I can avoid it.
What we are seeing now is that, occasionally, the Rename of the temporary
file to fn LOG gives RC=28 with DMSRND1311E Object already exists. In the
error handling routine I do a listfile and the object is not shown. I also
do a query locks and get a 'No Locks held'. Just in case of a timing
issue, I then retry the RENAME up to 5 times with a 5 second delay - with
no improvement.
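The retry logic in the error handling amounts to something like the following sketch. It is written in Python rather than taken from the real routine; retry_rename and its parameters are made-up names, and rename is assumed to be any callable that returns the command's return code (0 for success, 28 in the failing case).

```python
import time


def retry_rename(rename, retries=5, delay=5.0, sleep=time.sleep):
    """Retry a failed rename up to `retries` times, pausing before each try.

    Assumption: rename() returns 0 on success and a non-zero return
    code (such as 28) on failure. Returns the last return code seen.
    """
    rc = None
    for _ in range(retries):
        sleep(delay)       # wait in case it is only a timing issue
        rc = rename()
        if rc == 0:
            break          # the rename finally worked
    return rc
```

In the failing case every one of the 5 retries comes back with RC=28, so the loop simply runs to the end and returns 28.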
The really weird thing is that, once this happens, it will keep happening
for ever & a day until the server is recycled (IPL CMS). Then the
condition gets cleared.
I can understand that there may be an initial timing issue but, as no SFS
locks are shown, I cannot understand what it is that is not being cleared.
I would be very grateful if anyone can make any suggestions as to what may
be happening here.
Colin Allinson
VM Systems Support
Amadeus Data Processing GmbH