Pawel (privately) wrote:

> Hi guys,
>
> We are trying to find out details (locate processes=jobs which utilise
> so much memory).
>
> We did not apply "export MALLOCTYPE=buckets" to .profiles as HelpDesk
> suggested. This option is exported only on our one test area, which is
> rarely processed.
>
> I read http://www.redbooks.ibm.com/redbooks/pdfs/sg247463.pdf and:
...
> It seems to me that we should try what Jim suggested: "export
> MALLOCTYPE=watson" and perhaps some MALLOCOPTIONS settings.
>

Yep, start without options though. I see your next post, but I will address that in a reply to that post.

> Let's leave it for the second, because something interesting shown on
> our LIVE area today in the very morning. Session that runs TSM produced
> following output:
> jsh techuser ~ -->START.TSM
> START.TSM
> Phantom process started on process id 2449420
> [2449420] Done : tSA 1
> jsh techuser ~ -->Process ID 1232928 , port 794 , hangup
> Program source name CLEAR.TOKENS , line 26
> Recursive debugger calls - program aborting
>

Yeah - there is some serious problem going on, but to be honest this is probably just out of memory again. There is very little you can do as a programmer once you are out of memory. Even trying to print a message will run out of memory unless you know you can free some. Some people guard against this by keeping a small allocation of memory to be freed in case of an abort, in the hope they can use it to get enough memory to recover or show a message.
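For what it is worth, that reserve trick can be sketched in C roughly as below. This is only an illustration under my own assumptions, not anything jBASE does internally: the names reserve_init, reserve_release, checked_malloc and RESERVE_BYTES are made up, and the 64 KB size is an arbitrary guess.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical size of the emergency reserve; tune to taste. */
#define RESERVE_BYTES (64 * 1024)

static void *emergency_reserve = NULL;

/* Call once at startup, while memory is still plentiful.
 * Returns 0 on success, -1 if even the reserve could not be allocated. */
int reserve_init(void)
{
    emergency_reserve = malloc(RESERVE_BYTES);
    if (emergency_reserve == NULL)
        return -1;
    /* Write to the block so the pages are actually touched now,
     * rather than on first use during the emergency. */
    memset(emergency_reserve, 0, RESERVE_BYTES);
    return 0;
}

/* Hand the reserve back to the allocator.
 * Returns 1 if memory was released, 0 if the reserve was already gone. */
int reserve_release(void)
{
    if (emergency_reserve == NULL)
        return 0;
    free(emergency_reserve);
    emergency_reserve = NULL;
    return 1;
}

/* malloc wrapper: on failure, free the reserve and retry exactly once.
 * A NULL return means the caller should report the error and stop. */
void *checked_malloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL && reserve_release())
        p = malloc(n);
    return p;
}
```

The reserve can only be spent once, so the most it buys you is enough room to print a message and exit cleanly. It does nothing about whatever is eating the memory in the first place.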

The key to this is the recursive debugger calls. Being out of memory, the program tried to access a memory pointer that wasn't valid (it should not really do this, but as you can't do anything useful at this point, it is moot). There is a trap routine that sees this invalid access, aborts the program and enters the debugger. However, being out of memory, the debugger itself tries to use some memory and accesses an invalid pointer, so the trap triggers again and tries to enter the debugger, where it detects the recursion and just aborts as there is no other option.

> Process ID 5054700 , port 794 , hangup
> Program source name CLEAR.TOKENS , line 27
> Recursive debugger calls - program aborting
> jBASE: Segmentation violation. Aborting
>

This is the out of memory stuff.

> cp: ../bnk.data/int.data/DM.TEMP/DC.CARD.ISSUE.HIS.DM: No such file or
> directory
>

This is a bug in your application, which is not detecting that the program did not finish correctly and is trying to copy its results anyway.

> jBASE: Segmentation violation. Aborting
> jBASE: Attempting to free NULL pointer at
> jediTransaction.c,1636(EB.TRANS.JBASE,
> 26)
> jsh techuser ~ -->Process ID 7737374 , port 805 , hangup
> Program source name F.READ , line 7
>

This is basically all the same thing. I will defer until answering your next message, which is where I think the problem is.

> You can ignore hangups, but I am worried about these jBASE errors
> (Segmentation violation / Attempting to free NULL pointer at
> jediTransaction.c,1636). This does not sound good to me. We do not know
> which processes thrown these messages, but likely they were COB agents.
>

It is just that you were out of memory and things are trying to clean up. These are the symptoms, not the cause.

> I do not know yet wheter physical / swap memory run out yesterday on
> PROD, but it quite unlikely (total memory of LIVE system is 2-3 times
> bigger than on test machines).
>
> I would like to mention one fact from the past.
>
> During "start of year" (2nd January) processing we faced 1 "little"
> problem:
> a) one of the single threaded jobs did a large transaction (over 900k of
> changes) - we have already requested to improve this core EOY job
>

Did that happen?

> b) then later one of the batch sessions (agent) failed with
> SUBROUTINE_CALL_FAIL error. There was nothing wrong with our libraries -
> called object was there and routine that failed was successfully called
> by other COB agents. Only one COB agent noted SUBROUTINE_CALL_FAIL
> error, which seemed to be very strange. We have raised that and CSHD
> conclusion was: "agent run out of shared memory"

Shared memory is not used for this. That was either their mis-description of the problem or you are recalling what happens on UniVerse ;-). Basically this can only really happen when there is no real memory (which, to confuse things, is called virtual :-)) for the process to map in the subroutine, allocate the descriptor for it and so on. All your problems are likely caused by the answer to the next post you made.

> (ulimit is unlimited on
> LIVE) so use "slibclean" periodically to reclaim memory.
>

This is probably what they mean. There is a fundamental design issue with AIX and when it feels it can get rid of shared objects. Whatever IBM try to claim about this, they avoid the question "Hmm, then why does no other UNIX suffer from this issue?"

> I think now that agent which failed on 2nd January performed in previous
> steps large transaction, means allocated large "transaction buffer" and
> finally got SUBROUTINE_CALL_FAIL on one of the following jobs (not
> immediately).
> That is why I suggested that "transaction buffer" may not get downsized
> or leaks memory. I also guess that it may not be a leak, but default
> MALLOC allocator fault.

Yep. But of course something is causing you to need huge amounts of memory.
I suspect that it is a bug in JQL, which we can find a workaround for (am I on the clock yet?) in the next email.

> I am not sure if Watson will help, but reading
> Jim's emails we will give it try.
>

It will definitely help generally, but it is more likely to expose the real problem, which, reading your next email, it seems it has :-)

Jim
