Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS 32bit issue) signal is 11

patrick zandi Fri, 23 Sep 2011 11:23:10 -0700

Yes: ITSM 7.1 P6 is running.

I am looking to upgrade patch levels, without any promises, but I have to
test on a dev system first, which is another issue in itself...


Thanks guys.. I am glad you all caught one thing I did not check.. and I am
not out of my mind.. yet!


On Fri, Sep 23, 2011 at 2:07 PM, Ben Chernys <
[email protected]> wrote:

> OK.  So, it ain’t a denied malloc.  You got a bug that somehow you are
> exposing and seemingly no one else is.  Tough one.  BMC could give you a
> debug build, you’d generate a core, and give it to some BMC folks.  That
> would require some very high level of support I would assume (though I have
> seen that ages ago).
>
> I still think the best bet is to trace and hunt around filters to see if
> anything unusual is the culprit.  As Mark said, make sure your system is
> taking cores.  You have a process ending with SIGSEGV; that always ends up
> with a core.  I’ve never seen a system where it didn’t but then I am no Unix
> admin.
>
> There are other options such as plugging in stuff to log system status at
> intervals etc.  But your best bet is definitely to turn on logging and
> review logs.  Try to reproduce this in a safe (non-production) environment.
>
> BTW, the OS or armonitor doesn’t terminate the process.  This is a hardware
> interrupt that the OS catches.  ARS will have a signal handler and that gets
> control.  That is what prints the Signal 11 trace.
>
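A minimal Python sketch of the sequence Ben describes, not arserverd's actual handler: a child process installs a handler for signal 11, the handler writes a trace line, restores the default action and re-raises, and the parent (standing in for armonitor) sees the child die with 11. The signal is simulated with os.kill rather than caused by a real bad memory access, the names CHILD_SOURCE and run_child are made up for illustration, and a Unix-like system is assumed.

```python
import signal
import subprocess
import sys

# Source of the child process. It stands in for a server with its own
# SIGSEGV handler; the signal is simulated with os.kill, not a real fault.
CHILD_SOURCE = r"""
import os, resource, signal, sys, time

# Keep this demo from writing a core file into the current directory.
_, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))

def on_segv(signum, frame):
    sys.stderr.write("caught signal %d\n" % signum)
    sys.stderr.flush()
    # Restore the default action and re-raise so the process really dies
    # with signal 11, which is what a monitoring parent then observes.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)

signal.signal(signal.SIGSEGV, on_segv)
os.kill(os.getpid(), signal.SIGSEGV)
time.sleep(5)  # gives the interpreter a chance to run the handler
"""

def run_child():
    """Run the child and return (terminating signal number, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True, text=True, timeout=30,
    )
    # subprocess reports death by signal N as returncode -N.
    return -proc.returncode, proc.stderr

if __name__ == "__main__":
    signum, trace = run_child()
    print("child died with", signum, "and logged:", trace.strip())
```

The point of the sketch is only the ordering: the handler's output appears first, and the parent learns about the signal afterwards from the exit status.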
> Are you running ITSM?  You didn’t say.  But code paths within the arserver
> will vary even with the same ARS applications for so many reasons as to make
> it astronomical.
>
> Were there any filter / db / environment changes just before the symptoms
> started showing up?  These are good areas to investigate as well.
>
> These types of problems can be solved (circumvented) quickly or can take
> forever (i.e. never get solved).  But they are quite involved and interesting.
>
> Good luck!
>
> Ben
>
> *From:* Action Request System discussion list(ARSList) [mailto:
> [email protected]] *On Behalf Of *patrick zandi
> *Sent:* September-23-11 15:38
>
> *To:* [email protected]
> *Subject:* Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
> 32bit issue) signal is 11
>
> What is perplexing is that there are no core dumps.
> We are running fine and then, boom, we get that signal, and armonitor shuts
> arserverd down and starts a new one.
> What I was hoping was that someone would say "I remember that, and patch
> X, Y, Z fixed it."
>
> I turned on complete logging, and the log simply stops and then starts again
> (nothing logged around the signal and the startup).  Clear.  Clean.
>
> I looked at all the patches and found the following, but cannot put a
> finger on any one of them specifically (NL = not likely, ML = most likely):
>
> SW00351599       The AR System server crashed when a filter used converted
> values in a Set Field action.
>
> SW00351922       The AR System server program terminated when building a
> userList to send notifications.
>
> ML-SW00356647    Too many filters executed recursively causing a stack
> overflow, which resulted in failure of the AR System server.
>
> SW00328336       The AR System server crashed while saving text to a Diary
> field on which audit was enabled.
>
> SW00328337       arserverd crashed while attempting to communicate with the
> plug-in server through PluginServerCallWithRetry.
>
> SW00337127       A memory leak issue occurred in the AR System server.
>
> NL-SW00338411    The AR System server crashed during an archiving
> process if the archive source form contained non-data fields such as Text,
> Trim, Button, and so on.
>
> NL-SW00346370    The AR System server crashed while processing an error
> handling filter.
>
> SW00314816       When a user performed a search on a Join form, but did not
> have permissions to view all records returned in the result, the AR System
> server crashed.
>
> NL-SW00322802    Creating an entry with a user name larger than 180
> bytes caused the AR System server to crash when the status history was being
> recorded and the initial status was not New.
>
> On Fri, Sep 23, 2011 at 5:09 AM, Ben Chernys <
> [email protected]> wrote:
>
> Hi Mark, Patrick,
>
> Signal 11 is SIGSEGV, which is not necessarily a malloc failure, though
> indeed a malloc failure may lead to it.  It is not always possible to log
> malloc failures; after all, it takes some memory to cut a log record.
>
> A segmentation violation is always the result of bad code (accessing memory
> not allocated to the process or not in the process's address space; address
> 0, malloc’s return value on failure, is a prime candidate).
>
> That being said, it is possible to avoid triggering the execution path with
> that bad code by altering filters etc., so definitely the route to go is along
> the lines that Mark described: the core is always a wealth of info, even
> though ARS will not have debugging compiled in ;-)  I would also turn on all
> logging (SQL, API, and Filter on the server), unlimited, and all pointing to
> the same file, until the next occurrence.  Then you will have a wealth of ARS
> information to go through.  Generally something will stand out.
>
> Recursive filter loops are usually trapped by the maximum filter limit –
> though if that is set high enough the process will run out of memory before
> hitting up against that.  If yours is high, you could try setting it lower.
>
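A rough Python sketch for checking the maximum filter limit, assuming it is the Filter-Max-Stack entry in ar.conf that Mark names in his message and assuming ar.conf uses a "Name: value" layout with "#" comment lines. The function names are made up for illustration, and the threshold of 50 is simply Mark's suggested value.

```python
import re

def filter_max_stack(conf_text):
    """Return the Filter-Max-Stack value found in ar.conf text, or None."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        match = re.match(r"Filter-Max-Stack\s*:\s*(\d+)\s*$", stripped)
        if match:
            return int(match.group(1))
    return None

def looks_too_high(value, threshold=50):
    """True when a configured value exceeds the suggested threshold."""
    return value is not None and value > threshold

if __name__ == "__main__":
    # Hypothetical ar.conf fragment, not taken from a real server.
    sample = "Server-Name: test71\nFilter-Max-Stack: 10000\n"
    value = filter_max_stack(sample)
    print("Filter-Max-Stack =", value, "| too high:", looks_too_high(value))
```

A missing entry returns None, which means the server default applies; the sketch does not guess what that default is.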
> You may also want to go to a higher patch level if one is available.  I am
> no longer that familiar with the patches available on 7.1.
>
> Also, I know that memory on Solaris may be restricted by the admin.  (I
> forget the commands to determine this, but they will be easily found on the
> web.)  ulimit, perhaps?
>
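A small Python sketch along the lines of the ulimit suggestion, using the standard resource module to list the soft and hard limits that apply to the current process. It reports the limits of the Python process itself, so it would only be informative when run as the same user and from the same environment that starts arserverd; that correspondence is an assumption, and the helper names are made up.

```python
import resource

# Limits most relevant to a crashing server process. RLIMIT_AS is not
# defined on every platform, so the names are looked up defensively.
LIMIT_NAMES = ["RLIMIT_CORE", "RLIMIT_DATA", "RLIMIT_STACK", "RLIMIT_AS"]

def current_limits():
    """Return {name: (soft, hard)} for the limits this platform defines."""
    limits = {}
    for name in LIMIT_NAMES:
        const = getattr(resource, name, None)
        if const is not None:
            limits[name] = resource.getrlimit(const)
    return limits

def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
```

A soft RLIMIT_CORE of 0 would be one explanation for a SIGSEGV that leaves no core file behind.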
> Cheers
>
> Ben Chernys
>
> Senior Software Architect
> Software Tool House Inc.
>
> Canada / Deutschland / Germany
> Mobile:      +49 171 380 2329    GMT + 1 + [ DST ]
> Email:       Ben.Chernys _AT_ 
> softwaretoolhouse.com<[email protected]>
> Web:         www.softwaretoolhouse.com
>
> Check out Software Tool House's free Diary Editor.
>
> *Meta-Update,* our premium ARS Data tool, lets you automate
> your imports, migrations, *in no time at all*, without programming,
> without staging forms, without merge workflow.
> http://www.softwaretoolhouse.com/
>
> *From:* Action Request System discussion list(ARSList) [mailto:
> [email protected]] *On Behalf Of *Walters, Mark
> *Sent:* September-23-11 09:08
> *To:* [email protected]
> *Subject:* Re: ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
> 32bit issue) signal is 11
>
> It may be memory, but I would expect to see malloc errors (ARERR 300) in the
> arerror.log if this was the case.  The fact that you’re not seeing a stack
> trace like this:
>
> Mon Sep 20 08:33:52 2010     6
>   Timestamp: Mon Sep 20 2010 08:33:52.1865
>   Thread Id: 4
>   Version: 7.1.00 Patch 009 201009200800
>   ServerName: test71
>   Database: SQL -- Oracle
>   Hardware: sun4u
>   OS: SunOS 5.10
>   RPC Id: 337
>   RPC Call: 106 (GLXS)
>   RPC Queue: 390600
>   Client: User Demo from Remedy Administrator (protocol 13) at IP address 192.168.1.54
>   Form:
>   Logging On:
>
> suggests it may be a recursive filter – on Solaris this often causes a
> crash without logging anything useful.  Check to see whether there are any
> core files in the server/bin directory as this is another symptom of this
> type of crash on Solaris.  If cores are enabled (check with the OS coreadm
> command) then the server may create them even though you’re not running a
> debug build.
>
> If you do have some core files then run the pstack command against them
> (pstack core) and you will be able to see the stack of each thread within
> the server – if it is a recursive filter causing a stack overflow then one
> of the threads should stand out as being much bigger than the others.
> Depending on what you see you may then need to enable FILTER/SQL logging to
> try and capture the workflow that is causing the crash.  It’s also worth
> checking the Filter-Max-Stack value in ar.conf – various installers set this
> to a very high value – try reducing it back down to 50 or so and this should
> stop most filter recursion crashes and log an error instead.
>
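To spot the thread that "stands out" in pstack output, something like the following Python sketch could count frames per thread. It assumes the Solaris pstack layout in which each thread starts with a dashed "lwp# N / thread# N" header line followed by one line per frame. The sample text, its addresses, and its function names are invented for illustration and do not come from a real arserverd core.

```python
import re
from collections import Counter

# Assumed header layout: "-----  lwp# 1 / thread# 1  -----"
HEADER = re.compile(r"^-+\s+lwp#\s*(\d+)\s*/\s*thread#\s*(\d+)\s+-+\s*$")

def frames_per_thread(pstack_text):
    """Count stack-frame lines under each thread header."""
    counts = Counter()
    current = None
    for line in pstack_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = int(match.group(2))
            counts[current] = 0
        elif current is not None and line.strip():
            counts[current] += 1
    return counts

def deepest_thread(pstack_text):
    """Return (thread id, frame count) for the deepest stack, or None."""
    counts = frames_per_thread(pstack_text)
    if not counts:
        return None
    return counts.most_common(1)[0]

# Invented sample; a recursive filter would show far more repeated frames.
SAMPLE = """core 'core' of 15277: ./arserverd
-----------------  lwp# 1 / thread# 1  --------------------
 00000001 frame_a (0, 0)
 00000002 frame_b (0, 0)
-----------------  lwp# 2 / thread# 2  --------------------
 00000003 frame_c (0, 0)
 00000003 frame_c (0, 0)
 00000003 frame_c (0, 0)
 00000003 frame_c (0, 0)
 00000004 frame_d (0, 0)
"""

if __name__ == "__main__":
    print(deepest_thread(SAMPLE))
```

In practice the text would come from saving the output of `pstack core` to a file and reading it in.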
> Mark
>
> I work for BMC, I don’t speak for them.
>
> *From:* Action Request System discussion list(ARSList) [mailto:
> [email protected]] *On Behalf Of *patrick zandi
> *Sent:* 22 September 2011 21:07
> *To:* [email protected]
> *Subject:* ARS 7.1 P6 Server -- 4 days restarting (possible memory OS
> 32bit issue) signal is 11
>
> Just a quick question.  With ARS 7.1 P6 on Solaris 10, I am seeing what
> looks like the operating system telling the AR Server to shut down about
> every 4-6 days.  I am not positive; there is nothing in the debug logs at
> all, only in ARMONITOR.log, where it says:
>
> 2011     ARMonitor child process (pid:15277) died with 11. And the signal
> is 11.
>
> ./arserverd
>
> Can I assume signal 11 is memory?  I have seen a lot of memory issues
> with signal 11 on the ARSList...
>
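A short Python sketch for pulling these events out of the armonitor log and naming the signal. The regular expression is based only on the single line quoted in the original post, so the exact wording of other armonitor versions is an assumption, and parse_deaths is a made-up name. The signal name is looked up on the machine running the script; 11 maps to SIGSEGV on Solaris and Linux.

```python
import re
import signal

# Pattern taken from the one log line quoted in the original post.
DIED = re.compile(r"ARMonitor child process \(pid:(\d+)\) died with (\d+)")

def parse_deaths(log_text):
    """Return a list of (pid, signal number, signal name) per child death."""
    results = []
    for match in DIED.finditer(log_text):
        pid, signum = int(match.group(1)), int(match.group(2))
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"  # number not defined on this platform
        results.append((pid, signum, name))
    return results

if __name__ == "__main__":
    line = ("2011     ARMonitor child process (pid:15277) died with 11. "
            "And the signal is 11.")
    for pid, signum, name in parse_deaths(line):
        print(f"pid {pid} died with signal {signum} ({name})")
```

Run over the whole log, the list of matches gives the restart history, which makes the 4-6 day interval easy to check.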
>
> --
> Patrick Zandi
> _attend WWRUG11 www.wwrug.com ARSlist: "Where the Answers Are"_
>
>
>
>
> --
> Patrick Zandi
>



-- 
Patrick Zandi

_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"
