Re: Prod server down - services will not stay up

2009-11-13 Thread Ben Chernys
The signal 11 is bad code - simple as that.  It's a segmentation violation,
which means that the server (arserverd) attempted to read or write an
address not mapped into its virtual address space.  It can also be caused by
a double free or by two pointers to one block which has been freed.  In any
event, you cannot fix this without the ARS source code, which I expect you
would find hard to get.
 
That being said, the easiest way to determine (and then circumvent) these
types of things is to turn on SQL logging on the server before the system
starts (through the ar.conf file).  The exact settings are in the
configuring ARS guide.
 
Then, when the blow up happens, see what the server was attempting to do.
You can usually spot some possible internal database inconsistencies (in ARS
meta-data) in this way and then repair them manually through SQL before the
ARS start-up.
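
As a rough illustration of "see what the server was attempting to do", the sketch below reads only the final lines of a large log file. It is a minimal Python example under stated assumptions: the helper name last_lines is made up here, and the real SQL log path (whatever is configured in ar.conf) is not given in this thread, so the demonstration runs against a throwaway file.

```python
import os
import tempfile
from collections import deque


def last_lines(path, count=50):
    """Return the final `count` lines of a text file, reading it as a stream."""
    with open(path, "r", errors="replace") as handle:
        # A deque with maxlen keeps only the newest `count` lines in memory,
        # so even a very large log is never loaded whole.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


# Demonstration against a temporary file (hypothetical content). A real run
# would point at the SQL log path configured for the server.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("\n".join(f"statement {n}" for n in range(1, 1001)))
    demo_path = tmp.name

tail = last_lines(demo_path, count=3)
os.remove(demo_path)
print(tail)
```

Streaming through a bounded deque is one simple way to cope with logs that are, as noted later in the thread, "quite big"; the last few statements before the crash are usually the interesting part.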
 
Additionally, there may be patches available that address the problem.
 
Cheers
Ben Chernys
 
 

  _  

From: Action Request System discussion list(ARSList)
[mailto:arsl...@arslist.org] On Behalf Of Susan Palmer
Sent: November 13, 2009 8:30 PM
To: arslist@ARSLIST.ORG
Subject: Prod server down - services will not stay up


** 
Help !!
 
Working with support but could use anyone else's input.  I'm at WWRUG so
it's somewhat limiting.
 
We did a truss log and when the services drop (arerror 91) we see the
following:
167
/11:read(54, \0FE\0\006\0\0\0\0\01017.., 2064)= 254
/11:write(54, \0A1\0\006\0\0\0\0\003 ^.., 161)= 161
/11:read(54, \0F7\0\006\0\0\0\0\01017.., 2064)= 247
/11:Incurred fault #6, FLTBOUNDS  %pc = 0xFE6A3558
/11:  siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
/11:Received signal #11, SIGSEGV [caught]
/11:  siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
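
The fault details in the truss excerpt above can be pulled out mechanically. The sketch below is a minimal Python illustration, assuming the "/11:" prefix is truss's per-LWP (thread) label and that lines follow the shapes shown above. The function name summarize_faults is invented for this example.

```python
import re

# Sample lines copied from the truss output above.
TRUSS_LINES = [
    "/11:Incurred fault #6, FLTBOUNDS  %pc = 0xFE6A3558",
    "/11:  siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C",
    "/11:Received signal #11, SIGSEGV [caught]",
]

FAULT_RE = re.compile(
    r"^/(\d+):\s*Incurred fault #(\d+), (\w+)\s+%pc = (0x[0-9A-Fa-f]+)"
)
SIGINFO_RE = re.compile(
    r"^/(\d+):\s*siginfo: (\w+) (\w+) addr=(0x[0-9A-Fa-f]+)"
)


def summarize_faults(lines):
    """Collect fault and siginfo details per LWP id from truss output lines."""
    summary = {}
    for line in lines:
        match = FAULT_RE.match(line)
        if match:
            lwp, _number, name, pc = match.groups()
            summary.setdefault(lwp, {}).update(fault=name, pc=pc)
            continue
        match = SIGINFO_RE.match(line)
        if match:
            lwp, signal, code, addr = match.groups()
            summary.setdefault(lwp, {}).update(signal=signal, code=code, addr=addr)
    return summary


result = summarize_faults(TRUSS_LINES)
print(result)
```

Under that assumption, the leading "/11:" is a thread label, separate from "signal #11", which is the number of SIGSEGV itself. SEGV_MAPERR indicates the faulting address was not mapped at all.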
 
The services do restart automatically so armonitor is doing its job.  We've
commented out everything from armonitor but the arserverd command.
 
We stay up for between 2-10 minutes and then wham, we're down again.
Obviously this just started this morning.
 
unix sun solaris 10
oracle 10g
ars 7.0.1P2
 
They did expand the database size last night if that has any bearing.  But
we can connect to the database successfully when ar is down.
 
Nothing else helpful in arerror.log, only 91 error.
 
I'm at the Hardrock hotel, call room 30601 if you have questions or can
help!
 
Thanks,
Susan
 
 
 

 
_Platinum Sponsor: rmisoluti...@verizon.net ARSlist: Where the Answers
Are_ 

___
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
Platinum Sponsor:rmisoluti...@verizon.net ARSlist: Where the Answers Are


Re: Prod server down - services will not stay up

2009-11-13 Thread Ben Chernys
 
PS.  The 91 is a red herring.  It's the Sig 11 (SEGV) you need to worry
about.   The 91 is another process not being able to communicate with the
arserverd process.
 
Cheers
Ben



Re: Prod server down - services will not stay up

2009-11-13 Thread Susan Palmer
Thanks Ben

We're having problems determining where the 11 is coming from.



Re: Prod server down - services will not stay up

2009-11-13 Thread Darrell Reading
Did someone change a kernel setting on the system last night when they did
the DB changes?  What size is the executable upon getting the SEGV?  If
Remedy has been running fine up to this point and is now crashing that
often, then something changed.  
 

Darrell Reading Systems Engineer 
Phone 479.204.5739 
dere...@wal-mart.com 

Wal-Mart Stores, Inc. 
805 Moberly Lane, MS-0560-68 
Bentonville, AR 72716 
Save Money. Live Better 

 




Re: Prod server down - services will not stay up

2009-11-13 Thread Ben Chernys
These are a bit of a pain to solve.  SQL logging on startup is the key.  The
logs are quite big but usually the last lines will be pertinent.  You also
need to know the database structure of the meta-data - given by the database
reference guide.
 
I'm afraid that these types of problems are not likely to get solved whilst
in a hotel room as once you have the idea of where the problem lies (through
the log) you then need to research the meta-data itself.  The sql log will
simply let you know the meta-data table that was last read and not which
record of that table caused the server to crash.
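
To illustrate the point that the log identifies the last table read rather than the offending record, here is a minimal Python sketch. It assumes the log tail contains plain SQL text; the table names (meta_forms, meta_fields) and the helper last_table_touched are hypothetical and are not taken from the ARS schema or its log format.

```python
import re

# Matches the identifier following FROM, INTO or UPDATE in a SQL statement.
TABLE_RE = re.compile(
    r"\b(?:FROM|INTO|UPDATE)\s+([A-Za-z_][A-Za-z0-9_$#]*)", re.IGNORECASE
)


def last_table_touched(lines):
    """Return the first table named in the last SQL statement found, or None."""
    for line in reversed(lines):
        matches = TABLE_RE.findall(line)
        if matches:
            return matches[0]
    return None


# Hypothetical tail of a SQL log; real ARS log lines carry extra prefixes.
sample_tail = [
    "SELECT id, name FROM meta_forms ORDER BY id",
    "SELECT form_id, field_id FROM meta_fields WHERE form_id = 42",
    "OK",
]
table = last_table_touched(sample_tail)
print(table)
```

This only narrows the search to a table. Finding the bad row still means querying that table directly and comparing it against the database reference guide.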
 
7.0.1 p2 seems a little low.  Is it possible to patch the binaries?  
 
It is unlikely that simply allocating more database space is the problem.
You could also look at the temporary space and see that it was increased but
I would go with the logs first.
 
2 - 10 minutes will most surely be in the server's initial processing of the
meta-data. 
Cheers
Ben



Re: Prod server down - services will not stay up

2009-11-13 Thread Ben Chernys
In the arerror log, you will see a line that has nothing in it except the
11.  It is the arserverd daemon. 
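
A quick way to locate such lines is sketched below in Python. It assumes, as described above, that the signal shows up as a line containing only the number. The sample entries are placeholders rather than real arerror.log content, and find_bare_signal_lines is a name made up for this example.

```python
def find_bare_signal_lines(lines, signal_number=11):
    """Return 1-based line numbers whose only content is the signal number."""
    target = str(signal_number)
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() == target
    ]


# Placeholder log content: two ordinary entries and two bare "11" lines.
sample = ["<entry for error 91>", "11", "<entry for error 91>", "  11  "]
hits = find_bare_signal_lines(sample)
print(hits)
```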
 



Re: Prod server down - services will not stay up

2009-11-13 Thread Susan Palmer
We've turned on both sql and api logging now to capture the next event.

How would the db space affect this?  They actually just expanded it last
night.

Working with the unix guys in the office and support, just not the same when
you're not there.





Re: Prod server down - services will not stay up

2009-11-13 Thread Darrell Reading
More specifically, check maxsiz.  Is the arserverd core dumping?  What
size is the core?
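
Whether a core file can be written at all depends partly on the core size limit of the environment that starts the server. The Python sketch below reads that limit for the current process through the standard resource module (Unix only). It is an assumption that arserverd inherits comparable limits from its start-up shell, and the helper name core_limit_description is invented for this example.

```python
import resource


def core_limit_description():
    """Describe the soft and hard core-file size limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def fmt(value):
        # RLIM_INFINITY is the sentinel for "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    # A soft limit of zero means no core file will be written.
    state = "core dumps disabled" if soft == 0 else "core dumps allowed"
    return state, fmt(soft), fmt(hard)


state, soft_text, hard_text = core_limit_description()
print(state, soft_text, hard_text)
```

If the soft limit is zero, a missing core file says nothing about the crash itself, only that dumps were switched off.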
 

Darrell Reading Systems Engineer 
Phone 479.204.5739 
dere...@wal-mart.com 

Wal-Mart Stores, Inc. 
805 Moberly Lane, MS-0560-68 
Bentonville, AR 72716 
Save Money. Live Better 

 




Re: Prod server down - services will not stay up - RESOLVED

2009-11-13 Thread Susan Palmer
Thank you all for your suggestions.

It was actually an existing problem that manifested itself in a way that it
hasn't in 7 years.  It was fairly unrecognizable.  What was behind that is
unknown.

The short of it is, we have one form that due to 'bad' business reasons has
two ranges of field id 1.  On rare occasion over the last 7 years the next
ID gets incorrectly set by ar/database.  That story is too long to go into;
in fact there are probably some archived emails here where I've gone into it
in depth.  Normally it manifests itself by giving a unique key error.  We
correct the problem (less than 1 min) and life goes on.  No impact to anyone
except the one user.
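
The failure mode described above, a next-ID counter that lands on a value already in use, can be sketched in a few lines of Python. The ID values, range boundaries and function names are hypothetical. The thread does not say how the counter is corrected, so safe_next_id is only one plausible remedy (move the counter above every used ID).

```python
def next_id_collides(next_id, existing_ids):
    """True when the pending ID is already used by a stored record."""
    return next_id in existing_ids


def safe_next_id(existing_ids):
    """Smallest ID strictly above every ID already in use."""
    return max(existing_ids, default=0) + 1


# Hypothetical data: two separate ranges of IDs living in one form.
low_range = set(range(1, 6))           # 1..5
high_range = set(range(1000, 1004))    # 1000..1003
used = low_range | high_range

stale_counter = 1002                   # counter wrongly reset into a used range
print(next_id_collides(stale_counter, used))
print(safe_next_id(used))
```

A collision like this would normally surface as the unique key error mentioned above; in this incident it apparently took the server down instead.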

Today we had the double unique situation of the ID reset and it took the
full system down.  When the system came back up and the person entering the
offending record hit save again, the system went down again.  She was very
persistent and tried about 20 times.  We never did see a unique key error so
we didn't look in that direction.

Last June we built all new servers, all new databases ... everything new.
And we still have the problem.  To add a little twist to it, today we did an
experiment on the dev server.  We've never had the issue there because the
thing we know triggers the ID issue doesn't happen in dev.  When we
simulated the problem there, guess what ... it didn't happen on dev!

Same workflow, same forms, supposedly same server builds, supposedly same
database  builds.

Well we all have our little fluke things to live with.  Hopefully this fluke
will be going away next week because we've finally convinced the 'business'
to stop a bad practice and we're just waiting on a final test.  I hope
they're not just holding the proverbial 'tease' out there and then going to
yank it back.  That sounds a little harsh, but 4 hours troubleshooting today
can be harsh too!

Everyone have a great weekend!  And if you were at WWRUG with me, it was
a great RUG and I'm already looking forward to next year!!  Great job by
Dan, Lenny, Joel and Phil ... we do appreciate the effort!

Thanks,
Susan
