Re: Prod server down - services will not stay up
The signal 11 is bad code - simple as that. It's a segmentation violation, which means that the server (arserverd) attempted to read or write an address not allocated to its virtual address space. It can also be caused by a double free, or by two pointers to one block which has been freed. In any event, you cannot fix this without the ARS source code, which I expect you would find hard to get.

That being said, the easiest way to determine (and then circumvent) these types of things is to turn on SQL logging on the server before the system starts (through the ar.conf file). The exact settings are in the Configuring ARS guide. Then, when the blow-up happens, see what the server was attempting to do. You can usually spot possible internal database inconsistencies (in the ARS meta-data) this way and then repair them manually through SQL before the ARS start-up.

Additionally, there may be patches available that address the problem.

Cheers
Ben Chernys

From: Action Request System discussion list(ARSList) [mailto:arsl...@arslist.org] On Behalf Of Susan Palmer
Sent: November 13, 2009 8:30 PM
To: arslist@ARSLIST.ORG
Subject: Prod server down - services will not stay up

Help !!

Working with support but could use anyone else's input. I'm at WWRUG so it's somewhat limiting.

We did a truss log, and when the services drop (arerror 91) we see the following:

167 /11:read(54, \0FE\0\006\0\0\0\0\01017.., 2064)= 254
/11:write(54, \0A1\0\006\0\0\0\0\003 ^.., 161)= 161
/11:read(54, \0F7\0\006\0\0\0\0\01017.., 2064)= 247
/11:Incurred fault #6, FLTBOUNDS %pc = 0xFE6A3558
/11: siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
/11:Received signal #11, SIGSEGV [caught]
/11: siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C

The services do restart automatically, so armonitor is doing its job. We've commented out everything from armonitor but the arserverd command. We stay up for between 2 and 10 minutes and then wham, we're down again. Obviously this just started this morning.

Environment: Unix (Sun Solaris 10), Oracle 10g, ARS 7.0.1P2

They did expand the database size last night, if that has any bearing. But we can connect to the database successfully when ar is down. Nothing else helpful in arerror.log, only the 91 error.

I'm at the Hardrock hotel, call room 30601 if you have questions or can help!

Thanks,
Susan

UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
Platinum Sponsor: rmisoluti...@verizon.net ARSlist: Where the Answers Are
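For anyone wading through a long truss capture, a minimal Python sketch along these lines could pull out just the fault records. It assumes the line shapes shown in the truss excerpt in this thread and assumes the `/11` prefix is the LWP id; the function and variable names are made up for illustration, and real truss output may carry extra fields this does not handle.

```python
import re

# Fault lines copied from the truss excerpt in this thread.
TRUSS_SAMPLE = """
/11:Incurred fault #6, FLTBOUNDS %pc = 0xFE6A3558
/11: siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
/11:Received signal #11, SIGSEGV [caught]
/11: siginfo: SIGSEGV SEGV_MAPERR addr=0xFB47FB4C
"""

# Patterns are an assumption based only on the sample lines above.
FAULT_RE = re.compile(
    r"/(\d+):\s*Incurred fault #(\d+), (\w+)\s+%pc = (0x[0-9A-Fa-f]+)"
)
SIGINFO_RE = re.compile(
    r"/(\d+):\s*siginfo: (\w+) (\w+) addr=(0x[0-9A-Fa-f]+)"
)


def summarize_faults(text):
    """Return one dict per 'Incurred fault' line, paired with the first
    siginfo line that follows it on the same LWP."""
    faults = []
    for line in text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append({
                "lwp": int(match.group(1)),
                "fault": match.group(3),
                "pc": match.group(4),
                "signal": None,
                "code": None,
                "addr": None,
            })
            continue
        match = SIGINFO_RE.search(line)
        if match and faults:
            last = faults[-1]
            # Only fill in the first siginfo after a fault; later repeats are skipped.
            if last["addr"] is None and last["lwp"] == int(match.group(1)):
                last["signal"] = match.group(2)
                last["code"] = match.group(3)
                last["addr"] = match.group(4)
    return faults


if __name__ == "__main__":
    for fault in summarize_faults(TRUSS_SAMPLE):
        print(fault)
```

The program counter and the faulting address are the two values support is most likely to ask for, so the sketch keeps them together in one record.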
Re: Prod server down - services will not stay up
PS. The 91 is a red herring. It's the Sig 11 (SEGV) you need to worry about. The 91 is another process not being able to communicate with the arserverd process.

Cheers
Ben
Re: Prod server down - services will not stay up
Thanks Ben.

We're having problems determining where the 11 is coming from.
Re: Prod server down - services will not stay up
Did someone change a kernel setting on the system last night when they did the DB changes? What size is the executable upon getting the SEGV? If Remedy has been running fine up to this point and is now crashing that often, then something changed.

Darrell Reading
Systems Engineer
Phone 479.204.5739
dere...@wal-mart.com
Wal-Mart Stores, Inc.
805 Moberly Lane, MS-0560-68
Bentonville, AR 72716
Re: Prod server down - services will not stay up
These are a bit of a pain to solve. SQL logging on startup is the key. The logs are quite big, but usually the last lines will be pertinent. You also need to know the database structure of the meta-data, which is given in the database reference guide.

I'm afraid that these types of problems are not likely to get solved whilst in a hotel room: once you have an idea of where the problem lies (through the log), you then need to research the meta-data itself. The SQL log will simply let you know the meta-data table that was last read, not which record of that table caused the server to crash.

7.0.1 P2 seems a little low. Is it possible to patch the binaries?

It is unlikely that simply allocating more database space is the problem. You could also look at the temporary space and check that it was increased, but I would go with the logs first. 2 - 10 minutes will most surely be in the server's initial processing of the meta-data.

Cheers
Ben
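As a rough illustration of "the last lines will be pertinent", here is a small Python sketch that reads only the tail of a large log and reports the last table named in a FROM clause. The log path in the usage note and both function names are hypothetical, and the regular expression assumes the SQL statements appear in the log as ordinary SQL text; the actual ARS SQL log layout is not shown in this thread.

```python
import re
from collections import deque

# Matches the table name after FROM in ordinary SQL text (an assumption
# about how statements appear in the log).
FROM_RE = re.compile(r"\bFROM\s+([A-Za-z_][A-Za-z0-9_$#]*)", re.IGNORECASE)


def last_lines(path, count=50):
    """Return the final `count` lines of a text log, holding only that
    many lines in memory while reading."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def last_table_read(lines):
    """Return the table named in the last FROM clause, or None if the
    lines contain no FROM clause at all."""
    for line in reversed(lines):
        found = FROM_RE.findall(line)
        if found:
            return found[-1]
    return None
```

Usage would be something like `last_table_read(last_lines("/path/to/sql.log", 200))`, with the path replaced by wherever the SQL log is written on the server. This only names the table; as noted above, finding the offending record still means researching the meta-data by hand.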
Re: Prod server down - services will not stay up
In the arerror log, you will see a line that has nothing in it except the 11. It is the arserverd daemon.
Re: Prod server down - services will not stay up
We've turned on both SQL and API logging now to capture the next event.

How would the DB space affect this? They actually just expanded it last night.

Working with the Unix guys in the office and with support; it's just not the same when you're not there.
Re: Prod server down - services will not stay up
More specifically, check maxsiz. Is the arserverd core dumping? What size is the core?

Darrell Reading
Re: Prod server down - services will not stay up - RESOLVED
Thank you all for your suggestions.

It was actually an existing problem that manifested itself in a way that it hasn't in 7 years. It was fairly unrecognizable. What was behind that is unknown.

The short of it is, we have one form that, due to 'bad' business reasons, has two ranges of field ID 1. On rare occasions over the last 7 years the next ID gets incorrectly set by AR/the database. That story is too long to go into; in fact there are probably some archived emails here where I've gone into it in depth. Normally it manifests itself by giving a unique key error. We correct the problem (less than 1 min) and life goes on. No impact to anyone except the one user.

Today we had the doubly unique situation of the ID reset, and it took the full system down. When the system came back up and the person entering the offending record hit save again, the system went down again. She was very persistent and tried about 20 times. We never did see a unique key error, so we didn't look in that direction.

Last June we built all new servers, all new databases ... everything new. And we still have the problem.

To add a little twist to it, today we did an experiment on the dev server. We've never had the issue there because the thing we know triggers the ID issue doesn't happen in dev. When we simulated the problem there, guess what ... it didn't happen on dev! Same workflow, same forms, supposedly same server builds, supposedly same database builds.

Well, we all have our little fluke things to live with. Hopefully this fluke will be going away next week because we've finally convinced the 'business' to stop a bad practice and we're just waiting on a final test. I hope they're not just holding the proverbial 'tease' out there and then going to yank it back. That sounds a little harsh, but 4 hours of troubleshooting today can be harsh too!

Everyone have a great weekend! And if you were at WWRUG with me, it was a great RUG and I'm already looking forward to next year!! Great job by Dan, Lenny, Joel and Phil ... we do appreciate the effort!

Thanks,
Susan
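The kind of check described above (is the next ID about to land on a record that already exists?) can be sketched in Python against an in-memory SQLite database. Everything here is hypothetical: the `form_meta` and `form_data` tables, their columns and the "orders" form are invented for illustration and are not the real ARS meta-data layout, which is documented in the database reference guide.

```python
import sqlite3


def build_demo():
    """Create a toy database: one form whose IDs sit in two ranges,
    with the next ID wrongly pointing at an existing record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE form_meta (form_name TEXT PRIMARY KEY, next_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE form_data (form_name TEXT, request_id INTEGER, "
        "UNIQUE (form_name, request_id))"
    )
    rows = [("orders", 1), ("orders", 2), ("orders", 3),
            ("orders", 5000), ("orders", 5001)]
    conn.executemany("INSERT INTO form_data VALUES (?, ?)", rows)
    conn.execute("INSERT INTO form_meta VALUES ('orders', 3)")  # 3 is already taken
    return conn


def next_id_collides(conn, form_name):
    """Return True if the form's next ID is already used by an existing record."""
    row = conn.execute(
        "SELECT EXISTS ("
        "  SELECT 1 FROM form_data d"
        "  JOIN form_meta m ON m.form_name = d.form_name"
        "  WHERE m.form_name = ? AND d.request_id = m.next_id"
        ")",
        (form_name,),
    ).fetchone()
    return bool(row[0])


if __name__ == "__main__":
    demo = build_demo()
    print(next_id_collides(demo, "orders"))
```

Checking for an exact collision, rather than comparing against the maximum ID, is deliberate: with two ID ranges on one form, a next ID below the highest existing value can still be perfectly valid.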