IMO thread dumps are a necessity for almost all OOM stories. Furthermore, a single thread dump does not give a complete view of the system's threads. There should be multiple thread dumps, taken at a predefined interval, to understand how the internals behaved before and during such an error.
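
To illustrate the "multiple dumps at a predefined interval" idea, here is a rough sketch only, not part of the proposed tool. It assumes the JDK's jstack command is on the PATH of the machine running it. The names jstack_dump and collect_thread_dumps, and the keep_last window that bounds how many dumps are retained, are hypothetical.

```python
import subprocess
import time
from collections import deque
from typing import Callable, Deque, Tuple


def jstack_dump(pid: int) -> str:
    """Capture one thread dump by running the JDK's `jstack <pid>`."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect_thread_dumps(
    take_dump: Callable[[], str],
    count: int,
    interval_seconds: float,
    keep_last: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> Deque[Tuple[float, str]]:
    """Take `count` dumps, `interval_seconds` apart.

    Only the newest `keep_last` (timestamp, dump) pairs are retained, so
    storage stays bounded even if this is run over a long period.
    """
    window: Deque[Tuple[float, str]] = deque(maxlen=keep_last)
    for i in range(count):
        window.append((clock(), take_dump()))
        if i < count - 1:  # no need to wait after the final dump
            sleep(interval_seconds)
    return window
```

A call such as collect_thread_dumps(lambda: jstack_dump(12345), count=5, interval_seconds=10, keep_last=3) would take five dumps ten seconds apart and keep the last three. The PID is a placeholder.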
There is an issue here though. We wouldn't know in advance that an error is coming, so we couldn't time multiple thread dumps around it. One option is to take thread dumps continuously on a suspected system. Taking a thread dump usually costs very little CPU time, so we might want to look into that option.

On the other hand, I'm not sure automated heap dumps would be an ideal step during a service degradation/downtime. Taking a heap dump is a resource-hungry operation that sometimes takes multiple minutes. If resources are already taxed, this could very well result in a dead system.

Additionally, the standard approach for a feedback cycle like this (error -> trigger -> basic diagnostics) is to enable it *outside* the system, i.e. a tool that sits outside the (say) IS cluster. That tool would also feed back into a state machine (an autoscaling system or a node count maintainer) that spawns new healthy instances while the diagnostics are happening on the erroneous node (e.g. a system designed around CloudWatch Alarms). Though I'm not sure if we want to consider such a wide scope here.

All in all, the advantages I see from this tool are:
1. Ability to specify <product> specific stories as triggers
2. WSO2 specific diagnostic collection

Are these the only goals in mind? Furthermore, have we looked into existing tools that match these requirements? If so, what tools did we evaluate?

Regards,
Chamila de Alwis
Committer and PMC Member - Apache Stratos
Associate Technical Lead | WSO2
+94 77 220 7163
Blog: https://medium.com/@chamilad

On Thu, Oct 4, 2018 at 2:07 PM Thumilan Mikunthan <[email protected]> wrote:

> Hi all,
>
> IMHO
>
>> 1) In our WSO2 servers startup script, we do have below java props [1],
>> which basically can create a heap dump when the server has gone OOM.
>> Therefore, I believe here you are trying to solve the problem that the
>> server continues to run, although there is an OOM.
>> IMHO logs are not a suitable mechanism to find whether the system has
>> gone OOM, because we can't certainly produce all kinds of logs for OOM
>> error. And also in the proposed method, we can only solve the problem
>> after it has occurred (i.e., incur system outage), and we can't prevent
>> it. IMHO, running the system/JVM monitoring tool which can monitor and
>> alert after exceeding some percentage of memory usage is the better
>> solution to solve this problem.
>>
>> 2) Thread dumps are mostly related to slow response (sometimes no
>> response) from the server, and I'm not sure how we can get these details
>> from the logs. And we need to intelligently handle the logs, just because
>> of some request timeout that doesn't mean that we need to take the thread
>> dump, and it can be simply some backend service is down.
>
> +1 for 2). The tool reads all errors, but before analysing an error it
> validates whether the captured error log line is good enough to do
> diagnostics.
>
> For question 1, let me explain an error scenario.
>
> Error - OOM error: java.lang.OutOfMemoryError:
> unable to create new native thread
>
> The WSO2 IS server does a heap dump because the general error type is OOM.
> But we need thread dumps along with the heap dump to resolve the error.
> The tool reads the error line, works out the suitable diagnostics while
> analyzing it, and finally runs those diagnostics.
>
> For common OOM scenarios, doing a heap dump is enough. But in exceptional
> scenarios like the one above, the error cannot be resolved with a heap
> dump alone. So the tool reads the error log line and, beyond the memory
> dump, does further diagnostics such as lsof or a thread dump.
>
> Finally, the end user can get all the required diagnostics *at once*.
>
>> 3) We have carbon-dump.sh which can dump all the thread-dump, heap-dump,
>> relevant details about the server. Can't we use that for this purpose?
>
> +1 for 3).
>
> Thank You,
>
> M.Thumilan
>
>
> On Thu, Sep 6, 2018 at 4:15 PM Sinthuja Rajendran <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have a few questions/concerns, as stated below.
>>
>> 1) In our WSO2 servers startup script, we do have below java props [1],
>> which basically can create a heap dump when the server has gone OOM.
>> Therefore, I believe here you are trying to solve the problem that the
>> server continues to run, although there is an OOM. IMHO logs are not a
>> suitable mechanism to find whether the system has gone OOM, because we
>> can't certainly produce all kinds of logs for OOM error. And also in the
>> proposed method, we can only solve the problem after it has occurred
>> (i.e., incur system outage), and we can't prevent it. IMHO, running the
>> system/JVM monitoring tool which can monitor and alert after exceeding
>> some percentage of memory usage is the better solution to solve this
>> problem.
>>
>> 2) Thread dumps are mostly related to slow response (sometimes no
>> response) from the server, and I'm not sure how we can get these details
>> from the logs. And we need to intelligently handle the logs, just because
>> of some request timeout that doesn't mean that we need to take the thread
>> dump, and it can be simply some backend service is down.
>>
>> 3) We have carbon-dump.sh which can dump all the thread-dump, heap-dump,
>> relevant details about the server. Can't we use that for this purpose?
>>
>> [1] -XX:+HeapDumpOnOutOfMemoryError \
>> -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>>
>> Thanks,
>> Sinthuja.
>>
>> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> *Problem*
>>>
>>> Whenever an error occurs, certain diagnostic actions (depending on that
>>> error) can help to diagnose it.
>>>
>>> For example,
>>>
>>> - If an OOM (Out Of Memory) error occurred, a heap dump will help to
>>>   analyse the memory leak.
>>> - If some threads are blocked abnormally, analyzing a thread dump could
>>>   help to solve the problem.
>>>
>>> But in a real scenario, doing these diagnostic actions manually may not
>>> be possible because
>>>
>>> - It is impossible to predict when the error will come.
>>> - Diagnostic actions vary depending on the error, and expecting the
>>>   user to know about all error scenarios is impossible.
>>> - The user prefers to get help from the support team instead of
>>>   solving the error himself/herself.
>>>
>>>
>>> *Solution*
>>>
>>> Design a standalone tool with a small memory footprint (<8%) and low
>>> CPU usage (<8%) which has the following workflow.
>>>
>>> - Log Tailer tails the carbon.log file in real time.
>>> - Match Rule Engine checks whether the current log line matches an
>>>   error regex.
>>> - The tool has to read the error regexes from a separate XML file.
>>> - Interpreter identifies the error type and performs the actions for
>>>   that error.
>>> - Each action should be handled by a separate action executor.
>>> - The mapping between errors and actions should be written in a
>>>   separate XML file.
>>> - All the diagnostics files (e.g. thread dumps and heap dumps) for a
>>>   particular error should be created under one folder, and the folder
>>>   should be zipped.
>>> - Each folder can be identified by its time instance.
>>>
>>>
>>> *Architecture Diagram*
>>> [image: ArchitectureDiagram.png]
>>>
>>> *Sample Scenario*
>>>
>>> Assume that a client reports an issue about an OOM error. He usually
>>> attaches the carbon.log file along with the issue. But in order to
>>> solve the problem, the support team needs a thread dump and a heap
>>> dump. So the team asks the client to take those dumps next time. The
>>> client has to wait for the next occurrence and take those dumps. (We
>>> can’t expect the client to watch the server all the time and get dumps
>>> when the error occurs. What if the next error occurs at midnight?) The
>>> support team has to wait for the update on that issue.
>>> So they put the issue on pause and move on.
>>>
>>> Now consider the above problem scenario with this tool. Once the error
>>> occurs, the tool will take the necessary diagnostic actions and zip the
>>> folder. The client can upload that zip file with the issue so that the
>>> support team doesn’t need the client to do those diagnostic actions
>>> himself. The support team is able to work on that issue directly
>>> without expecting any updates from the client.
>>>
>>> The next time the error occurs (even at midnight), the tool can detect
>>> the error and send the necessary files to the support team directly for
>>> further analysis.
>>>
>>> Since the tool’s memory footprint is small, the client can run the tool
>>> without any objection.
>>>
>>> The tool reduces the client’s involvement in WSO2 IS errors so that the
>>> client can focus on their business. The tool also helps to reduce the
>>> time needed to solve the issue, because the support team can get all
>>> the necessary diagnostic files at once, in the initial conversation.
>>>
>>> Please give feedback regarding this architecture.
>>>
>>> Best Regards,
>>> M.Thumilan
>>>
>>
>>
>> --
>> *Sinthuja Rajendran*
>> Senior Technical Lead
>> WSO2, Inc.:http://wso2.com
>>
>> Blog: http://sinthu-rajan.blogspot.com/
>> Mobile: +94774273955
>>
>>
>>
>
> --
> Best Regards,
> M.Thumilan
>
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
