Sorry for the late reply, I missed this thread :-) I have seen a demo of this and it looks interesting.
Shammi, I think we need to separate out the two concerns of action execution and triggering. Provided that we can get this tool to effectively analyze the logs and take the necessary actions without any lag, the triggers are really a question of improving the logs to make sure we have the right ones.

I like the fact that this tool can run external to the VM which is running the product, and I would keep it doing that one thing, for stability's sake. (We cannot get into a situation where the tool crashes before the product does :-), so it should do the absolute minimum in triggering and actions.) So I would say we need some metrics or analytics task reporting these conditions (CPU utilization reaching 80%, thread count going beyond a threshold, response time going outside the agreed threshold/SLA, etc.) as logs to a log file, with the tool monitoring that file.
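
To make the idea concrete, a minimal sketch of the kind of metrics task I mean, running inside the product JVM (the file name, log format, and interval here are only illustrative, not a proposal):

import com.sun.management.OperatingSystemMXBean;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Appends one machine-readable line per sample to a metrics log; the
// external tool only has to tail and pattern-match this file.
public class MetricsReporterTask {
    public static void main(String[] args) {
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            try (PrintWriter out = new PrintWriter(new FileWriter("metrics.log", true))) {
                long used = memory.getHeapMemoryUsage().getUsed();
                long max = memory.getHeapMemoryUsage().getMax();
                out.printf("%d METRICS cpu=%.1f threads=%d heapUsedPct=%.1f%n",
                        System.currentTimeMillis(),
                        os.getProcessCpuLoad() * 100, // CPU share of this JVM process
                        threads.getThreadCount(),     // live thread count
                        100.0 * used / max);          // heap usage percentage
            } catch (Exception e) {
                // the reporter must never take the product down with it
            }
        }, 0, 15, TimeUnit.SECONDS);
    }
}

The external tool then only has to tail metrics.log and pattern-match those lines, which keeps its triggering side minimal.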
Just my 2 cents. To me this is a simple, practical tool that we can use to support the products; I think we should keep the simplicity without making it complex.

Ruwan

On Mon, Oct 8, 2018 at 11:16 PM Shammi Jayasinghe <[email protected]> wrote:

> Hi Thumilan,
>
> Have we already implemented this, or what is the current status?
>
> In a case like "unable to create new native thread", most of the time this
> is due to the open file limit of the OS. So, while tailing and parsing the
> logs through the analyzer, we can program it to check the open file limits
> in the OS.
>
> However, as everybody pointed out above, I would suggest modifying the
> trigger algorithm so that:
> - It monitors memory usage
> - It monitors CPU usage
> - It monitors logs
>
> *Memory usage:* When the heap usage reaches 85%+ (or any reasonable
> threshold value), the tool needs to capture a heap dump automatically.
> With the *HeapDumpOnOutOfMemoryError* property, the dump is captured only
> when the JVM has already started to throw OOM errors; at that point the
> action may fail because of resource limitations on the server, or because
> the process is not responding at all. So, if we can capture the heap dump
> when the usage exceeds a given threshold value, that would be ideal.
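>
> To illustrate, roughly what I have in mind (only a sketch; the threshold
> value and the dump file name are illustrative):
>
> import com.sun.management.HotSpotDiagnosticMXBean;
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryUsage;
>
> // Capture an .hprof dump *before* the JVM actually runs out of heap.
> public class ThresholdHeapDump {
>     private static final double THRESHOLD = 0.85; // 85% of max heap
>
>     public static void checkAndDump() throws Exception {
>         MemoryUsage heap =
>                 ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
>         if ((double) heap.getUsed() / heap.getMax() >= THRESHOLD) {
>             HotSpotDiagnosticMXBean diag = ManagementFactory
>                     .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>             // 'true' = dump live objects only (triggers a full GC first)
>             diag.dumpHeap("heap-" + System.currentTimeMillis() + ".hprof", true);
>         }
>     }
> }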
>
> *CPU usage:* When the CPU usage of the WSO2 server's Java process exceeds
> a given threshold continuously for 5 minutes or more (a configurable time
> period and maximum threshold value), we need to capture both of the
> following, with a gap of 1 minute or so between them:
> - Thread usage
> - Thread dump
>
> *Log monitoring:* According to our support experience, IS throws most of
> its OOM exceptions due to user store related activities such as user
> creation/loading scenarios. If you can extract such exceptions and use
> them in the trigger algorithms for capturing system information, that
> would be ideal.
>
> Thanks,
> Shammi
>
> On Mon, Oct 8, 2018 at 6:08 AM, Chamila De Alwis <[email protected]> wrote:
>
>> IMO thread dump(s) are a necessity for almost all OOM stories.
>> Furthermore, just one thread dump is not a complete look at the thread
>> view of the system. There should be multiple thread dumps, taken at a
>> predefined interval, to get an understanding of how the internals behaved
>> before and during such an error.
>>
>> There is an issue here, though. We wouldn't have knowledge of an
>> impending error scenario in time to take multiple thread dumps. So one of
>> the options is to keep taking continuous thread dumps on a suspected
>> system. Taking a thread dump usually consumes a very low amount of CPU
>> time, so we might want to look into that option.
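>>
>> For illustration, interval dumps need nothing more than something like
>> this (a sketch; the count and interval would be configurable):
>>
>> import java.lang.management.ManagementFactory;
>> import java.lang.management.ThreadInfo;
>>
>> // Take N thread dumps at a fixed interval, so we can see how the threads
>> // behaved over time rather than at a single instant.
>> public class IntervalThreadDumps {
>>     public static void dump(int count, long intervalMillis)
>>             throws InterruptedException {
>>         for (int i = 0; i < count; i++) {
>>             System.out.println("=== thread dump " + (i + 1) + " ===");
>>             for (ThreadInfo info :
>>                     ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
>>                 // note: ThreadInfo.toString() truncates very deep stacks
>>                 System.out.print(info);
>>             }
>>             Thread.sleep(intervalMillis);
>>         }
>>     }
>> }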
>>
>> On the other hand, I'm not sure automated heap dumps would be an ideal
>> step during a service degradation/downtime. Taking a heap dump is a
>> resource-hungry operation that sometimes takes multiple minutes. If the
>> resources are in an already taxed state, this could very well result in a
>> dead system.
>>
>> Additionally, the standard approach for a feedback cycle like this (error
>> -> trigger -> basic diagnostics) is to run it *outside* the system, i.e.
>> a tool that sits outside the (say) IS cluster. That tool would also feed
>> back into a state machine (an autoscaling system or a node count
>> maintainer) that spawns new healthy instances while the diagnostics are
>> happening on the erroneous node (e.g. a system designed based on
>> CloudWatch Alarms). Though I'm not sure if we want to consider such a
>> wide scope here.
>>
>> All in all, the advantages I see from this tool are:
>> 1. Ability to specify <product> specific stories as triggers
>> 2. WSO2 specific diagnostic collection
>>
>> Are these the only goals in mind?
>>
>> Furthermore, have we looked into existing tools that match these
>> requirements? If so, what tools did we evaluate?
>>
>> Regards,
>> Chamila de Alwis
>> Committer and PMC Member - Apache Stratos
>> Associate Technical Lead | WSO2
>> +94 77 220 7163
>> Blog: https://medium.com/@chamilad
>>
>> On Thu, Oct 4, 2018 at 2:07 PM Thumilan Mikunthan <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> IMHO:
>>>
>>>> 1) In our WSO2 server startup scripts, we already have the Java
>>>> properties below [1], which create a heap dump when the server goes
>>>> OOM. Therefore, I believe the problem you are trying to solve here is
>>>> that the server continues to run although there has been an OOM. IMHO
>>>> logs are not a suitable mechanism for detecting that the system has
>>>> gone OOM, because we cannot be certain that every kind of OOM error
>>>> produces a log entry. Also, with the proposed method we can only act
>>>> after the problem has occurred (i.e., incurred a system outage); we
>>>> cannot prevent it. IMHO, running a system/JVM monitoring tool which can
>>>> monitor and alert once memory usage exceeds some percentage is the
>>>> better solution to this problem.
>>>
>>>> 2) Thread dumps are mostly related to slow responses (sometimes no
>>>> response) from the server, and I'm not sure how we can get these
>>>> details from the logs. We also need to handle the logs intelligently: a
>>>> request timeout alone doesn't mean we need to take a thread dump; it
>>>> may simply be that some backend service is down.
>>>
>>> +1 for 2). The tool reads all errors, but before analyzing an error it
>>> validates whether the captured error log line is good enough to drive
>>> diagnostics.
>>>
>>> For question 1), let me explain an error scenario.
>>>
>>> Error - OOM error:
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>
>>> The WSO2 IS server takes a heap dump because the general error type is
>>> OOM, but we need thread dumps along with the heap dump to resolve this
>>> error. The tool reads the error line, works out the suitable diagnostics
>>> while analyzing it, and finally performs those diagnostics.
>>>
>>> For common OOM scenarios a heap dump is enough. But in exceptional
>>> scenarios like the above, the error cannot be resolved with a heap dump
>>> alone. So the tool reads the error log line and, beyond the memory dump,
>>> performs further diagnostics such as lsof or a thread dump.
>>>
>>> Finally, the end user gets all the required diagnostics *at once*.
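>>>
>>> For illustration, the extra diagnostics for that scenario could be as
>>> simple as shelling out to the existing OS tools (a sketch; the output
>>> file names are arbitrary):
>>>
>>> import java.io.File;
>>>
>>> // For "unable to create new native thread", a heap dump alone is not
>>> // enough: also capture the open file list and a thread dump.
>>> public class NativeThreadOomDiagnostics {
>>>     public static void collect(long pid, File outDir) throws Exception {
>>>         run(outDir, "lsof.txt", "lsof", "-p", String.valueOf(pid));
>>>         run(outDir, "jstack.txt", "jstack", String.valueOf(pid));
>>>     }
>>>
>>>     private static void run(File outDir, String outFile, String... cmd)
>>>             throws Exception {
>>>         new ProcessBuilder(cmd)
>>>                 .redirectErrorStream(true)                 // fold stderr into stdout
>>>                 .redirectOutput(new File(outDir, outFile)) // capture to a file
>>>                 .start()
>>>                 .waitFor();
>>>     }
>>> }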
>>>
>>> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump,
>>> and relevant details about the server. Can't we use that for this
>>> purpose?
>>>
>>> +1 for 3).
>>>
>>> Thank You,
>>> M.Thumilan
>>>
>>> On Thu, Sep 6, 2018 at 4:15 PM Sinthuja Rajendran <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a few questions/concerns, as stated below.
>>>>
>>>> 1) In our WSO2 server startup scripts, we already have the Java
>>>> properties below [1], which create a heap dump when the server goes
>>>> OOM. Therefore, I believe the problem you are trying to solve here is
>>>> that the server continues to run although there has been an OOM. IMHO
>>>> logs are not a suitable mechanism for detecting that the system has
>>>> gone OOM, because we cannot be certain that every kind of OOM error
>>>> produces a log entry. Also, with the proposed method we can only act
>>>> after the problem has occurred (i.e., incurred a system outage); we
>>>> cannot prevent it. IMHO, running a system/JVM monitoring tool which can
>>>> monitor and alert once memory usage exceeds some percentage is the
>>>> better solution to this problem.
>>>>
>>>> 2) Thread dumps are mostly related to slow responses (sometimes no
>>>> response) from the server, and I'm not sure how we can get these
>>>> details from the logs. We also need to handle the logs intelligently: a
>>>> request timeout alone doesn't mean we need to take a thread dump; it
>>>> may simply be that some backend service is down.
>>>>
>>>> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump,
>>>> and relevant details about the server. Can't we use that for this
>>>> purpose?
>>>>
>>>> [1] -XX:+HeapDumpOnOutOfMemoryError \
>>>>     -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>>>>
>>>> Thanks,
>>>> Sinthuja.
>>>>
>>>> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> *Problem*
>>>>>
>>>>> Whenever an error occurs, certain diagnostic actions (depending on the
>>>>> error) can help to diagnose it.
>>>>>
>>>>> For example,
>>>>> - If an OOM (Out Of Memory) error occurs, a heap dump will help to
>>>>>   analyze the memory leak.
>>>>> - If some threads are blocked abnormally, analyzing a thread dump may
>>>>>   solve the problem.
>>>>>
>>>>> But in a real scenario, performing these diagnostic actions manually
>>>>> may not be possible, because:
>>>>> - We are unable to predict when the error will occur.
>>>>> - The diagnostic actions vary with the error, and expecting the user
>>>>>   to know about every error scenario is unrealistic.
>>>>> - The user may prefer to take support from the support team instead of
>>>>>   solving the error himself/herself.
>>>>>
>>>>> *Solution*
>>>>>
>>>>> Design a standalone tool with a small memory footprint (<8%) and low
>>>>> CPU usage (<8%) that has the following workflow (a rough sketch of the
>>>>> tailer and matcher follows this list):
>>>>> - The Log Tailer tails the carbon.log file in real time.
>>>>> - The Match Rule Engine checks whether the current log line matches an
>>>>>   error regex.
>>>>>   - The tool reads the error regexes from a separate XML file.
>>>>> - The Interpreter identifies the error type and performs the actions
>>>>>   for that error.
>>>>>   - Each action is handled by a separate action executor.
>>>>>   - The mapping between errors and actions is kept in a separate XML
>>>>>     file.
>>>>> - All the diagnostic files (e.g., thread dumps and heap dumps) for a
>>>>>   particular error are created under one folder, and the folder is
>>>>>   zipped.
>>>>>   - Each folder can be identified by its timestamp.
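>>>>>
>>>>> A rough sketch of the tailer and matcher core (the regex list would be
>>>>> loaded from the XML file; log rotation handling is omitted):
>>>>>
>>>>> import java.io.RandomAccessFile;
>>>>> import java.util.List;
>>>>> import java.util.regex.Pattern;
>>>>>
>>>>> // Tails carbon.log and hands matching lines to the interpreter.
>>>>> public class LogTailer {
>>>>>     public static void tail(String logFile, List<Pattern> errorPatterns)
>>>>>             throws Exception {
>>>>>         try (RandomAccessFile raf = new RandomAccessFile(logFile, "r")) {
>>>>>             raf.seek(raf.length()); // only new log lines matter
>>>>>             while (true) {
>>>>>                 String line = raf.readLine();
>>>>>                 if (line == null) { Thread.sleep(500); continue; }
>>>>>                 for (Pattern p : errorPatterns) {
>>>>>                     if (p.matcher(line).find()) {
>>>>>                         System.out.println("MATCH: " + line); // interpreter hook
>>>>>                     }
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> On a match, the Interpreter looks up the error-to-action mapping and
>>>>> dispatches to the action executors.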
>>>>>
>>>>> *Architecture Diagram*
>>>>> [image: ArchitectureDiagram.png]
>>>>>
>>>>> *Sample Scenario*
>>>>>
>>>>> Assume a client reports an issue about an OOM error. They usually
>>>>> attach the carbon.log file with the issue, but in order to solve the
>>>>> problem the support team needs a thread dump and a heap dump. So the
>>>>> team asks the client to take those dumps the next time the error
>>>>> occurs, and the client has to wait for that next occurrence. (We can't
>>>>> expect the client to watch the server all the time and take dumps when
>>>>> the error occurs; what if the next error occurs at midnight?) The
>>>>> support team has to wait for an update on the issue, so they put it on
>>>>> pause and move on.
>>>>>
>>>>> Now consider the same scenario with this tool. Once the error occurs,
>>>>> the tool takes the necessary diagnostic actions and zips the folder.
>>>>> The client can upload that zip with the issue, so the support team
>>>>> doesn't need the client to perform those diagnostic actions and can
>>>>> work on the issue directly, without waiting for further updates from
>>>>> the client.
>>>>>
>>>>> The next time the error occurs (even at midnight), the tool can detect
>>>>> it and send the necessary files directly to the support team for
>>>>> further analysis.
>>>>>
>>>>> Since the tool's memory footprint is small, the client can run it
>>>>> without any objection.
>>>>>
>>>>> The tool reduces the client's involvement in WSO2 IS errors so that
>>>>> the client can focus on their business. It also helps to reduce the
>>>>> time needed to solve an issue, because the support team gets all the
>>>>> necessary diagnostic files at once, in the initial conversation.
>>>>>
>>>>> Please give feedback on this architecture.
>>>>>
>>>>> Best Regards,
>>>>> M.Thumilan
>>>>
>>>> --
>>>> *Sinthuja Rajendran*
>>>> Senior Technical Lead
>>>> WSO2, Inc.: http://wso2.com
>>>> Blog: http://sinthu-rajan.blogspot.com/
>>>> Mobile: +94774273955
>>>
>>> --
>>> Best Regards,
>>> M.Thumilan
>
> --
> Best Regards,
> *Shammi Jayasinghe*
> *Senior Technical Lead*
> *WSO2, Inc.*
> *+1-812-391-7730*
> *+1-812-327-3505*
> *http://shammijayasinghe.blogspot.com*

--
Ruwan Linton
Director - Delivery, WSO2; https://wso2.com
Member, Apache Software Foundation; http://www.apache.org
email: [email protected]; cell: +94 77 341 3097; phone: +94 11 2833 436
linkedin: http://www.linkedin.com/in/ruwanlinton

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture