We'll go with the demo tonight and let you know our idea on this.

Thanks,
Shammi
On Tue, Oct 16, 2018 at 8:56 PM Ruwan Linton <[email protected]> wrote:

> Sorry for the late reply, I missed this thread :-) I have seen a demo of
> this, and it looks interesting.
>
> Shammi, I think we need to separate out the two concerns of action
> execution and triggering. Provided that we can get this tool to
> effectively analyze the logs and take the necessary actions without any
> lag, the triggers are really a question of improving the logs to make
> sure we have the right ones.
>
> I like the fact that this tool can run external to the VM which is
> running the product; let's keep it doing one thing, for stabilization
> purposes. (We cannot get into a situation where the tool crashes before
> the product :-) so it should do the absolute minimal functionality in
> triggering and actions.)
>
> So I would say we need some metrics or analytics task reporting these
> events (CPU utilization reaching 80%, thread count going beyond a
> threshold, response time going outside the agreed threshold/SLA, etc.)
> as logs to a log file, with the tool monitoring that file.
>
> Just my 2 cents.
>
> To me this is a simple, practical tool that we can use to support the
> products; I think we should keep the simplicity without making it
> complex.
>
> Ruwan
>
> On Mon, Oct 8, 2018 at 11:16 PM Shammi Jayasinghe <[email protected]> wrote:
>
>> Hi Thumilan,
>>
>> Have we already implemented this, or what is the current status?
>>
>> In a case like "unable to create new native threads", most of the time
>> this is due to the open-files limit of the system. So, while tailing
>> and parsing the logs through the analyzer, we can program it to check
>> the open-files limits in the OS.
>>
>> However, as everybody pointed out above, I would suggest modifying the
>> trigger algorithm so that:
>> - It monitors memory usage
>> - It monitors CPU usage
>> - It monitors logs
>>
>> *Memory usage:* When heap usage exceeds 85% (or any reasonable
>> threshold value), the tool should capture a heap dump automatically.
>> With the *HeapDumpOnOutOfMemoryError* property, the JVM only captures
>> the dump once it has already started to throw OOM errors. At that
>> point, there is a chance it cannot perform the action due to resource
>> limitations on the server, or because the process is not responding at
>> all. So capturing the heap dump when usage exceeds a given threshold
>> would be ideal.
>>
>> *CPU usage:* When the CPU usage of the WSO2 server's Java process
>> exceeds a given threshold continuously for 5 minutes or more (a
>> configurable time period and maximum threshold value), we need to take
>> both of the following, with a gap of about 1 minute:
>> - Thread usage
>> - Thread dump
>>
>> *Log monitoring:* According to our support experience, IS throws most
>> OOM exceptions due to user-store-related activities such as user
>> creation/loading scenarios. If you can extract such exceptions and use
>> them in the trigger algorithms for capturing system information, that
>> would be ideal.
>>
>> Thanks,
>> Shammi
>>
>> On Mon, Oct 8, 2018 at 6:08 AM, Chamila De Alwis <[email protected]> wrote:
>>
>>> IMO thread dump(s) are a necessity for almost all OOM stories.
>>> Furthermore, just one thread dump is not a complete look at the thread
>>> view of the system. There should be multiple thread dumps, taken at a
>>> predefined interval, to get an understanding of how the internals
>>> behaved before and during such an error.
>>>
>>> There is an issue here, though. We wouldn't have knowledge of an
>>> impending error scenario in time to take multiple thread dumps. So one
>>> of the options is to keep taking continuous thread dumps for a
>>> suspected system.
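The trigger thresholds Shammi proposes above (a heap dump at ~85% heap usage, thread dumps once CPU stays above a threshold for 5 minutes) can be sketched as simple decision functions. This is a hedged illustration only; the function names, the 10-second sampling interval, and the default values are assumptions, not part of any existing tool:

```python
# Hedged sketch of the trigger rules discussed above; thresholds, the 10 s
# sampling interval, and all names are illustrative assumptions.
MEM_THRESHOLD = 0.85        # heap-usage fraction that triggers a heap dump
CPU_THRESHOLD = 0.90        # CPU fraction that must be sustained
CPU_SUSTAIN_SECS = 5 * 60   # configurable sustain window (5 minutes)

def should_dump_heap(heap_used, heap_max, threshold=MEM_THRESHOLD):
    """Trigger a heap dump proactively, before OutOfMemoryError is thrown."""
    return heap_used / heap_max >= threshold

def should_dump_threads(cpu_samples, threshold=CPU_THRESHOLD,
                        sustain_secs=CPU_SUSTAIN_SECS, interval_secs=10):
    """cpu_samples: CPU readings taken every interval_secs, newest last.
    Fire only if CPU stayed above threshold for the whole sustain window."""
    needed = sustain_secs // interval_secs
    recent = cpu_samples[-needed:]
    return len(recent) >= needed and all(s >= threshold for s in recent)
```

Requiring the CPU threshold to hold for the full window avoids firing expensive diagnostics on a momentary spike, which matches the "continuously for 5 mins" wording above.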
>>> Taking a thread dump usually consumes a very low amount of CPU time,
>>> so we might want to look into that option.
>>>
>>> On the other hand, I'm not sure automated heap dumps would be an ideal
>>> step during a service degradation/downtime. Taking a heap dump is a
>>> resource-hungry operation that sometimes takes multiple minutes. If
>>> the resources are in an already taxed state, this could very well
>>> result in a dead system.
>>>
>>> Additionally, the standard approach for a feedback cycle like this
>>> (error -> trigger -> basic diagnostics) is to enable it *outside* the
>>> system, i.e. a tool that sits outside the (say) IS cluster. That tool
>>> would also feed back into a state machine (an autoscaling system or a
>>> node-count maintainer) that spawns new healthy instances while the
>>> diagnostics are happening on the erroneous node (e.g. a system
>>> designed based on CloudWatch Alarms). Though I'm not sure if we want
>>> to consider such a wide scope here.
>>>
>>> All in all, the advantages I see from this tool are:
>>> 1. Ability to specify <product>-specific stories as triggers
>>> 2. WSO2-specific diagnostic collection
>>>
>>> Are these the only goals in mind?
>>>
>>> Furthermore, have we looked into existing tools that match these
>>> requirements? If so, what tools did we evaluate?
>>>
>>> Regards,
>>> Chamila de Alwis
>>> Committer and PMC Member - Apache Stratos
>>> Associate Technical Lead | WSO2
>>> +94 77 220 7163
>>> Blog: https://medium.com/@chamilad
>>>
>>> On Thu, Oct 4, 2018 at 2:07 PM Thumilan Mikunthan <[email protected]> wrote:
>>>
>>>> Hi all,
>>>> IMHO
>>>>
>>>>> 1) In our WSO2 servers' startup script, we do have the Java props
>>>>> below [1], which can create a heap dump when the server has gone
>>>>> OOM. Therefore, I believe here you are trying to solve the problem
>>>>> that the server continues to run although there is an OOM.
>>>>> IMHO logs are not a suitable mechanism to find out whether the
>>>>> system has gone OOM, because we can't reliably produce every kind
>>>>> of log for an OOM error. Also, with the proposed method we can only
>>>>> address the problem after it has occurred (i.e., incurred a system
>>>>> outage); we can't prevent it. IMHO, running a system/JVM monitoring
>>>>> tool which can monitor and alert after memory usage exceeds some
>>>>> percentage is the better solution to this problem.
>>>>>
>>>>> 2) Thread dumps are mostly related to slow responses (sometimes no
>>>>> response) from the server, and I'm not sure how we can get these
>>>>> details from the logs. We also need to handle the logs
>>>>> intelligently: a request timeout doesn't necessarily mean we need
>>>>> to take a thread dump; it could simply be that some backend service
>>>>> is down.
>>>>
>>>> +1 for 2). The tool reads all errors, but before analyzing an error
>>>> it validates whether the captured error log line is good enough to
>>>> do diagnostics on.
>>>>
>>>> For question 1, let me explain an error scenario.
>>>>
>>>> Error - OOM error:
>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>
>>>> The WSO2 IS server takes a heap dump because the general error type
>>>> is OOM. But we need thread dumps along with the heap dump to resolve
>>>> this error. The tool reads the error line, finds the suitable
>>>> diagnostics while analyzing it, and finally performs those
>>>> diagnostics.
>>>>
>>>> For common OOM scenarios, a heap dump is enough. But for exceptional
>>>> scenarios like the above, the error cannot be resolved with a heap
>>>> dump alone. So the tool reads the error log line and, beyond the
>>>> memory dump, performs further diagnostics such as lsof or a thread
>>>> dump.
>>>>
>>>> Finally, the end user can get all the required diagnostics *at once*.
>>>>
>>>>> 3) We have carbon-dump.sh, which can dump all of the thread dump,
>>>>> heap dump, and relevant details about the server.
>>>>> Can't we use that for this purpose?
>>>>
>>>> +1 for 3).
>>>>
>>>> Thank you,
>>>>
>>>> M. Thumilan
>>>>
>>>> On Thu, Sep 6, 2018 at 4:15 PM Sinthuja Rajendran <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a few questions/concerns, as stated below.
>>>>>
>>>>> 1) In our WSO2 servers' startup script, we do have the Java props
>>>>> below [1], which can create a heap dump when the server has gone
>>>>> OOM. Therefore, I believe here you are trying to solve the problem
>>>>> that the server continues to run although there is an OOM. IMHO
>>>>> logs are not a suitable mechanism to find out whether the system
>>>>> has gone OOM, because we can't reliably produce every kind of log
>>>>> for an OOM error. Also, with the proposed method we can only
>>>>> address the problem after it has occurred (i.e., incurred a system
>>>>> outage); we can't prevent it. IMHO, running a system/JVM monitoring
>>>>> tool which can monitor and alert after memory usage exceeds some
>>>>> percentage is the better solution to this problem.
>>>>>
>>>>> 2) Thread dumps are mostly related to slow responses (sometimes no
>>>>> response) from the server, and I'm not sure how we can get these
>>>>> details from the logs. We also need to handle the logs
>>>>> intelligently: a request timeout doesn't necessarily mean we need
>>>>> to take a thread dump; it could simply be that some backend service
>>>>> is down.
>>>>>
>>>>> 3) We have carbon-dump.sh, which can dump all of the thread dump,
>>>>> heap dump, and relevant details about the server. Can't we use that
>>>>> for this purpose?
>>>>>
>>>>> [1] -XX:+HeapDumpOnOutOfMemoryError \
>>>>>     -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>>>>>
>>>>> Thanks,
>>>>> Sinthuja.
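The kind of external monitor Sinthuja describes in (1) — alerting on memory pressure before an OOM rather than reacting to an OOM log line — could be sketched by polling old-generation occupancy with `jstat -gcutil`. This is a hedged sketch, not the tool: the column order is assumed from the JDK 8 `jstat` output, and the function names are invented here:

```python
# Hedged sketch: alert on old-gen occupancy *before* OOM instead of after it.
# Assumes JDK 8 `jstat -gcutil` columns: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT.
import subprocess

def old_gen_pct(jstat_data_row):
    """Extract old-generation occupancy (%) -- the 4th column, 'O'."""
    return float(jstat_data_row.split()[3])

def heap_pressure(pid, threshold=85.0):
    """Return True when old-gen occupancy crosses the threshold; the caller
    can then invoke carbon-dump.sh while the JVM is still responsive."""
    out = subprocess.run(["jstat", "-gcutil", str(pid)],
                         capture_output=True, text=True).stdout
    data_row = out.splitlines()[1]  # line 0 is the column header
    return old_gen_pct(data_row) >= threshold
```

A watchdog loop calling `heap_pressure()` every few seconds would give the early warning that the `HeapDumpOnOutOfMemoryError` flag alone cannot, since that flag only fires once the OOM has already happened.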
>>>>>
>>>>> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> *Problem*
>>>>>>
>>>>>> Whenever an error occurs, certain diagnostic actions (depending on
>>>>>> the error) can help to diagnose it.
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> - If an OOM (Out Of Memory) error occurred, a heap dump will help
>>>>>>   to analyse the memory leak.
>>>>>> - If some threads are blocked abnormally, analyzing a thread dump
>>>>>>   could help to solve the problem.
>>>>>>
>>>>>> But in a real scenario, performing these diagnostic actions
>>>>>> manually may not be possible, because:
>>>>>>
>>>>>> - It is impossible to predict when the error will occur.
>>>>>> - Diagnostic actions vary depending on the error; expecting the
>>>>>>   user to know about every error scenario is impossible.
>>>>>> - The user may prefer to get help from the support team instead of
>>>>>>   solving the error himself/herself.
>>>>>>
>>>>>> *Solution*
>>>>>>
>>>>>> Design a standalone tool with a small memory footprint (<8%) and
>>>>>> low CPU usage (<8%), with the following workflow:
>>>>>>
>>>>>> - A Log Tailer tails the carbon.log file in real time.
>>>>>> - A Match Rule Engine checks whether the current log line matches
>>>>>>   an error regex.
>>>>>> - The tool reads the error regexes from a separate XML file.
>>>>>> - An Interpreter identifies the error type and performs the
>>>>>>   actions for that error.
>>>>>> - Each action is handled by a separate action executor.
>>>>>> - The mapping between errors and actions is defined in a separate
>>>>>>   XML file.
>>>>>> - All the diagnostic files (e.g., thread dumps and heap dumps) for
>>>>>>   a particular error are created under one folder, and the folder
>>>>>>   is zipped.
>>>>>> - Each folder can be identified by a time instance.
>>>>>>
>>>>>> *Architecture Diagram*
>>>>>> [image: ArchitectureDiagram.png]
>>>>>>
>>>>>> *Sample Scenario*
>>>>>>
>>>>>> Assume a client reports an issue about an OOM error. He usually
>>>>>> attaches the carbon.log file along with the issue. But in order to
>>>>>> solve the problem, the support team needs a thread dump and a heap
>>>>>> dump, so the team asks the client to take those dumps the next
>>>>>> time the error occurs. The client has to wait for the next
>>>>>> occurrence and take the dumps. (We can't expect the client to
>>>>>> watch the server all the time and take dumps when the error
>>>>>> occurs. What if the next error occurs at midnight?) The support
>>>>>> team has to wait for an update on the issue, so they put the issue
>>>>>> on pause and move on.
>>>>>>
>>>>>> Now consider the same scenario with this tool. Once the error
>>>>>> occurs, the tool takes the necessary diagnostic actions and zips
>>>>>> the folder. The client can upload that zip with the issue, so the
>>>>>> support team doesn't need the client to perform those diagnostic
>>>>>> actions himself. The support team is able to work on the issue
>>>>>> directly without waiting for updates from the client.
>>>>>>
>>>>>> The next time the error occurs (even at midnight), the tool can
>>>>>> detect it and send the necessary files directly to the support
>>>>>> team for further analysis.
>>>>>>
>>>>>> Since the tool's memory footprint is small, the client can run it
>>>>>> without any objection.
>>>>>>
>>>>>> The tool reduces the client's involvement in WSO2 IS errors so
>>>>>> that the client can focus on their business. It also helps reduce
>>>>>> the time needed to solve an issue, because the support team can
>>>>>> get all the necessary diagnostic files at once, in the initial
>>>>>> conversation.
>>>>>>
>>>>>> Please give feedback regarding this architecture.
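The Match Rule Engine and error-to-action mapping in the workflow above could look roughly like the sketch below. The XML schema, element names, and action names are invented for illustration and are not a specification of the proposed tool:

```python
# Hedged sketch of the proposed rule engine: error regexes and their mapped
# diagnostic actions live in an XML file; the first matching rule wins.
import re
import xml.etree.ElementTree as ET

RULES_XML = """
<rules>
  <rule action="heap_dump,thread_dump,lsof">java\\.lang\\.OutOfMemoryError: unable to create new native thread</rule>
  <rule action="heap_dump">java\\.lang\\.OutOfMemoryError</rule>
</rules>
"""

def load_rules(xml_text):
    """Return [(compiled_regex, [actions])] in file order."""
    return [(re.compile(rule.text.strip()), rule.get("action").split(","))
            for rule in ET.fromstring(xml_text).findall("rule")]

def actions_for(log_line, rules):
    """Specific rules are listed first, so 'unable to create new native
    thread' also triggers lsof and thread dumps, while a generic OOM
    only triggers a heap dump (matching Thumilan's scenario above)."""
    for pattern, actions in rules:
        if pattern.search(log_line):
            return actions
    return []
```

A log tailer would call `actions_for` on each new carbon.log line and hand the returned action list to the per-action executors, which then write their output into the timestamped incident folder before zipping it.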
>>>>>>
>>>>>> Best Regards,
>>>>>> M. Thumilan
>>>>>
>>>>> --
>>>>> *Sinthuja Rajendran*
>>>>> Senior Technical Lead
>>>>> WSO2, Inc.: http://wso2.com
>>>>> Blog: http://sinthu-rajan.blogspot.com/
>>>>> Mobile: +94774273955
>>>>
>>>> --
>>>> Best Regards,
>>>> M. Thumilan
>>
>> --
>> Best Regards,
>> *Shammi Jayasinghe*
>> *Senior Technical Lead*
>> *WSO2, Inc.*
>> *+1-812-391-7730*
>> *+1-812-327-3505*
>> *http://shammijayasinghe.blogspot.com*
>
> --
> Ruwan Linton
> Director - Delivery, WSO2; https://wso2.com
> Member, Apache Software Foundation; http://www.apache.org
> email: [email protected]; cell: +94 77 341 3097; phone: +94 11 2833 436
> linkedin: http://www.linkedin.com/in/ruwanlinton

--
Best Regards,
*Shammi Jayasinghe*
*Senior Technical Lead*
*WSO2, Inc.*
*+1-812-391-7730*
*+1-812-327-3505*
*http://shammijayasinghe.blogspot.com*
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
