We'll go with the demo tonight and let you know our thoughts on this.

Thanks
shammi

On Tue, Oct 16, 2018 at 8:56 PM Ruwan Linton <[email protected]> wrote:

> Sorry for the late reply, I missed this thread :-) I have seen a demo of
> this tool and it looks interesting.
>
> Shammi, I think we need to separate out the two concerns of action
> execution and triggering. Provided that we can get this tool to
> effectively analyze the logs and take the necessary actions without any
> lag, the triggers are really a question of improving the logs to make sure
> we have the right ones.
>
> I like the fact that this tool can run external to the VM that is running
> the product, and that it keeps doing one thing, for the sake of stability.
> (We cannot get into a situation where the tool crashes before the product
> does :-), so it should do the absolute minimum in triggering and actions.)
>
> So I would say we need some metrics or analytics task reporting these
> events (CPU utilization reaching 80%, thread count going beyond a
> threshold, response time going outside the agreed threshold/SLA, etc.) as
> logs to a log file, with the tool monitoring that file.
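A minimal sketch of how such metric log lines could be matched (illustrative only; the log line format, the `breached_threshold` helper, and the threshold values are assumptions, not an agreed design):

```python
import re

# Assumed metric log format, e.g.: "2018-10-16 20:56:01 METRIC cpu_utilization 83.5"
METRIC_LINE = re.compile(r"METRIC (?P<name>\w+) (?P<value>[\d.]+)")

# Hypothetical thresholds mirroring the examples in the mail
THRESHOLDS = {
    "cpu_utilization": 80.0,
    "thread_count": 500.0,
    "response_time_ms": 2000.0,
}

def breached_threshold(line):
    """Return (metric, value) if the line reports a metric above its threshold,
    else None. The tool would tail the log file and run each new line through this."""
    m = METRIC_LINE.search(line)
    if not m:
        return None
    name, value = m.group("name"), float(m.group("value"))
    if name in THRESHOLDS and value > THRESHOLDS[name]:
        return name, value
    return None
```

This keeps the tool itself trivial, as Ruwan suggests: the product-side analytics decide what is worth logging, and the tool only pattern-matches.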
>
> Just my 2 cents.
>
> To me this is a simple, practical tool that we can use to support the
> products; I think we should keep that simplicity rather than making it
> complex.
>
> Ruwan
>
> On Mon, Oct 8, 2018 at 11:16 PM Shammi Jayasinghe <[email protected]> wrote:
>
>> Hi Thumilan,
>>
>> Have we already implemented this, or what is the current status?
>>
>>
>> In a case like "unable to create new native thread", most of the time
>> this is due to the system's open-files limit. So, while tailing and
>> parsing the logs through the analyzer, we can program it to check the
>> open-files limits in the OS.
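As an illustration, a check like this could be bolted onto the analyzer (a sketch only; it assumes a Linux host with /proc, and the function names and 90% threshold are made up for the example):

```python
import os

def open_file_usage(pid):
    """Return (open_fds, soft_limit) for a process, via /proc (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    soft_limit = None
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: "Max open files  <soft>  <hard>  files"
                soft_limit = int(line.split()[3])
    return open_fds, soft_limit

def near_fd_limit(pid, threshold=0.9):
    """True if the process is close to its open-files limit."""
    used, limit = open_file_usage(pid)
    return limit is not None and used >= threshold * limit
```

When the "unable to create new native thread" pattern fires, the tool could record this alongside the output of `lsof` to confirm or rule out the file-descriptor cause.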
>>
>> However, as everybody pointed out above, I would suggest modifying the
>> trigger algorithm so that:
>> - It monitors memory usage
>> - It monitors CPU usage
>> - It monitors logs
>>
>> *Memory usage:* When heap usage exceeds 85% (or any reasonable threshold
>> value), the tool needs to capture a heap dump automatically. With the
>> *HeapDumpOnOutOfMemoryError* property, the dump is only captured once the
>> JVM has started to throw OOM errors. At that point, there is a chance
>> that it cannot perform this action due to resource limitations on the
>> server, or because the process is not responding at all. So, if we can
>> capture the heap dump when usage exceeds a given threshold value, that
>> would be ideal.
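To illustrate, the threshold check could be driven off `jstat -gc` output (a sketch under assumptions: the 85% figure, the focus on the old generation, and the `jmap` invocation are illustrative, not the tool's actual design):

```python
import subprocess

HEAP_THRESHOLD = 0.85  # example threshold from the mail; should be configurable

def old_gen_usage(jstat_gc_line):
    """Parse one data line of `jstat -gc <pid>` output and return OU/OC,
    i.e. old-generation used over old-generation capacity (columns 8 and 7)."""
    cols = [float(c) for c in jstat_gc_line.split()]
    oc, ou = cols[6], cols[7]
    return ou / oc

def maybe_heap_dump(pid, jstat_gc_line):
    """Trigger a heap dump with jmap (part of the JDK) once usage crosses
    the threshold, rather than waiting for -XX:+HeapDumpOnOutOfMemoryError."""
    if old_gen_usage(jstat_gc_line) >= HEAP_THRESHOLD:
        subprocess.run(
            ["jmap", f"-dump:live,format=b,file=heap-{pid}.hprof", str(pid)],
            check=False,
        )
```

Note that this proactively captures the dump while the JVM is still responsive, which is the whole point Shammi makes above.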
>>
>>
>> *CPU usage:* When the CPU usage of the WSO2 server's Java process exceeds
>> a given threshold continuously for 5 minutes or more (a configurable time
>> period and maximum threshold value), we need to capture both of the
>> following, with a gap of about 1 minute between captures:
>> - Thread usage
>> - Thread dump
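As a sketch of the two pieces involved (the sampling source and the `jstack` wrapper are assumptions; only the sustained-threshold logic is concrete here):

```python
import subprocess
import time

def sustained_above(samples, threshold, window):
    """True if the last `window` CPU samples all exceed `threshold`.
    With one sample per minute, window=5 corresponds to the 5-minute rule."""
    return len(samples) >= window and all(s > threshold for s in samples[-window:])

def capture_thread_dumps(pid, count=3, gap_seconds=60):
    """Take `count` thread dumps roughly a minute apart using jstack
    (part of the JDK), so the dumps show how threads evolve over time."""
    for i in range(count):
        dump = subprocess.run(["jstack", str(pid)], capture_output=True, text=True)
        with open(f"thread-dump-{pid}-{i}.txt", "w") as f:
            f.write(dump.stdout)
        if i < count - 1:
            time.sleep(gap_seconds)
```

Taking several dumps at a fixed gap, rather than one, also addresses Chamila's point below that a single thread dump is not a complete picture.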
>>
>> *Log monitoring:*
>> According to our support experience, IS throws most of its OOM exceptions
>> due to user store related activities, such as user creation/loading
>> scenarios. If you can extract such exceptions and use them in the trigger
>> algorithms for capturing the system information, that would be ideal.
>>
>>
>>
>> Thanks
>> shammi
>>
>> On Mon, Oct 8, 2018 at 6:08 AM, Chamila De Alwis <[email protected]>
>> wrote:
>>
>>> IMO thread dump(s) are a necessity for almost all OOM stories.
>>> Furthermore, just one thread dump is not a complete look at the thread
>>> view of the system. There should be multiple thread dumps taken at a
>>> predefined interval, to get an understanding of how the internals
>>> behaved before and during such an error.
>>>
>>> There is an issue here, though. We wouldn't have knowledge of an
>>> impending error scenario in time to take multiple thread dumps. So one
>>> of the options is to keep taking continuous thread dumps of a suspected
>>> system. Taking a thread dump usually consumes a really low amount of CPU
>>> time, so we might want to look into that option.
>>>
>>> On the other hand, I'm not sure automated heap dumps would be an ideal
>>> step during a service degradation/downtime. Taking a heap dump is a
>>> resource-hungry operation that sometimes takes multiple minutes. If the
>>> resources are in an already taxed state, this could very well result in
>>> a dead system.
>>>
>>> Additionally, the standard approach for a feedback cycle like this
>>> (error -> trigger -> basic diagnostics) is to run it *outside* the
>>> system, i.e. a tool that sits outside the (say) IS cluster. That tool
>>> would also feed back into a state machine (an autoscaling system or a
>>> node count maintainer) that spawns new healthy instances while the
>>> diagnostics are happening on the erroneous node (e.g. a system designed
>>> around CloudWatch Alarms). Though I'm not sure if we want to consider
>>> such a wide scope here.
>>>
>>> All in all, the advantages I see from this tool are,
>>> 1. The ability to specify <product>-specific stories as triggers
>>> 2. WSO2-specific diagnostic collection
>>>
>>> Are these the only goals in mind?
>>>
>>> Furthermore, have we looked into existing tools that match these
>>> requirements? If so, what tools did we evaluate?
>>>
>>>
>>> Regards,
>>> Chamila de Alwis
>>> Committer and PMC Member - Apache Stratos
>>> Associate Technical Lead | WSO2
>>> +94 77 220 7163
>>> Blog: https://medium.com/@chamilad
>>>
>>>
>>>
>>>
>>> On Thu, Oct 4, 2018 at 2:07 PM Thumilan Mikunthan <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>> IMHO
>>>>
>>>>> 1) In our WSO2 server startup scripts, we do have the Java props below
>>>>> [1], which basically create a heap dump when the server has gone OOM.
>>>>> Therefore, I believe here you are trying to solve the problem that the
>>>>> server continues to run although there is an OOM. IMHO logs are not a
>>>>> suitable mechanism for finding out whether the system has gone OOM,
>>>>> because we can't reliably produce every kind of log for an OOM error.
>>>>> Also, with the proposed method we can only react to the problem after
>>>>> it has occurred (i.e., incurred a system outage); we can't prevent it.
>>>>> IMHO, running a system/JVM monitoring tool which can monitor and alert
>>>>> once memory usage exceeds some percentage is the better solution to
>>>>> this problem.
>>>>>
>>>>
>>>>    2) Thread dumps are mostly related to slow responses (sometimes no
>>>> response) from the server, and I'm not sure how we can get these
>>>> details from the logs. We also need to handle the logs intelligently:
>>>> just because some request timed out doesn't mean that we need to take a
>>>> thread dump; it can simply be that some backend service is down.
>>>>
>>>>
>>>> +1 for 2). The tool reads all errors, but before analyzing an error it
>>>> validates whether the captured error log line is good enough to perform
>>>> diagnostics on.
>>>>
>>>> For question 1, let me explain an error scenario:
>>>>
>>>> Error: OOM error: java.lang.OutOfMemoryError: unable to create new
>>>> native thread.
>>>>
>>>> The WSO2 IS server takes a heap dump because the general error type is
>>>> OOM. But we need thread dumps along with the heap dump to resolve this
>>>> error. The tool reads the error line, is able to work out the suitable
>>>> diagnostics while analyzing that line, and finally performs them.
>>>>
>>>> For common OOM scenarios, taking a heap dump is enough. But in
>>>> exceptional scenarios like the above, the error cannot be resolved with
>>>> the heap dump alone. So the tool reads the error log line and, beyond
>>>> the memory dump, performs further diagnostics such as lsof or a thread
>>>> dump.
>>>>
>>>> Finally, the end user can get all the required diagnostics *at once*.
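The dispatch Thumilan describes could look roughly like this (illustrative only; the patterns and action names are made up for the example, and the proposal keeps the real error-to-action mapping in an XML file):

```python
import re

# Hypothetical error-pattern -> diagnostics mapping, most specific first.
# In the proposed tool this table would be loaded from XML.
ERROR_ACTIONS = [
    (re.compile(r"OutOfMemoryError: unable to create new native thread"),
     ["thread_dump", "lsof"]),
    (re.compile(r"OutOfMemoryError"),
     ["heap_dump"]),
]

def actions_for(log_line):
    """Return the diagnostic actions for the first matching pattern,
    so specific scenarios override the generic OOM handling."""
    for pattern, actions in ERROR_ACTIONS:
        if pattern.search(log_line):
            return actions
    return []
```

Ordering the table from specific to generic is what lets the "unable to create new native thread" case trigger lsof and thread dumps instead of the default heap dump.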
>>>>
>>>> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump,
>>>> and other relevant details about the server. Can't we use that for this
>>>> purpose?
>>>>
>>>> +1 for 3).
>>>>
>>>> Thank You,
>>>>
>>>> M.Thumilan
>>>>
>>>>
>>>> On Thu, Sep 6, 2018 at 4:15 PM Sinthuja Rajendran <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a few questions/concerns, as stated below.
>>>>>
>>>>> 1) In our WSO2 server startup scripts, we do have the Java props below
>>>>> [1], which basically create a heap dump when the server has gone OOM.
>>>>> Therefore, I believe here you are trying to solve the problem that the
>>>>> server continues to run although there is an OOM. IMHO logs are not a
>>>>> suitable mechanism for finding out whether the system has gone OOM,
>>>>> because we can't reliably produce every kind of log for an OOM error.
>>>>> Also, with the proposed method we can only react to the problem after
>>>>> it has occurred (i.e., incurred a system outage); we can't prevent it.
>>>>> IMHO, running a system/JVM monitoring tool which can monitor and alert
>>>>> once memory usage exceeds some percentage is the better solution to
>>>>> this problem.
>>>>>
>>>>> 2) Thread dumps are mostly related to slow responses (sometimes no
>>>>> response) from the server, and I'm not sure how we can get these
>>>>> details from the logs. We also need to handle the logs intelligently:
>>>>> just because some request timed out doesn't mean that we need to take
>>>>> a thread dump; it can simply be that some backend service is down.
>>>>>
>>>>> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump,
>>>>> and other relevant details about the server. Can't we use that for
>>>>> this purpose?
>>>>>
>>>>> [1] -XX:+HeapDumpOnOutOfMemoryError \
>>>>>     -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>>>>>
>>>>> Thanks,
>>>>> Sinthuja.
>>>>>
>>>>> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> *Problem*
>>>>>>
>>>>>> Whenever an error occurs, certain diagnostic actions (depending on
>>>>>> the error) can help to diagnose it.
>>>>>>
>>>>>> For example,
>>>>>>
>>>>>>    - If an OOM (Out Of Memory) error occurs, a heap dump will help to
>>>>>>    analyse the memory leak.
>>>>>>    - If some threads are blocked abnormally, analyzing a thread dump
>>>>>>    could help to solve the problem.
>>>>>>
>>>>>> But in a real scenario, performing these diagnostic actions manually
>>>>>> may not be possible, because:
>>>>>>
>>>>>>    - We are unable to predict when the error will occur.
>>>>>>    - The diagnostic actions vary depending on the error, and we
>>>>>>    cannot expect the user to be acquainted with every error scenario.
>>>>>>    - The user may prefer to get help from the support team instead of
>>>>>>    solving the error himself/herself.
>>>>>>
>>>>>>
>>>>>> *Solution*
>>>>>>
>>>>>> Design a standalone tool with a small memory footprint (<8%) and low
>>>>>> CPU usage (<8%), with the following workflow:
>>>>>>
>>>>>>    - The Log Tailer tails the carbon.log file in real time.
>>>>>>    - The Match Rule Engine checks whether the current log line
>>>>>>    matches an error regex.
>>>>>>       - The tool has to read the error regexes from a separate XML
>>>>>>       file.
>>>>>>    - The Interpreter identifies the error type and performs the
>>>>>>    actions for that error.
>>>>>>       - Each action should be handled by a separate action executor.
>>>>>>       - The mapping between errors and actions should be written in a
>>>>>>       separate XML file.
>>>>>>    - All the diagnostics files (e.g. thread dumps and heap dumps) for
>>>>>>    a particular error should be created under one folder, and that
>>>>>>    folder zipped.
>>>>>>       - Each folder can be identified by a timestamp.
>>>>>>
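A minimal sketch of the tailing and per-incident packaging steps of this workflow (illustrative only; the function names and folder layout are assumptions, not the tool's actual design):

```python
import datetime
import pathlib
import time
import zipfile

def tail(path):
    """Generator yielding new lines appended to `path` (a minimal tail -f)."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

def make_incident_folder(base="diagnostics"):
    """Create a per-incident folder named by timestamp, as the workflow suggests."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    folder = pathlib.Path(base) / stamp
    folder.mkdir(parents=True, exist_ok=True)
    return folder

def zip_folder(folder):
    """Zip all diagnostics for one incident into a single uploadable archive."""
    archive = folder.with_suffix(".zip")
    with zipfile.ZipFile(archive, "w") as zf:
        for f in folder.rglob("*"):
            zf.write(f, f.relative_to(folder))
    return archive
```

Each matched error would get its own timestamped folder, so the client ends up with exactly one zip to attach to the support issue.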
>>>>>>
>>>>>> *Architecture Diagram*
>>>>>> [image: ArchitectureDiagram.png]
>>>>>>
>>>>>> *Sample Scenario*
>>>>>>
>>>>>> Assume a client reports an issue about an OOM error. He usually
>>>>>> attaches the carbon.log file along with the issue. But in order to
>>>>>> solve the problem, the support team needs a thread dump and a heap
>>>>>> dump, so the team asks the client to take those dumps the next time
>>>>>> the error occurs. The client has to wait for the next occurrence and
>>>>>> take those dumps. (We can't expect the client to watch the server all
>>>>>> the time and get dumps when the error occurs; what if the next error
>>>>>> occurs at midnight?) The support team has to wait for an update on
>>>>>> that issue, so they put the issue on hold and move on.
>>>>>>
>>>>>> Now consider the above scenario with this tool. Once the error has
>>>>>> occurred, the tool takes the necessary diagnostic actions and zips
>>>>>> the folder. The client can upload that zip file with the issue, so
>>>>>> the support team doesn't need the client to perform those diagnostic
>>>>>> actions himself. The support team is able to work on the issue
>>>>>> directly, without waiting for any updates from the client.
>>>>>>
>>>>>> The next time the error occurs (even at midnight), the tool can
>>>>>> detect it and send the necessary files to the support team directly
>>>>>> for further analysis.
>>>>>>
>>>>>> Since the tool's memory footprint is small, the client can run it
>>>>>> without any objection.
>>>>>>
>>>>>> The tool reduces the client's involvement in WSO2 IS errors so that
>>>>>> the client can focus on their business. The tool also helps to reduce
>>>>>> the time needed to solve an issue, because the support team is able
>>>>>> to get all the necessary diagnostic files at once, in the initial
>>>>>> conversation.
>>>>>>
>>>>>> Please give your feedback on this architecture.
>>>>>>
>>>>>> Best Regards,
>>>>>> M.Thumilan
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Sinthuja Rajendran*
>>>>> Senior Technical Lead
>>>>> WSO2, Inc.:http://wso2.com
>>>>>
>>>>> Blog: http://sinthu-rajan.blogspot.com/
>>>>> Mobile: +94774273955
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> M.Thumilan
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>>
>> *  Shammi Jayasinghe*
>>
>>
>> *Senior Technical Lead*
>> *WSO2, Inc.*
>> *+1-812-391-7730*
>> *+1-812-327-3505*
>>
>> *http://shammijayasinghe.blogspot.com*
>>
>>
>
> --
> Ruwan Linton
> Director - Delivery, WSO2; https://wso2.com
> Member, Apache Software Foundation; http://www.apache.org
>
> email: [email protected]; cell: +94 77 341 3097; phone: +94 11 2833 436
> linkedin: http://www.linkedin.com/in/ruwanlinton
>


-- 
Best Regards,

*  Shammi Jayasinghe*


*Senior Technical Lead*
*WSO2, Inc.*
*+1-812-391-7730*
*+1-812-327-3505*

*http://shammijayasinghe.blogspot.com*
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
