Re: [Architecture] [IS] On Board Diagnostics Tool for IS

Thumilan Mikunthan Thu, 04 Oct 2018 01:38:21 -0700

Hi all,
My comments are inline below (IMHO).

> 1) In our WSO2 server startup scripts, we have the Java properties below
> [1], which create a heap dump when the server goes OOM. Therefore, I
> believe the problem you are trying to solve is that the server continues
> to run despite an OOM. IMHO, logs are not a suitable mechanism for
> detecting an OOM, because we can't be certain that every kind of OOM error
> produces a log entry. Also, with the proposed method we can only address
> the problem after it has occurred (i.e., after a system outage); we can't
> prevent it. IMHO, running a system/JVM monitoring tool that alerts once
> memory usage exceeds some percentage is a better solution to this problem.
>


> 2) Thread dumps are mostly relevant to slow responses (sometimes no
> response) from the server, and I'm not sure how we can get these details
> from the logs. We also need to handle the logs intelligently: a request
> timeout alone doesn't mean that we need to take a thread dump; it may
> simply be that some backend service is down.


+1 for 2). The tool reads all errors, but before analysing an error it
validates whether the captured error log line contains enough information
to run diagnostics.
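
A rough sketch of that validation step, in Java. The class name, the regexes,
and the assumed shape of a carbon.log line (a level token followed by a
message) are all hypothetical and only illustrate the idea: a line is acted on
only if it is an ERROR that names an exception, and a bare timeout is skipped
because a backend service may simply be down.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: not the tool's actual validation logic.
final class LogLineValidator {

    // Assumption: the log level appears as an upper-case token in the line.
    private static final Pattern ERROR_LEVEL = Pattern.compile("\\bERROR\\b");

    // Assumption: a useful line names a Java exception or error class.
    private static final Pattern EXCEPTION_NAME =
            Pattern.compile("\\b[\\w.]+(?:Exception|Error)\\b");

    // Assumption: timeouts alone are not enough to justify a thread dump.
    private static final Pattern TIMEOUT = Pattern.compile("(?i)timed? ?out");

    static boolean isDiagnosable(String logLine) {
        if (!ERROR_LEVEL.matcher(logLine).find()) {
            return false; // not an error line at all
        }
        if (!EXCEPTION_NAME.matcher(logLine).find()) {
            return false; // nothing to classify the error by
        }
        // Skip plain timeouts; everything else is passed on for analysis.
        return !TIMEOUT.matcher(logLine).find();
    }

    public static void main(String[] args) {
        System.out.println(isDiagnosable(
                "ERROR - java.lang.OutOfMemoryError: unable to create new native thread"));
        System.out.println(isDiagnosable(
                "ERROR - java.net.SocketTimeoutException: Read timed out"));
    }
}
```

With these assumed rules, the native-thread OOM line is accepted and the bare
read timeout is not.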

For question 1, let me explain an error scenario.

    Error: java.lang.OutOfMemoryError: unable to create new native thread

The WSO2 IS server takes a heap dump because the general error type is OOM.
But to resolve this error we need thread dumps along with the heap dump. The
tool reads the error line, works out the suitable diagnostics by analyzing
it, and finally runs those diagnostics.

For common OOM scenarios a heap dump is enough. But in exceptional scenarios
like the one above, the error cannot be resolved with a heap dump alone. So
the tool reads the error log line and, beyond the memory dump, performs
further diagnostics such as lsof or a thread dump.

Finally, the end user gets all the required diagnostics *at once*.
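
To make the scenario concrete, here is a minimal Java sketch of how an error
line could be mapped to diagnostic actions. The class, the enum, and the two
regexes are hypothetical; the proposal keeps regexes and mappings in XML
files, and they are hard-coded here only to keep the example short. The
heap dump is listed as an action, although it could equally be left to the
JVM flag mentioned in question 1.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: names and rules are illustrative only.
final class DiagnosticRules {

    enum Action { HEAP_DUMP, THREAD_DUMP, LSOF }

    // Ordered rules: the first pattern that matches the log line wins, so
    // the specific OOM message is listed before the generic OOM pattern.
    private static final Map<Pattern, List<Action>> RULES = new LinkedHashMap<>();

    static {
        RULES.put(
                Pattern.compile("java\\.lang\\.OutOfMemoryError: unable to create new native thread"),
                List.of(Action.HEAP_DUMP, Action.THREAD_DUMP, Action.LSOF));
        RULES.put(
                Pattern.compile("java\\.lang\\.OutOfMemoryError"),
                List.of(Action.HEAP_DUMP));
    }

    // Returns the actions for the first matching rule, or an empty list.
    static List<Action> actionsFor(String logLine) {
        for (Map.Entry<Pattern, List<Action>> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(logLine).find()) {
                return rule.getValue();
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(actionsFor(
                "ERROR - java.lang.OutOfMemoryError: unable to create new native thread"));
        System.out.println(actionsFor(
                "ERROR - java.lang.OutOfMemoryError: Java heap space"));
    }
}
```

The ordering matters: if the generic OOM rule came first, the native-thread
case would only ever get a heap dump.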

> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump, and
> other relevant details about the server. Can't we use that for this purpose?

+1 for 3).

Thank You,

M.Thumilan


On Thu, Sep 6, 2018 at 4:15 PM Sinthuja Rajendran <[email protected]> wrote:

> Hi,
>
> I have a few questions/concerns, as stated below.
>
> 1) In our WSO2 server startup scripts, we have the Java properties below
> [1], which create a heap dump when the server goes OOM. Therefore, I
> believe the problem you are trying to solve is that the server continues
> to run despite an OOM. IMHO, logs are not a suitable mechanism for
> detecting an OOM, because we can't be certain that every kind of OOM error
> produces a log entry. Also, with the proposed method we can only address
> the problem after it has occurred (i.e., after a system outage); we can't
> prevent it. IMHO, running a system/JVM monitoring tool that alerts once
> memory usage exceeds some percentage is a better solution to this problem.
>
> 2) Thread dumps are mostly relevant to slow responses (sometimes no
> response) from the server, and I'm not sure how we can get these details
> from the logs. We also need to handle the logs intelligently: a request
> timeout alone doesn't mean that we need to take a thread dump; it may
> simply be that some backend service is down.
>
> 3) We have carbon-dump.sh, which can dump the thread dump, heap dump, and
> other relevant details about the server. Can't we use that for this purpose?
>
> [1] -XX:+HeapDumpOnOutOfMemoryError \
>     -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>
> Thanks,
> Sinthuja.
>
> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]>
> wrote:
>
>> Hi all,
>>
>> *Problem*
>>
>> Whenever an error occurs, certain diagnostic actions (depending on the
>> error) can help to diagnose it.
>>
>> For example,
>>
>>    - If an OOM (Out Of Memory) error occurs, a heap dump will help to
>>      analyse the memory leak.
>>    - If some threads are blocked abnormally, analyzing a thread dump could
>>      help to solve the problem.
>>
>> But in a real scenario, performing these diagnostic actions manually may
>> not be possible because:
>>
>>    - It is impossible to predict when the error will occur.
>>    - Diagnostic actions vary depending on the error, and expecting the user
>>      to know about all error scenarios is unrealistic.
>>    - Users prefer to get help from the support team instead of solving the
>>      error themselves.
>>
>>
>> *Solution*
>>
>> Design a standalone tool with a small memory footprint (<8%) and low CPU
>> usage (<8%) that has the following workflow.
>>
>>    - Log Tailer tails the carbon.log file in real time.
>>    - Match Rule Engine checks whether the current log line matches an
>>      error regex.
>>       - The tool reads the error regexes from a separate XML file.
>>    - Interpreter identifies the error type and performs the actions for
>>      that error.
>>       - Each action should be handled by a separate action executor.
>>       - The mapping between errors and actions should be written in a
>>         separate XML file.
>>    - All the diagnostic files (e.g., thread dumps and heap dumps) for a
>>      particular error should be created under one folder, and the folder
>>      should be zipped.
>>       - Each folder can be identified by its timestamp.
>>
>>
>> *Architecture Diagram*
>> [image: ArchitectureDiagram.png]
>>
>> *Sample Scenario*
>>
>> Assume that a client reports an issue about an OOM error. The client
>> usually attaches the carbon.log file to the issue. But in order to solve
>> the problem, the support team needs a thread dump and a heap dump, so the
>> team asks the client to take those dumps the next time the error occurs.
>> The client has to wait for the next occurrence and take the dumps. (We
>> can’t expect the client to watch the server all the time and capture dumps
>> when the error occurs. What if the next error occurs at midnight?)
>> Meanwhile, the support team has to wait for an update on the issue, so
>> they put the issue on hold and move on.
>>
>> Now consider the same scenario with this tool. Once the error occurs, the
>> tool takes the necessary diagnostic actions and zips the folder. The
>> client can upload that zip file with the issue, so the support team
>> doesn’t need the client to perform those diagnostic actions. The support
>> team is able to work on the issue directly without waiting for any
>> updates from the client.
>>
>> The next time an error occurs (even at midnight), the tool can detect it
>> and send the necessary files to the support team directly for further
>> analysis.
>>
>> Since the tool’s memory footprint is small, the client can run the tool
>> without any objection.
>>
>> The tool reduces the client’s involvement in WSO2 IS errors so that the
>> client can focus on their business. The tool also helps to reduce the time
>> needed to solve an issue, because the support team can get all the
>> necessary diagnostic files at once in the initial conversation.
>>
>> Please give feedback regarding this architecture.
>>
>> Best Regards,
>> M.Thumilan
>>
>
>
> --
> *Sinthuja Rajendran*
> Senior Technical Lead
> WSO2, Inc.:http://wso2.com
>
> Blog: http://sinthu-rajan.blogspot.com/
> Mobile: +94774273955
>
>
>

-- 
Best Regards,
M.Thumilan

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
