Re: [Architecture] [IS] On Board Diagnostics Tool for IS

Asela Pathberiya Fri, 07 Sep 2018 06:59:06 -0700

On Thu, Sep 6, 2018 at 4:15 PM, Sinthuja Rajendran <[email protected]>
wrote:


> Hi,
>
> I have a few questions/concerns, as stated below.
>
> 1)  Our WSO2 server startup scripts already set the Java properties in [1],
> which create a heap dump when the server goes OOM.
> Therefore, I believe the problem you are trying to solve here is that the
> server continues to run even though there is an OOM. IMHO logs are not a
> suitable mechanism for finding out whether the system has gone OOM, because
> we can't be certain that every kind of OOM error is logged. Also, with the
> proposed method we can only address the problem after it has occurred (i.e.
> after a system outage); we can't prevent it. IMHO, running a system/JVM
> monitoring tool that monitors memory usage and alerts once it exceeds some
> percentage is the better solution to this problem.
>
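
A minimal sketch of the percentage-based check described in point 1, in
Java. It assumes the check runs inside the monitored JVM; an external
monitor would read the same MemoryMXBean over remote JMX instead. The class
HeapUsageCheck, its method names, and the 0.8 threshold in the usage note
are hypothetical and not part of any WSO2 tooling.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapUsageCheck {

    /** Returns used/max as a fraction, or -1 when max is undefined (<= 0). */
    static double usedFraction(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return (double) used / (double) max;
    }

    /** True when the used fraction is defined and has reached the threshold. */
    static boolean exceeds(long used, long max, double threshold) {
        double fraction = usedFraction(used, max);
        return fraction >= 0 && fraction >= threshold;
    }

    /** Applies the check to the current JVM's heap via the standard MemoryMXBean. */
    static boolean heapExceeds(double threshold) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the maximum heap size is undefined.
        return exceeds(heap.getUsed(), heap.getMax(), threshold);
    }
}
```

A scheduler could call HeapUsageCheck.heapExceeds(0.8) periodically and
raise an alert before the JVM reaches an OOM, rather than reacting to a log
line after the fact.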

> 2) Thread dumps are mostly related to slow responses (sometimes no
> response) from the server, and I'm not sure how we can get these details
> from the logs. We would also need to handle the logs intelligently: a
> request timeout alone doesn't mean that we need to take a thread
> dump; it may simply be that some backend service is down.
>
> 3) We have carbon-dump.sh, which can collect thread dumps, heap dumps, and
> other relevant details about the server. Can't we use that for this purpose?
>

Yes! Looking at the logs to decide when to take heap/thread dumps would not
help much.

Are there any other dumps or data that you are hoping to zip?

Also, how is this specific to IS? Is there any special diagnosis that
you are hoping to do for IS? If so, what is it?

Thanks,
Asela.


>
> [1] -XX:+HeapDumpOnOutOfMemoryError \
>     -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \
>
> Thanks,
> Sinthuja.
>
> On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]>
> wrote:
>
>> Hi all,
>>
>> *Problem*
>>
>> Whenever an error occurs, certain diagnostic actions (depending on the
>> error) can help to diagnose it.
>>
>> For example,
>>
>>    - If an OOM (Out Of Memory) error occurs, a heap dump will help to
>>      analyse the memory leak.
>>    - If some threads are blocked abnormally, analysing a thread dump
>>      could help to solve the problem.
>>
>> But in a real scenario, doing these diagnostic actions manually may not
>> be possible because
>>
>>    - It is impossible to predict when the error will occur.
>>    - Diagnostic actions vary depending on the error, and expecting the
>>      user to know about all error scenarios is unrealistic.
>>    - The user may prefer to get help from the support team instead of
>>      solving the error himself/herself.
>>
>>
>> *Solution*
>>
>> Design a standalone tool with a small memory footprint (<8%) and low
>> CPU usage (<8%) that has the following workflow.
>>
>>    - Log Tailer tails the carbon.log file in real time.
>>    - Match Rule Engine checks whether the current log line matches an
>>      error regex.
>>       - The tool reads the error regexes from a separate XML file.
>>    - Interpreter identifies the error type and performs the actions for
>>      that error.
>>       - Each action should be handled by a separate action executor.
>>       - The mapping between errors and actions should be written in a
>>         separate XML file.
>>    - All the diagnostic files (e.g. thread dumps and heap dumps) for a
>>      particular error should be created under one folder, and the folder
>>      should then be zipped.
>>       - Each folder can be identified by its timestamp.
>>
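
The Match Rule Engine and Interpreter steps in the workflow above could be
sketched roughly as follows in Java. This is only an illustration under
assumptions: rules are registered in code rather than read from the XML
files, and the class name MatchRuleSketch, the error type "OOM", and action
names such as "HEAP_DUMP" are hypothetical placeholders, not part of the
proposed tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class MatchRuleSketch {

    // Error type -> compiled regex (stands in for the error-regex XML file).
    private final Map<String, Pattern> errorPatterns = new LinkedHashMap<>();
    // Error type -> action names (stands in for the error/action mapping XML file).
    private final Map<String, List<String>> actionsByError = new LinkedHashMap<>();

    /** Registers one error type with its regex and the actions to run for it. */
    void addRule(String errorType, String regex, List<String> actions) {
        errorPatterns.put(errorType, Pattern.compile(regex));
        actionsByError.put(errorType, new ArrayList<>(actions));
    }

    /** Returns the actions of the first rule whose regex is found in the line. */
    List<String> actionsFor(String logLine) {
        for (Map.Entry<String, Pattern> rule : errorPatterns.entrySet()) {
            if (rule.getValue().matcher(logLine).find()) {
                return actionsByError.get(rule.getKey());
            }
        }
        // No rule matched: nothing to do for this line.
        return Collections.emptyList();
    }
}
```

Each returned action name would then be dispatched to its own action
executor, with the output written under the folder for that error.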
>>
>> *Architecture Diagram*
>> [image: ArchitectureDiagram.png]
>>
>> *Sample Scenario*
>>
>> Assume that a client reports an issue about an OOM error. He usually
>> attaches the carbon.log file to the issue. But in order to solve the
>> problem, the support team needs a thread dump and a heap dump. So the team
>> asks the client to take those dumps the next time the error occurs. The
>> client has to wait for the next occurrence and take those dumps. (We can’t
>> expect the client to watch the server all the time and take dumps when the
>> error occurs. What if the next error occurs at midnight?) The support
>> team has to wait for an update on the issue, so they put the issue on
>> hold and move on.
>>
>> Now consider the above scenario with this tool. Once the error
>> occurs, the tool takes the necessary diagnostic actions and zips the
>> folder. The client can upload that zip file with the issue, so the
>> support team doesn’t need the client to perform those diagnostic actions
>> himself. The support team is able to work on the issue directly without
>> waiting for any updates from the client.
>>
>> The next time the error occurs (even at midnight), the tool can detect it
>> and send the necessary files to the support team directly for further
>> analysis.
>>
>> Since the tool’s memory footprint is small, the client can run the tool
>> without any objection.
>>
>> The tool reduces the client’s involvement in WSO2 IS errors so that the
>> client can focus on their business. The tool also helps to reduce the time
>> needed to solve the issue, because the support team can get all the
>> necessary diagnostic files at once, in the initial conversation.
>>
>> Please give feedback regarding this architecture.
>>
>> Best Regards,
>> M.Thumilan
>>
>
>
> --
> *Sinthuja Rajendran*
> Senior Technical Lead
> WSO2, Inc.:http://wso2.com
>
> Blog: http://sinthu-rajan.blogspot.com/
> Mobile: +94774273955
>
>
>


-- 
Thanks & Regards,
Asela

Mobile : +94 777 625 933

http://soasecurity.org/
http://xacmlinfo.org/

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
