Re: [Architecture] [IS] On Board Diagnostics Tool for IS

Sinthuja Rajendran Thu, 06 Sep 2018 03:46:40 -0700

Hi,

I have a few questions/concerns, as stated below.


1)  Our WSO2 server startup scripts already set the Java properties below
[1], which make the JVM create a heap dump when the server goes OOM.
Therefore, I believe the problem you are trying to solve here is that the
server continues to run even after an OOM. IMHO, logs are not a suitable
mechanism for detecting that the system has gone OOM, because we can't
reliably produce a log entry for every kind of OOM error. Also, the proposed
method can only react after the problem has occurred (i.e., after a system
outage); it can't prevent it. IMHO, running a system/JVM monitoring tool
that raises an alert once memory usage exceeds some percentage is the better
solution to this problem.
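
To illustrate the kind of threshold check I mean in (1), here is a minimal
Java sketch using the standard java.lang.management API. The class name
HeapUsageCheck and the 80% threshold are only assumptions for illustration,
and this version inspects the JVM it runs in; an external monitoring tool
would read the same figures from the server over a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapUsageCheck {

    /** Returns used/max as a percentage, or -1 when the max is undefined. */
    public static double usedPercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() returns -1 when no max is defined
        }
        return 100.0 * usedBytes / maxBytes;
    }

    /** True when heap usage is known and at or above the threshold. */
    public static boolean shouldAlert(long usedBytes, long maxBytes,
                                      double thresholdPercent) {
        double pct = usedPercent(usedBytes, maxBytes);
        return pct >= 0 && pct >= thresholdPercent;
    }

    public static void main(String[] args) {
        double thresholdPercent = 80.0; // hypothetical alert threshold

        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        if (shouldAlert(heap.getUsed(), heap.getMax(), thresholdPercent)) {
            System.out.println("ALERT: heap usage above threshold");
        } else {
            System.out.println("Heap usage within limits");
        }
    }
}
```

In practice the check would run periodically and the alert would go to
whatever notification channel the deployment already uses.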

2) Thread dumps are mostly needed when the server responds slowly (or
sometimes not at all), and I'm not sure how we can detect those conditions
from the logs. We also need to handle the logs intelligently: a request
timeout alone doesn't mean we need to take a thread dump; it may simply be
that some backend service is down.
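
One possible way to avoid reacting to a single timeout, sketched below in
Java, is to trigger only when several matching log lines arrive within a
short window. The class name TimeoutTrigger, the regex, and the "3 matches
in 60 seconds" rule are all assumptions for illustration, not part of the
proposal, and a burst of timeouts can still be caused by a backend that is
down, so this only reduces false triggers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Pattern;

public class TimeoutTrigger {
    private final Pattern pattern;
    private final int minMatches;
    private final long windowMillis;
    // Timestamps of recent matching lines, oldest first.
    private final Deque<Long> matchTimes = new ArrayDeque<>();

    public TimeoutTrigger(String regex, int minMatches, long windowMillis) {
        this.pattern = Pattern.compile(regex);
        this.minMatches = minMatches;
        this.windowMillis = windowMillis;
    }

    /** Feeds one log line; returns true when enough matches fall in the window. */
    public boolean offer(String line, long nowMillis) {
        if (!pattern.matcher(line).find()) {
            return false;
        }
        matchTimes.addLast(nowMillis);
        // Drop matches that are older than the window.
        while (!matchTimes.isEmpty()
                && nowMillis - matchTimes.peekFirst() > windowMillis) {
            matchTimes.removeFirst();
        }
        if (matchTimes.size() >= minMatches) {
            matchTimes.clear(); // start counting afresh after a trigger
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rule: 3 timeout lines within 60 seconds.
        TimeoutTrigger trigger = new TimeoutTrigger("Read timed out", 3, 60_000);
        String line = "ERROR - Read timed out"; // made-up log line
        long now = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            if (trigger.offer(line, now + i * 1000L)) {
                System.out.println("Threshold reached; take a thread dump");
            }
        }
    }
}
```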

3) We have carbon-dump.sh, which can collect the thread dump, the heap dump,
and other relevant details about the server. Can't we use that for this
purpose?

[1] -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath="$RUNTIME_HOME/logs/heap-dump.hprof" \

Thanks,
Sinthuja.

On Thu, Sep 6, 2018 at 3:25 PM Thumilan Mikunthan <[email protected]> wrote:

> Hi all,
>
> *Problem*
>
> Whenever an error occurs, certain diagnostic actions (depending on the
> error) can help to diagnose it.
>
> For example,
>
>    - If an OOM (Out Of Memory) error occurs, a heap dump will help to
>      analyse the memory leak.
>    - If some threads are blocked abnormally, analyzing a thread dump could
>      help to solve the problem.
>
> But in a real scenario, performing these diagnostic actions manually may
> not be possible because
>
>    - It is impossible to predict when the error will occur.
>    - The diagnostic actions vary depending on the error, and expecting the
>      user to know about all error scenarios is unrealistic.
>    - The user may prefer to get help from the support team instead of
>      solving the error himself/herself.
>
>
> *Solution*
>
> Design a standalone tool with a small memory footprint (<8%) and low CPU
> usage (<8%) that has the following workflow.
>
>    - Log Tailer tails the carbon.log file in real time.
>    - Match Rule Engine checks whether the current log line matches an
>      error regex.
>       - The tool has to read the error regexes from a separate XML file.
>    - Interpreter identifies the error type and performs the actions for
>      that error.
>       - Each action should be handled by a separate action executor.
>       - The mapping between errors and actions should be written in a
>         separate XML file.
>    - All the diagnostic files (e.g., thread dumps and heap dumps) for a
>      particular error should be created under one folder, which is then
>      zipped.
>       - Each folder can be identified by its timestamp.
>
>
> *Architecture Diagram*
> [image: ArchitectureDiagram.png]
>
> *Sample Scenario*
>
> Assume that a client reports an issue about an OOM error. He usually
> attaches the carbon.log file to the issue. But in order to solve the
> problem, the support team needs a thread dump and a heap dump. So the team
> asks the client to take those dumps the next time the error occurs. The
> client has to wait for the next occurrence and take those dumps. (We can’t
> expect the client to watch the server all the time and take dumps when the
> error occurs. What if the next error occurs at midnight?) The support team
> has to wait for an update on that issue, so they put the issue on hold and
> move on.
>
> Now consider the above scenario with this tool. Once the error occurs,
> the tool takes the necessary diagnostic actions and zips the folder. The
> client can upload that zip file with the issue, so the support team
> doesn’t need the client to perform those diagnostic actions himself. The
> support team is able to work on the issue directly without waiting for any
> updates from the client.
>
> The next time the error occurs (even at midnight), the tool can detect it
> and send the necessary files to the support team directly for further
> analysis.
>
> Since the tool’s memory footprint is small, the client can run the tool
> without any objection.
>
> The tool reduces the client’s involvement in WSO2 IS errors so that the
> client can focus on their business. The tool also helps to reduce the time
> needed to solve the issue, because the support team can get all the
> necessary diagnostic files at once in the initial conversation.
>
> Please give feedback regarding this architecture.
>
> Best Regards,
> M.Thumilan
>


-- 
*Sinthuja Rajendran*
Senior Technical Lead
WSO2, Inc.:http://wso2.com

Blog: http://sinthu-rajan.blogspot.com/
Mobile: +94774273955

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
