Hi all,

Recently, in some real user cases, we found that when an OOM occurs in the
DataNode process, some threads (e.g. RPC listening threads) may exit
unexpectedly, which can lead to strange behavior. (We should of course
make sure OOM does not happen, but we all know bugs will always exist.)
For example, if the heartbeat listening thread on the DataNode exits
because of an OOM and memory pressure later eases on its own (some large
queries or compaction tasks finish), the thread is never restarted. The
DataNode then stays in unknown state, because the ConfigNode can no longer
contact it via heartbeat.
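
To illustrate the failure mode, here is a minimal Java sketch. It is not
IoTDB code: the class name, the thread name and the simulated error are all
made up for the example, and a real OOM would come from a failed allocation
rather than an explicit throw. It only shows that an uncaught
OutOfMemoryError ends the thread that hit it while the rest of the process
keeps running.

```java
public class HeartbeatThreadDemo {

    /**
     * Starts a stand-in "listener" thread that hits a simulated OOM and
     * reports whether that thread is still alive afterwards.
     */
    static boolean listenerAliveAfterSimulatedOom() {
        Thread listener = new Thread(() -> {
            // Hypothetical stand-in for a listening loop that runs out of memory.
            throw new OutOfMemoryError("simulated");
        }, "heartbeat-listener");

        // Log one line instead of the default stack trace.
        listener.setUncaughtExceptionHandler(
            (t, e) -> System.err.println(t.getName() + " died: " + e));

        listener.start();
        try {
            listener.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return listener.isAlive();
    }

    public static void main(String[] args) {
        boolean alive = listenerAliveAfterSimulatedOom();
        // The main thread is still running here, but the listener is gone for good.
        System.out.println("listener alive: " + alive);
    }
}
```

Nothing restarts the listener in this sketch, which mirrors the situation
described above: the process survives, but the thread does not come back.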

Therefore, we feel that OOM is a high-risk error, and we should let the
process exit immediately rather than keep running without some key threads.

I also ran an experiment and found that -XX:+ExitOnOutOfMemoryError and
-XX:+HeapDumpOnOutOfMemoryError do not conflict, which means we can keep
both in the JVM args: when an OOM happens, the JVM first dumps the heap
and then exits.
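
For anyone who wants to confirm that both options are actually active in a
running JVM, a small check along these lines might help. This is only a
sketch: the class and method names are hypothetical, and it only looks at
the arguments reported by the standard RuntimeMXBean, so options supplied
some other way may not show up.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class OomFlagCheck {
    static final String EXIT_FLAG = "-XX:+ExitOnOutOfMemoryError";
    static final String DUMP_FLAG = "-XX:+HeapDumpOnOutOfMemoryError";

    /** Returns true only if both OOM-related flags appear in the given JVM arguments. */
    static boolean hasBothFlags(List<String> jvmArgs) {
        return jvmArgs.contains(EXIT_FLAG) && jvmArgs.contains(DUMP_FLAG);
    }

    public static void main(String[] args) {
        // Input arguments passed to this JVM, e.g. the -XX options.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("both OOM flags set: " + hasBothFlags(jvmArgs));
    }
}
```

Running it with both options on the command line, e.g.
java -XX:+HeapDumpOnOutOfMemoryError -XX:+ExitOnOutOfMemoryError OomFlagCheck,
should report that both flags are set.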

I've made this change in my PR (https://github.com/apache/iotdb/pull/11531).

What do you think?




Best,
----------------------
Yuan Tian
