On 2022/3/28 09:38, Corey Minyard wrote:
On Mon, Mar 28, 2022 at 12:47:41AM +0800, Chen Guanqiao wrote:
At present, we have found a scenario in which too many ipmi messages are sent
in a short period of time, and a large number of users and messages are blocked
in the ipmi modules. As a result, ipmi occupies a large amount of system memory
and ipmi communication always fails.
Frequent ipmi calls combined with hardware communication failures cause this
condition. ipmi has no way to detect or report the problem, so it cannot be
located or diagnosed online.
Hmm. So you have an application that just keeps sending IPMI messages
and not waiting for responses? I think the first order of business
would be to fix your applications to not do that.
Hi, Corey
Actually, the patch just provides a way to locate and observe this
problem online: it displays the number of users and messages. I haven't
fully thought through how to solve the problem gracefully. Cleaning up
the message queue is one method available to the administrator.
Because the module's memory consumption is counted as kernel memory, the
administrator usually does not know the state of ipmi and cannot tell
where the memory went. Only when they try to execute 'rmmod ipmi' do they
find out: oh, the memory is in ipmi.
The ipmi driver will eventually clean things out, but the timeouts are
pretty long. In the 5 second range per message.
However, as you say, there are no limits on users or messages, and that
is perhaps a problem. I mean, only root can send IPMI messages, and root
can do a lot more harm than that. But it's probably bad in principle.
Nobody has ever reported this problem before.
This happens when BMC communication on the device is abnormal, for
example when the hardware is blocked, while a monitoring program
repeatedly checks the BMC. The scenario is often seen with automated
monitoring tools.
Of course, the problem is somewhat rare: about one hundred out of ten
thousand machines, or 1%.
Anyway, a better solution for the kernel side of things, I think, would
be to add limits on the number of users and the number of messages per
user. That's more in line with what other kernel things do. I know of
nothing else in the kernel that does what you are proposing.
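
The per-user message limit suggested above can be sketched in userspace C as
follows. This is only an illustration under stated assumptions, not the ipmi
driver's code: `struct demo_user`, `demo_msg_get()`, `demo_msg_put()` and the
`max_msgs` cap are all hypothetical names, and C11 atomics stand in for the
kernel's own atomic helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-user state; NOT the real struct ipmi_user. */
struct demo_user {
	atomic_uint nr_msgs;	/* messages currently outstanding */
	unsigned int max_msgs;	/* assumed per-user cap */
};

void demo_user_init(struct demo_user *u, unsigned int max_msgs)
{
	atomic_init(&u->nr_msgs, 0);
	u->max_msgs = max_msgs;
}

/* Reserve a message slot; returns false when the user is at its cap. */
bool demo_msg_get(struct demo_user *u)
{
	unsigned int old = atomic_fetch_add(&u->nr_msgs, 1);

	if (old >= u->max_msgs) {
		/* Over the cap: undo the reservation and refuse. */
		atomic_fetch_sub(&u->nr_msgs, 1);
		return false;
	}
	return true;
}

/* Release a slot when a response arrives or the message times out. */
void demo_msg_put(struct demo_user *u)
{
	atomic_fetch_sub(&u->nr_msgs, 1);
}

/* Current number of outstanding messages for this user. */
unsigned int demo_msg_count(struct demo_user *u)
{
	return atomic_load(&u->nr_msgs);
}
```

A sender would call `demo_msg_get()` before queueing a request and return an
error such as EBUSY when it fails, so a misbehaving application gets
backpressure instead of growing the queue without bound. Keeping the cap as a
field rather than a constant leaves room for per-machine tuning.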
The precondition for adding limits is that people know that too many
ipmi users and messages cause problems; this patch lets the
administrator know that.
In addition, different machines have different limits. My server may block
700,000 messages and be fine, while my NAS PC went OOM when it had
blocked roughly 10,000 messages. So, can limiting the number of users
and messages wait until we have accumulated some online experience?
Does that make sense?
-corey
thanks
--
Chen Guanqiao
This patch series provides a way to view the current number of users and
messages in ipmi, and introduces a simple interface to clear the message queue.
Chen Guanqiao (3):
ipmi: Get the number of user through sysfs
ipmi: Get the number of message through sysfs
ipmi: add a interface to clean message queue in sysfs
drivers/char/ipmi/ipmi_msghandler.c | 159 ++++++++++++++++++++++++++++
1 file changed, 159 insertions(+)
--
2.25.1
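
As a rough illustration of what exposing such counters looks like, the
userspace sketch below formats a count the way a sysfs attribute
conventionally does: one value per file, newline terminated. Everything here
is an assumption made for illustration. `struct demo_intf` and the
`demo_*_show()` helpers are hypothetical and are not taken from the patches,
and `snprintf()` stands in for the kernel-side formatting helper.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical snapshot of one interface's counters. */
struct demo_intf {
	unsigned int nr_users;	/* users currently registered */
	unsigned int nr_msgs;	/* messages currently queued */
};

/* Write the user count as "<n>\n"; returns the snprintf() result. */
int demo_nr_users_show(const struct demo_intf *intf, char *buf, size_t len)
{
	return snprintf(buf, len, "%u\n", intf->nr_users);
}

/* Write the message count as "<n>\n"; returns the snprintf() result. */
int demo_nr_msgs_show(const struct demo_intf *intf, char *buf, size_t len)
{
	return snprintf(buf, len, "%u\n", intf->nr_msgs);
}
```

With one number per attribute, an administrator or monitoring tool can read
the value with a plain `cat` and alert on it without parsing.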
_______________________________________________
Openipmi-developer mailing list
Openipmi-developer@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openipmi-developer