Hi Karl,

It took me some time to reproduce, but I was able to dump the process after it 
happened again, and an OOM (OutOfMemoryError) appears to be the cause. After 
investigation, it seems the OOM was triggered by a transformation connector I 
developed. I increased the JVM heap size a little and the problem has not 
occurred since. For information, I had limited that connector to a single 
connection, to rule out the connection count as a potential cause.

My question is: in a similar case (an OOM in my scenario), is there something 
that can be done at the connector level, or at a more global level in MCF, to 
make sure the agent process crashes instead of staying up?
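
For illustration, one generic JVM-level option (an assumption, not something 
confirmed in this thread as an MCF feature) is a default uncaught-exception 
handler that halts the process on an OutOfMemoryError. The class name OomGuard 
and the exit code are hypothetical. The handler only fires for errors that 
escape a thread's run() method, so it does nothing if the connector or the 
framework catches Throwable first.

```java
/**
 * Hypothetical helper, not part of ManifoldCF: detects an OutOfMemoryError
 * anywhere in a cause chain and halts the JVM when one escapes a thread.
 */
final class OomGuard {
    private OomGuard() {}

    /** Returns true if t, or any of its causes, is an OutOfMemoryError. */
    static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 64; depth++) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /** Installs a JVM-wide handler for throwables that escape a thread. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOutOfMemory(error)) {
                // halt() skips shutdown hooks, which may themselves need memory.
                // Exit code 3 is an arbitrary choice for this sketch.
                Runtime.getRuntime().halt(3);
            }
            // Anything else is only reported; the JVM keeps running.
            System.err.println("Uncaught in " + thread.getName() + ": " + error);
        });
    }
}
```

Calling OomGuard.install() once at startup would be enough. On HotSpot JVMs 
(8u92 and later), the -XX:+ExitOnOutOfMemoryError startup flag should give a 
similar result without any code, and it also applies when the error is caught.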

Regards,
Julien
  

-----Original Message-----
From: Karl Wright <daddy...@gmail.com>
Sent: Tuesday, March 2, 2021 19:17
To: dev <dev@manifoldcf.apache.org>
Subject: Re: Inactive MCF agent

The MCF Agents process shouldn't get hung up under normal operation.  If it 
encounters a problem that may call its continued activity into question, it 
shuts itself down.

There are two situations where the process could theoretically hang.

The first is when you are using file-based synchronization and you forcibly 
kill another ManifoldCF process, so that it doesn't clean up its locks. If you 
are using ZooKeeper, however, locks should always be cleaned up after a process 
is killed.

The second situation is when certain database conditions arise, and MCF decides 
it needs to reset all its worker threads.  When it does this, it blocks all 
worker threads from proceeding until it reaches a point where they are all 
quiescent, and then it resets all of them at the same time. If the threads 
never all reach that quiescent point while MCF is waiting, MCF will be paused 
forever.

What I'd like to do in that case is get a thread dump of the agents process.  
That will tell us what the problem is.
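
As a rough sketch of what such a dump contains, the standard 
java.lang.management API can list every thread with its state and stack. The 
class name ThreadDumper is hypothetical, and the code only sees the JVM it runs 
in, so it would have to be executed inside the agents process; from outside, 
the JDK's jstack tool with the process id is the usual route.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper, not part of ManifoldCF: dumps threads of the current JVM. */
final class ThreadDumper {
    private ThreadDumper() {}

    /** Returns the name, state and stack trace of every live thread as text. */
    static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Only request lock details when the JVM supports them; otherwise
        // dumpAllThreads would throw UnsupportedOperationException.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Threads that stay in the BLOCKED or WAITING state across several dumps taken a 
few minutes apart are the ones worth looking at first.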

Karl


On Tue, Mar 2, 2021 at 12:53 PM <julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I recently faced a weird case where a job in a "running" state was not 
> doing anything for several hours. The MCF agent process was up but 
> neither the Simple History nor the logs showed any activity. Since we 
> could not wait more than 12 hours, we decided to restart the agent, 
> and the job "went back on rails" and continued its work normally.
> To avoid the need for such a manual intervention as much as possible, 
> I have two questions:
> - Is there a way to "test" the agent process? Something like a 
> "process ping" that can detect whether the process is doing, or is 
> ready to do, something? If not, is there an easy way to implement 
> such a thing? The idea is to make detection and restart automatic 
> rather than manual.
> - Given that we have activated the debug log level, would you have any 
> recommendations on what to look at to find a potential cause of such 
> an issue?
>
> Regards,
> Julien Massiera
>
>
