On Friday 14 April 2006 11:24, Cedric Tefft wrote:
Hello,

> Hi all -
>
> For the last year or two (through three or four upgrades), Bacula has
> been segfaulting on me at irregular intervals, averaging approximately
> once a month.  Recently, however, the problem has been occurring more
> frequently, so this week I buckled down and tried to ferret out the
> problem.  It has been maddeningly difficult to reproduce, but I now
> have a config that will consistently cause a segfault on my system.  It
> appears the problem is somehow related to Python, but I'm at a loss as
> to exactly what's wrong.  The Python script may execute flawlessly a
> dozen times and then on the next job, Bacula segfaults (apparently in
> pythonlib).  The odd thing is that, as far as I can tell, the Python
> script is making exactly the same branching decisions and executing
> exactly the same pieces of code when it segfaults as it did when it ran
> through just fine.  The other puzzling thing is that I can prevent the
> segfault by changing virtually anything about the director's config --
> things that, as far as I know, should have no effect on the Python
> script one way or another. I suspect several factors are interacting in
> JUST the right way to cause the segfault, but I'll be darned if I can
> untangle the mess.  Maybe one of you fine folks will have some insight.
> Here's what I've got:
>
> You can see from the director config that I've got the same three jobs
> scheduled to run three times in a row.  You can see from the console log
> that all three jobs run successfully twice in a row.  Then, on the third
> run, it appears the first job segfaults in the middle of the Python
> script.  However, you can see from the director's debugging output that
> the segfault occurs just after the job record for the third job (the
> catalog backup) is created even though it doesn't show up in the console
> log.  My suspicion is that the segfault has something to do with the
> timing of the initialization of the third job, but I could be way off
> the mark there.
>
> Anyway, just to confuse the issue, in the course of finding a config
> that would consistently segfault, I found that any one of these changes
> will prevent it:
>
> * Disable any one of the three jobs (by commenting out the Schedule line)
> * Change the Level defined in the schedule from Incremental to Full
> * In the CatalogTest job, change the RunBefore script from
> make_catalog_backup to make_catalog_backup_fake (attached)
> * Create a separate schedule for the catalog backup job and have it
> start one minute later than the other two
>
> Any ideas what's wrong?  What I could test?

From the gdb traceback that you supplied (nice work), it looks like there is
something seriously wrong with PyEval_AcquireLock(), which I understood would
acquire a global lock in Python, since the Python library is not re-entrant
(i.e. not thread safe).  I say this because I see that two Bacula threads are
both executing Python library code, and that shouldn't happen.

I suspect that your problem only occurs when two jobs are running 
simultaneously and that both jobs are executing Python code -- this seems to 
agree with the fact that changing any little thing prevents the problem, 
which is typical of race conditions.

The easiest way to test whether my theory is correct is for me to add a mutex
(pthreads lock) that I am sure will prevent multiple Bacula threads from
entering Python code at the same time.  The longer term solution would
involve understanding why the Python lock fails -- my suspicion is that
Bacula acquires the Python lock recursively and hence releases it
"recursively", which will not work if the lock is not expecting to be taken
recursively.  Normally the lock should not be taken recursively, but this
could happen if a Python event that you have defined calls back to Bacula and
Bacula then calls the Python libraries again.
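
To make the recursive-locking suspicion concrete, here is a minimal sketch in
Python (an illustration only, not Bacula code).  It assumes threading.Lock as
a stand-in for a lock that does not expect re-entry and threading.RLock for
one that does; run_event and callback_into_host are hypothetical names for a
Python event and its call back into the host program.

```python
import threading


def callback_into_host(lock):
    """Hypothetical callback: re-enters the guarded region on the same thread."""
    # Non-blocking attempt, so a plain lock reports failure instead of deadlocking.
    got_it = lock.acquire(blocking=False)
    if got_it:
        lock.release()
    return got_it


def run_event(lock):
    """Hypothetical event handler: takes the lock, then calls back while holding it."""
    with lock:
        return callback_into_host(lock)


plain_reentry_ok = run_event(threading.Lock())       # non-recursive lock
recursive_reentry_ok = run_event(threading.RLock())  # recursive lock
print(plain_reentry_ok, recursive_reentry_ok)
```

The plain lock refuses the second acquisition by the thread that already holds
it, while the recursive lock accepts it.  A blocking acquire on the plain lock
would simply hang at that point.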

Can you confirm, either by testing or from your knowledge of the problem, that
it does not occur when no jobs are using Python simultaneously?  If this
is the case, let me know and I'll work on adding a correct recursive lock
that allows only one Bacula thread at a time in the Python libraries ...
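
As a rough sketch of what such a guard is meant to achieve (again in Python
and not the actual Bacula change), the following assumes a single recursive
lock, here called python_guard, around every entry into the guarded code.
The names enter_python and job, the three threads, and the one level of
nested re-entry are all hypothetical choices for illustration.

```python
import threading
import time

python_guard = threading.RLock()  # hypothetical guard around "Python library" entry
state_lock = threading.Lock()     # protects the two counters below
inside = 0                        # threads currently inside the guarded region
max_inside = 0                    # highest value 'inside' ever reached


def enter_python(depth):
    """Simulate entering the guarded code, with one nested re-entry."""
    global inside, max_inside
    with python_guard:
        if depth == 0:
            with state_lock:
                inside += 1
                max_inside = max(max_inside, inside)
        time.sleep(0.001)  # give other threads a chance to try to get in
        if depth < 1:
            enter_python(depth + 1)  # same thread re-acquires the recursive guard
        if depth == 0:
            with state_lock:
                inside -= 1


def job():
    """Hypothetical job that enters the guarded code several times."""
    for _ in range(5):
        enter_python(0)


threads = [threading.Thread(target=job) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(max_inside)
```

Because the guard is recursive, the nested call does not block its own thread,
yet no second thread can get in until the outermost acquisition is released.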

-- 
Best regards,

Kern

  (">
  /\
  V_V


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users