I decided to start a new thread for this as the original one seems to have
forked multiple times.

This thread is related to https://github.com/Test-More/test-more/issues/589
alone.

David and I have both put some time into reproducing this, and I want to
put out a summary of my findings (I will leave it to David to summarize
his, if he cares to).

I was able to boil the issue down to a fairly simple test-script:
https://gist.github.com/exodist/ef08a3c406d98c1bb618 which is able to
reproduce the problem sometimes. XDG has a slightly different version, but
in both cases the key part is this:

ok(1);
> my ($index) = grep { $_->{name} eq 'foo' } $coll->get_indexes;
> ok(1);


The first ok is used to get some state into the system, if the state is not
there the problem does not occur, or occurs too infrequently to measure.
Ending the test right after the first ok also never seems to reproduce the
issue.

The second line gets a variable generated deep inside the mongodb code, and
I have been treating it like a black box. The issue never occurs if we get
rid of this line, or modify it so that $index is cleared before the second
ok() line.

The second ok() seems to trigger the bug, to be specific we know that the
scalar with the refcount mentioned in the warning is the one being created
and blessed here:

https://github.com/Test-More/test-more/blob/stream/master/lib/Test/Stream/Context.pm#L263

Knowing that does not help us much. This same code runs with no problems if
$index is destroyed before it runs. Making any significant changed to this
method, or the methods leading up to it in the stack seem to effect the
frequency at which the bug occurs, but will not eliminate it completely
(This initially had us following a red herring where localizing $_ seemed
to remove the problem, but that ultimately turned out to be incorrect).

So far we have only been able to reproduce this with any reliability on
darwin, using perls 5.20.2 and 5.20.1. Alh managed to reproduce it at least
once on linux, but it is not frequent enough to really dig in. 5.18.4 and
earlier do not appear to reproduce the issue at all. I have not been able
to try 5.20.0 or blead, and I do not expect to have the resources to do so
in the near-term (I do not have a darwin box, and the one I have been using
is being used for other things by its owner).

There are many other things that can be done to alter the frequency, in
some cases enough so to make the problem appear solved, but ultimately it
seems that the more state you have built up and stored, the more likely the
issue is to occur.

At this point I am unable to continue further on this investigation, so I
am writing up this summary for others to read and comment on, and possibly
continue the investigation.

-Chad

Reply via email to