Phil Frost <phil.fr...@postmates.com> added the comment:

skrah: Yes, that's correct. Since I can only reproduce this bug in production, 
it will take me a few days to build and validate a source build. But absent any 
better ideas, I will try.

tim.peters: I've observed this bug across hundreds of EC2 hosts, in dozens of 
code paths, with all kinds of inputs. Moreover, the hosts aren't displaying any 
other symptoms of hardware failure such as random segfaults or mysteriously 
corrupted data.

I've also now investigated two core dumps in depth, and both show specifically 
that `exp` seems to get 2 added when it should have been 1. I have a hard time 
explaining how a hardware failure could cause precisely the same failure so 
reliably.
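
To illustrate what one extra unit in an exponent does to a value, here is a 
minimal sketch. It assumes `exp` is the exponent of a decimal number stored as 
coefficient * 10**exp, and it uses Python's `decimal` module purely for 
demonstration. The helper `bump_exponent` is hypothetical and is not the code 
path seen in the cores.

```python
from decimal import Decimal


def bump_exponent(value, extra):
    """Hypothetical helper: return `value` with `extra` added to its exponent."""
    sign, digits, exp = value.as_tuple()
    return Decimal((sign, digits, exp + extra))


correct = Decimal("1.25")                # digits (1, 2, 5), exponent -2
off_by_one = bump_exponent(correct, 1)   # same digits, exponent -1

# The digits are untouched; only the magnitude changes, by a factor of ten.
print(correct, off_by_one, off_by_one / correct)
```

Under that assumption, an exponent that is one too large leaves every digit 
intact and only scales the result by ten, which is a very specific pattern.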

So I doubt hardware is to blame.

That said, the issue does seem to occur in "clumps" on individual hosts. We 
might go 10 hours without seeing it, then it may happen 5 times within 
30 minutes on one host. We might observe 1 or 2 more such clumps on the same 
host until the next deploy of the application, at which point all the 
containers are replaced with fresh ones. This suggests there is some 
ephemeral state within a host that creates a propensity for the issue.

I've also been unable to reproduce the problem in a development environment, 
even when that environment uses the same kernel, instance class, and Docker 
container as production. So I suspect the bug is precipitated by some 
particular concurrency or interaction that I haven't been able to replicate.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue37168>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com