Re: Using rr chaos mode to find intermittent bugs
On Fri, Feb 12, 2016 at 8:39 AM, Kyle Huey wrote:

> On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan wrote:
>
>> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <nicolas.b.pier...@mozilla.com> wrote:
>>
>>> On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
>>>
>>>> Background:
>>>> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>>>>
>>>> I just landed on rr master support for a "-h" option which enables a
>>>> chaos mode for rr recording. This is designed to help reproduce
>>>> intermittent test failures under rr. […]
>>>
>>> Thanks Roc, I will give it a try.
>>>
>>> On the other hand, I used to rely more on the "-c" option to achieve a
>>> similar thing in the past, instead of the "-e" option.
>>>
>>> The reason I did so being that the thread I am interested in does a few
>>> syscalls compared to the rest of the program. Thus I felt that using "-e"
>>> option would give it an unfair large time slices compared to what is
>>> supposed to happen if the threads are running concurrently.
>>
>> The -e option is gone now because the new scheduler (with or without chaos
>> mode) does not take system calls into account when calculating the length
>> of a timeslice. We only count conditional branches.
>
> So we context switch at a syscall now only when the current thread happens
> to become unschedulable?

Or if any higher-priority thread has become runnable. This includes not just a low-priority thread doing a FUTEX_WAKE to wake a high-priority thread, but also a thread changing its priority or another thread's priority, or even a low-priority thread writing to a pipe that a high-priority thread is reading from. (Though in the latter case the scheduler *might* not see the high-priority thread become runnable in time in all cases.)
Rob -- lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr .a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn ___ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform
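The switching rule described in the reply above can be pictured with a small toy model. This is only a sketch under stated assumptions, not rr's actual scheduler code: the `Thread` class, the function names, and the convention that a smaller number means higher priority are all invented here for illustration. It captures just the two points made in the thread: timeslice length is measured in conditional branches (not syscalls), and at a syscall the scheduler switches only if the current thread can no longer run or a higher-priority thread has become runnable.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """Hypothetical thread record for the toy model (not an rr type)."""
    name: str
    priority: int          # toy convention (assumption): smaller = higher priority
    runnable: bool = True


def timeslice_expired(branches_executed: int, timeslice_branches: int) -> bool:
    """Timeslices are measured in conditional branches only; syscalls do not count."""
    return branches_executed >= timeslice_branches


def should_switch_at_syscall(current: Thread, threads: list[Thread]) -> bool:
    """Switch at a syscall only if the current thread became unschedulable
    or some higher-priority thread is now runnable."""
    if not current.runnable:
        return True
    return any(
        t.runnable and t.priority < current.priority
        for t in threads
        if t is not current
    )


# Example: a low-priority thread wakes a blocked high-priority thread
# (for instance via FUTEX_WAKE, as in the message above).
low = Thread("low", priority=10)
high = Thread("high", priority=1, runnable=False)
before_wake = should_switch_at_syscall(low, [low, high])
high.runnable = True
after_wake = should_switch_at_syscall(low, [low, high])
```

In this model `before_wake` is false and `after_wake` is true: the syscall itself does not end the timeslice, but waking a higher-priority thread does force a switch.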
Re: Using rr chaos mode to find intermittent bugs
On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <nicolas.b.pier...@mozilla.com> wrote:

> On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
>
>> Background:
>> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>>
>> I just landed on rr master support for a "-h" option which enables a chaos
>> mode for rr recording. This is designed to help reproduce intermittent test
>> failures under rr. […]
>
> Thanks Roc, I will give it a try.
>
> On the other hand, I used to rely more on the "-c" option to achieve a
> similar thing in the past, instead of the "-e" option.
>
> The reason I did so being that the thread I am interested in does a few
> syscalls compared to the rest of the program. Thus I felt that using "-e"
> option would give it an unfair large time slices compared to what is
> supposed to happen if the threads are running concurrently.

The -e option is gone now because the new scheduler (with or without chaos mode) does not take system calls into account when calculating the length of a timeslice. We only count conditional branches.

Bugs that require very frequent fine-grained context switching are probably still hard to find with chaos mode, because very frequent context switching slows down recording tremendously and I didn't want chaos mode to slow down execution by more than a bounded amount. So you may find that -c is still needed.

Rob
Re: Using rr chaos mode to find intermittent bugs
On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan wrote:

> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <nicolas.b.pier...@mozilla.com> wrote:
>
>> On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
>>
>>> Background:
>>> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>>>
>>> I just landed on rr master support for a "-h" option which enables a
>>> chaos mode for rr recording. This is designed to help reproduce
>>> intermittent test failures under rr. […]
>>
>> Thanks Roc, I will give it a try.
>>
>> On the other hand, I used to rely more on the "-c" option to achieve a
>> similar thing in the past, instead of the "-e" option.
>>
>> The reason I did so being that the thread I am interested in does a few
>> syscalls compared to the rest of the program. Thus I felt that using "-e"
>> option would give it an unfair large time slices compared to what is
>> supposed to happen if the threads are running concurrently.
>
> The -e option is gone now because the new scheduler (with or without chaos
> mode) does not take system calls into account when calculating the length
> of a timeslice. We only count conditional branches.

So we context switch at a syscall now only when the current thread happens to become unschedulable?

- Kyle
Re: Using rr chaos mode to find intermittent bugs
On 02/10/2016 08:04 PM, Robert O'Callahan wrote:

> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>
> I just landed on rr master support for a "-h" option which enables a chaos
> mode for rr recording. This is designed to help reproduce intermittent test
> failures under rr. […]

Thanks Roc, I will give it a try.

On the other hand, in the past I relied more on the "-c" option than on the "-e" option to achieve a similar thing.

I did so because the thread I am interested in makes few syscalls compared to the rest of the program. I felt that the "-e" option would give it unfairly large time slices compared to what is supposed to happen when the threads are running concurrently.

-- 
Nicolas B. Pierron
Using rr chaos mode to find intermittent bugs
Background:
http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

I just landed on rr master support for a "-h" option which enables a chaos mode for rr recording. This is designed to help reproduce intermittent test failures under rr. We already have a few reports of people using this successfully to find difficult bugs. Even though rr works only on desktop Linux (including VMs), I've reproduced a bug that only showed up in automation on Android, and khuey reproduced a bug that only showed up on OSX 10.6.

I'm continuing to do experiments to try to reproduce more of our top intermittents, but you may already find rr chaos mode useful. I recommend running a single test or a small group of tests continuously; one of my bugs only had a few failing runs out of a thousand. I'm sure there are still bugs rr can't reproduce, and I'm very interested in hearing about bugs that eventually get fixed but that rr was not able to reproduce. By studying such bugs we can improve rr chaos mode so it can find them.

Obviously, once rr chaos mode has proved itself, we should get some automation around it. I'd like a bit more experience with it before we have that discussion.

Rob
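The advice above to run a single test continuously can be scripted. The following is a minimal sketch, not an official rr workflow: the helper name `run_until_failure` and its parameters are invented for illustration, and it assumes that `rr record -h <command>` exits with a non-zero status when the recorded test fails. The `recorder` prefix is a parameter so the loop can be tried without rr installed.

```python
import subprocess


def run_until_failure(test_cmd, max_runs=1000, recorder=("rr", "record", "-h")):
    """Run test_cmd repeatedly under the recorder prefix until one run fails.

    Returns the 1-based number of the first failing run, or None if all
    max_runs runs pass. Hypothetical helper; it assumes the recorder
    passes the test's exit status through.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run([*recorder, *test_cmd])
        if result.returncode != 0:
            return run  # stop here so the failing recording is the latest one
    return None
```

A call such as `run_until_failure(["./my-test"])` (with `./my-test` standing in for a real test command) would stop at the first failing run, leaving that recording as the most recent one to replay.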
Re: Using rr chaos mode to find intermittent bugs
On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek wrote:

> BenWa tried doing some work on this but kept getting hung up
> on hitting test failures unrelated to the ones we see in production,
> possibly due to environment issues.

Yes. In this vein, it's possible that in some cases rr chaos mode might trigger bugs that don't normally happen, that one way or another block you from finding the bug you care about. However, bugs found by rr chaos mode should all be "real bugs". I'd certainly love to hear about any cases where that's not true.

Rob
Re: Using rr chaos mode to find intermittent bugs
On Wed, Feb 10, 2016, at 03:04 PM, Robert O'Callahan wrote:

> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>
> I just landed on rr master support for a "-h" option which enables a chaos
> mode for rr recording. This is designed to help reproduce intermittent test
> failures under rr. We already have a few reports of people using this
> successfully to find difficult bugs. Even though rr works only on desktop
> Linux (including VMs), I've reproduced a bug that only showed up in
> automation on Android, and khuey reproduced a bug that only showed up on
> OSX 10.6.
>
> I'm continuing to do experiments to try to reproduce more of our top
> intermittents, but you may already find rr chaos mode useful. I recommend
> running a single test or a small group of tests continuously; one of my
> bugs only had a few failing runs out of a thousand. I'm sure there are
> still bugs rr can't reproduce, and I'm very interested in hearing about
> bugs that eventually get fixed but that rr was not able to reproduce. By
> studying such bugs we can improve rr chaos mode so it can find them.
>
> Obviously, once rr chaos mode has proved itself, we should get some
> automation around it. I'd like a bit more experience with it before we have
> that discussion.

This is great! I've kept holding out hope that rr can help us fix intermittent test failures, but so far we've failed to actually prove this out. BenWa tried doing some work on this but kept getting hung up on hitting test failures unrelated to the ones we see in production, possibly due to environment issues.

jmaher and armenzg and others have been doing some great work lately standing up Linux tests in Taskcluster, a side effect of which is that we now have a Docker image for running Linux tests. If anyone wants to prototype reproducing failures from CI, running rr inside that image would be a good place to start.

-Ted
Re: Using rr chaos mode to find intermittent bugs
On 2016/02/11 5:47, Robert O'Callahan wrote:

> On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek wrote:
>
>> BenWa tried doing some work on this but kept getting hung up
>> on hitting test failures unrelated to the ones we see in production,
>> possibly due to environment issues.
>
> Yes. In this vein, it's possible that in some cases rr chaos mode might
> trigger bugs that don't normally happen, that one way or another block you
> from finding the bug you care about. However, bugs found by rr chaos mode
> should all be "real bugs". I'd certainly love to hear about any cases where
> that's not true.
>
> Rob

This scheduling change, which makes hard-to-reproduce bugs occur more often, sounds interesting.

I have found that running C-C TB (sorry, it is not the browser here) under valgrind/memcheck, which slows down operation dramatically, has helped me find a few issues. Off the top of my head:

- Incremental GC gets re-entered before it finishes the previous invocation. This was not handled properly until I noticed the issue, but it is now handled OK.

- There are some issues in threading.

  For one, at startup, some threads incorrectly assume that the on-screen window is already there, but due to the slowdown it has not been created yet. I see some disturbing warning messages printed on the invoking tty window. I have not filed a bug yet since this is relatively new; I don't think I saw such messages early last year.

  For the other, at shutdown, C-C TB has a problem with incorrect ordering of thread shutdown: some threads seem to request services during shutdown from service providers, but the threads that provide those services have already shut down. So proper shutdown does not happen. There may even be a cyclic dependency. Who knows? With the slowdown due to valgrind/memcheck, the issue gets more pronounced.

Right now, though, there is a timer that monitors the shutdown process. The prolonged timeout of some operations, caused by the missing thread and by the valgrind/memcheck slowdown, automatically triggers the assertion for a permanent hang at shutdown, so it is difficult to figure out what is going on. One can hope that the check for a permanent hang gets removed temporarily so the issue can be investigated further. Crashes in C-C TB are something I have experienced several times in real life over the last couple of years.

Another area where this rr framework or a similar approach would be useful is C-C TB xpcshell testing (and I think it would be useful for FF xpcshell testing as well). There seem to be a few intermittent test failures in xpcshell tests. This rr approach may make the tests fail more often.

*HOWEVER*, I am going to file a bugzilla about the OVEREAGER ASYNC approach of the current xpcshell test script, which introduces spurious errors at least under Windows: a previous test that still has some files open has not completely shut down before the next test, which seems to use THOSE files, gets started. Under Windows, opening such a file may result in a file-locked error. (Under Linux/OSX, I think it is OK to open such files unless the first program explicitly calls |flock| or something.)

So whether ALL the intermittent failures in C-C TB xpcshell tests can be investigated better with the rr approach is anyone's guess, but I think it does have the potential to trigger more dormant bugs, just as valgrind/memcheck uncovered a few timing issues.

But one other post suggested that it is not applicable right now outside Gecko, meaning C-C TB xpcshell testing cannot directly benefit from rr? (The approach, of course, can be emulated, I suppose.)

TIA
Re: Using rr chaos mode to find intermittent bugs
rr should work fine with c-c xpcshell tests (and most other Linux programs).
Re: Using rr chaos mode to find intermittent bugs
On 2016/02/11 7:04, Robert O'Callahan wrote:

> rr should work fine with c-c xpcshell tests (and most other Linux programs).

This sounds great!

CI