Re: Using rr chaos mode to find intermittent bugs

2016-02-11 Thread Robert O'Callahan
On Fri, Feb 12, 2016 at 8:39 AM, Kyle Huey  wrote:

> On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan 
> wrote:
>
>> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <
>> nicolas.b.pier...@mozilla.com> wrote:
>>
>> > On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
>> >
>> >> Background:
>> >> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>> >>
>> >> I just landed on rr master support for a "-h" option which enables a
>> chaos
>> >> mode for rr recording. This is designed to help reproduce intermittent
>> >> test
>> >> failures under rr. […]
>> >>
>> >
>> > Thanks Roc, I will give it a try.
>> >
>> > On the other hand, I used to rely more on the "-c" option to achieve a
>> > similar thing in the past, instead of the "-e" option.
>> >
>> > The reason I did so being that the thread I am interested in does a few
>> > syscalls compared to the rest of the program.  Thus I felt that using
>> "-e"
>> > option would give it an unfair large time slices compared to what is
>> > supposed to happen if the threads are running concurrently.
>>
>>
>> The -e option is gone now because the new scheduler (with or without chaos
>> mode) does not take system calls into account when calculating the length
>> of a timeslice. We only count conditional branches.
>>
>
> So we context switch at a syscall now only when the current thread happens
> to become unschedulable?
>

Or if any higher-priority thread has become runnable. This includes not
just a low-priority thread doing a FUTEX_WAKE to wake a high-priority
thread, but also a thread changing its priority or another thread's
priority, or even a low-priority thread writing to a pipe that a
high-priority thread is reading from. (Though in the latter case the
scheduler *might* not see the high-priority thread become runnable in time
in all cases.)
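
A toy model of the rule described above may help. This is only a sketch,
not rr's actual scheduler code: the `Thread` class, both function names,
and the convention that a smaller number means a higher priority are
assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int          # assumption of this sketch: smaller = higher priority
    runnable: bool = True


def should_switch_at_syscall(current, threads):
    """Decide whether to switch away from `current` when it makes a syscall.

    A switch happens if the current thread has become unschedulable, or if
    any higher-priority thread is runnable (for example after a FUTEX_WAKE
    or a priority change).
    """
    if not current.runnable:
        return True
    return any(
        t.runnable and t.priority < current.priority
        for t in threads
        if t is not current
    )


def timeslice_expired(branches_executed, timeslice_branches):
    """Timeslice length is measured in conditional branches only;
    system calls do not count towards it."""
    return branches_executed >= timeslice_branches
```

With a runnable low-priority thread and a blocked high-priority thread,
`should_switch_at_syscall` returns False; once the high-priority thread is
marked runnable it returns True.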

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf
toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t
rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea
lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr
esn
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Using rr chaos mode to find intermittent bugs

2016-02-11 Thread Robert O'Callahan
On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <
nicolas.b.pier...@mozilla.com> wrote:

> On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
>
>> Background:
>> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>>
>> I just landed on rr master support for a "-h" option which enables a chaos
>> mode for rr recording. This is designed to help reproduce intermittent
>> test
>> failures under rr. […]
>>
>
> Thanks Roc, I will give it a try.
>
> On the other hand, I used to rely more on the "-c" option to achieve a
> similar thing in the past, instead of the "-e" option.
>
> The reason I did so being that the thread I am interested in does a few
> syscalls compared to the rest of the program.  Thus I felt that using "-e"
> option would give it an unfair large time slices compared to what is
> supposed to happen if the threads are running concurrently.


The -e option is gone now because the new scheduler (with or without chaos
mode) does not take system calls into account when calculating the length
of a timeslice. We only count conditional branches.

Bugs that require very frequent fine-grained context switching are probably
still hard to find with chaos mode, because very frequent context switching
slows down recording tremendously and I didn't want chaos mode to slow down
execution by more than a bounded amount. So you may find that -c is still
needed.

Rob


Re: Using rr chaos mode to find intermittent bugs

2016-02-11 Thread Kyle Huey
On Thu, Feb 11, 2016 at 11:35 AM, Robert O'Callahan 
wrote:

> On Thu, Feb 11, 2016 at 11:55 PM, Nicolas B. Pierron <
> nicolas.b.pier...@mozilla.com> wrote:
>
> > On 02/10/2016 08:04 PM, Robert O'Callahan wrote:
> >
> >> Background:
> >> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
> >>
> >> I just landed on rr master support for a "-h" option which enables a
> chaos
> >> mode for rr recording. This is designed to help reproduce intermittent
> >> test
> >> failures under rr. […]
> >>
> >
> > Thanks Roc, I will give it a try.
> >
> > On the other hand, I used to rely more on the "-c" option to achieve a
> > similar thing in the past, instead of the "-e" option.
> >
> > The reason I did so being that the thread I am interested in does a few
> > syscalls compared to the rest of the program.  Thus I felt that using
> "-e"
> > option would give it an unfair large time slices compared to what is
> > supposed to happen if the threads are running concurrently.
>
>
> The -e option is gone now because the new scheduler (with or without chaos
> mode) does not take system calls into account when calculating the length
> of a timeslice. We only count conditional branches.
>

So we context switch at a syscall now only when the current thread happens
to become unschedulable?

- Kyle


Re: Using rr chaos mode to find intermittent bugs

2016-02-11 Thread Nicolas B. Pierron

On 02/10/2016 08:04 PM, Robert O'Callahan wrote:

> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
>
> I just landed on rr master support for a "-h" option which enables a chaos
> mode for rr recording. This is designed to help reproduce intermittent test
> failures under rr. […]


Thanks Roc, I will give it a try.

On the other hand, I used to rely more on the "-c" option to achieve a
similar thing in the past, instead of the "-e" option.

The reason I did so is that the thread I am interested in makes few
syscalls compared to the rest of the program. Thus I felt that using the
"-e" option would give it unfairly large time slices compared to what is
supposed to happen if the threads are running concurrently.


--
Nicolas B. Pierron


Using rr chaos mode to find intermittent bugs

2016-02-10 Thread Robert O'Callahan
Background:
http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html

I just landed on rr master support for a "-h" option which enables a chaos
mode for rr recording. This is designed to help reproduce intermittent test
failures under rr. We already have a few reports of people using this
successfully to find difficult bugs. Even though rr works only on desktop
Linux (including VMs), I've reproduced a bug that only showed up in
automation on Android, and khuey reproduced a bug that only showed up on
OSX 10.6.

I'm continuing to do experiments to try to reproduce more of our top
intermittents, but you may already find rr chaos mode useful. I recommend
running a single test or a small group of tests continuously; one of my
bugs only had a few failing runs out of a thousand. I'm sure there are
still bugs rr can't reproduce, and I'm very interested in hearing about
bugs that eventually get fixed but that rr was not able to reproduce. By
studying such bugs we can improve rr chaos mode so it can find them.
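
The "run a single test continuously" advice can be sketched as a small
loop. This is a minimal sketch under stated assumptions: the function name
`record_until_failure` and its parameters are hypothetical, and it assumes
that `rr record` exits with the recorded test's exit status, so that a
non-zero status signals a failing (and now recorded) run.

```python
import subprocess


def record_until_failure(test_cmd, max_runs=1000,
                         recorder=("rr", "record", "-h")):
    """Run `test_cmd` repeatedly under the recorder until it fails.

    `-h` is the chaos-mode option mentioned above. Returns the 1-based
    number of the first failing run, or None if every run passed.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run([*recorder, *test_cmd])
        if result.returncode != 0:
            return run
    return None
```

For example, `record_until_failure(["./my-test"])` (with `./my-test` as a
placeholder command) stops at the first failing run, leaving that
recording available for replay.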

Obviously, once rr chaos mode has proved itself, we should get some
automation around it. I'd like a bit more experience with it before we have
that discussion.

Rob


Re: Using rr chaos mode to find intermittent bugs

2016-02-10 Thread Robert O'Callahan
On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek  wrote:

> BenWa tried doing some work on this but kept getting hung up
> on hitting test failures unrelated to the ones we see in production,
> possibly due to environment issues.
>

Yes. In this vein, it's possible that in some cases rr chaos mode might
trigger bugs that don't normally happen, that one way or another block you
from finding the bug you care about.

However, bugs found by rr chaos mode should all be "real bugs". I'd
certainly love to hear about any cases where that's not true.

Rob


Re: Using rr chaos mode to find intermittent bugs

2016-02-10 Thread Ted Mielczarek


On Wed, Feb 10, 2016, at 03:04 PM, Robert O'Callahan wrote:
> Background:
> http://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html
> 
> I just landed on rr master support for a "-h" option which enables a
> chaos
> mode for rr recording. This is designed to help reproduce intermittent
> test
> failures under rr. We already have a few reports of people using this
> successfully to find difficult bugs. Even though rr works only on desktop
> Linux (including VMs), I've reproduced a bug that only showed up in
> automation on Android, and khuey reproduced a bug that only showed up on
> OSX 10.6.
> 
> I'm continuing to do experiments to try to reproduce more of our top
> intermittents, but you may already find rr chaos mode useful. I recommend
> running a single test or a small group of tests continuously; one of my
> bugs only had a few failing runs out of a thousand. I'm sure there are
> still bugs rr can't reproduce, and I'm very interested in hearing about
> bugs that eventually get fixed but that rr was not able to reproduce. By
> studying such bugs we can improve rr chaos mode so it can find them.
> 
> Obviously, once rr chaos mode has proved itself, we should get some
> automation around it. I'd like a bit more experience with it before we
> have
> that discussion.

This is great! I've kept holding out hope that rr can help us fix
intermittent test failures, but so far we've failed to actually prove
this out. BenWa tried doing some work on this but kept getting hung up
on hitting test failures unrelated to the ones we see in production,
possibly due to environment issues. jmaher, armenzg, and others have
been doing some great work lately standing up Linux tests in
Taskcluster, a side effect of which is that we now have a Docker image
for running Linux tests. If anyone wants to prototype reproducing
failures from CI, running rr inside that image would be a good place to
start.

-Ted


Re: Using rr chaos mode to find intermittent bugs

2016-02-10 Thread ISHIKAWA,chiaki

On 2016/02/11 5:47, Robert O'Callahan wrote:

> On Thu, Feb 11, 2016 at 9:32 AM, Ted Mielczarek  wrote:
>
>> BenWa tried doing some work on this but kept getting hung up
>> on hitting test failures unrelated to the ones we see in production,
>> possibly due to environment issues.
>
> Yes. In this vein, it's possible that in some cases rr chaos mode might
> trigger bugs that don't normally happen, that one way or another block you
> from finding the bug you care about.
>
> However, bugs found by rr chaos mode should all be "real bugs". I'd
> certainly love to hear about any cases where that's not true.
>
> Rob


The idea that this scheduling change makes hard-to-reproduce bugs occur
more often sounds interesting.

I have found that running C-C TB (sorry, it is not the browser here)
under valgrind/memcheck, which slows down operation dramatically,
has helped me find a few issues.
Off the top of my head:
 - Incremental GC gets re-entered before it finishes the previous
   invocation.
   This was not handled properly until I noticed the issue, but it is
   now handled OK.
 - There are some issues in threading.
   For one, at startup, some threads incorrectly assume that the window
   is already on screen, but due to the slowdown, it has not been
   created yet.
   I see some disturbing warning messages printed on the invoking tty
   window.
   I have not filed a bug yet since this is relatively new. I don't
   think I saw such messages early last year.

   For the other, at shutdown, C-C TB has a problem with incorrect
   ordering of thread shutdown: some threads seem to request services
   during shutdown from service providers, but the threads that provide
   the services have already shut down. So proper shutdown does not
   happen. There may even be a cyclic dependency. Who knows?
   With the slowdown due to valgrind/memcheck, the issue
   gets more pronounced. Right now, though, there is a timer that
   monitors the shutdown process, and the prolonged timeout of some
   operations (due to the missing threads and the slowdown caused by
   valgrind/memcheck) automatically triggers the permanent-hang
   assertion at shutdown, so it is difficult to figure out what is
   going on. But one can hope that the check for a permanent hang gets
   removed temporarily to investigate the issue further.
   Crashes in C-C TB are something I have experienced several times in
   the last couple of years in real life.
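
The shutdown-ordering problem described in the list above can be
illustrated with a small dependency sketch. It is not based on the actual
C-C TB code: the function name `shutdown_order` and the thread names in
the usage note are hypothetical. It uses the standard-library `graphlib`
module (Python 3.9+) to derive an order in which clients stop before the
services they use, and to report a cyclic dependency if one exists.

```python
from graphlib import CycleError, TopologicalSorter


def shutdown_order(depends_on):
    """Return a shutdown order in which every thread stops before the
    service threads it depends on.

    `depends_on` maps each thread name to the set of service-provider
    threads it may still call during shutdown.
    """
    try:
        # static_order() lists providers before their clients,
        # which is a valid startup order.
        startup = list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the cycle.
        raise ValueError(f"cyclic service dependency: {err.args[1]}") from err
    # Shutdown is the reverse: clients first, providers last.
    return startup[::-1]
```

Given `{"ui": {"storage", "network"}, "storage": {"io"}, "network": {"io"},
"io": set()}`, the result places "ui" first and "io" last; a mapping such
as `{"a": {"b"}, "b": {"a"}}` raises `ValueError`.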


Another thing: this rr framework or a similar approach will be useful for
C-C TB xpcshell testing (and I think it is useful for FF xpcshell
testing as well).

There seem to be a few intermittent test failures in xpcshell tests.
This rr approach may make the tests fail more often.

*HOWEVER*, I am going to file a bug in Bugzilla about the
OVEREAGER ASYNC approach of the current xpcshell test script introducing
spurious errors, at least under Windows: a previous test that still has
some files open has not completely shut down before the next test, which
seems to use THOSE files, gets started. Under Windows, opening such a
file may result in a file-locked error. (Under Linux/OSX, I think it is
OK to open such files unless the first program explicitly calls |flock|
or something.)
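
The Linux/OSX behaviour mentioned above can be checked with a short
script. This is a sketch only: the function name `second_open_behaviour`
is hypothetical, it relies on the Unix-only `fcntl` module, and it says
nothing about what the xpcshell harness itself does. It opens a file a
second time while the first handle is still open, then shows that a
conflict appears only once `flock` is used explicitly.

```python
import fcntl
import os
import tempfile


def second_open_behaviour():
    """Return (plain_open_ok, lock_conflict) for a file opened twice."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as first:
            first.write("data")
            first.flush()
            # A second open succeeds while the first handle is still open.
            with open(path) as second:
                plain_open_ok = second.read() == "data"
                # Take an explicit exclusive lock through the first handle.
                fcntl.flock(first, fcntl.LOCK_EX)
                try:
                    # A non-blocking lock attempt through the second
                    # handle now fails instead of waiting.
                    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
                    lock_conflict = False
                except BlockingIOError:
                    lock_conflict = True
        return plain_open_ok, lock_conflict
    finally:
        os.remove(path)
```

The two handles conflict even inside one process because `flock` locks
belong to the open file description, and each `open` call creates a new
one.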


So whether ALL the intermittent failures in C-C TB xpcshell tests can be
investigated better with the rr approach is anyone's guess, but
I think it does have the potential to trigger more dormant bugs, just as
valgrind/memcheck uncovered a few timing issues.

But one other post suggested that it is not applicable right now outside
Gecko, meaning C-C TB xpcshell testing cannot directly benefit from rr?

(The approach, of course, can be emulated, I suppose.)

TIA




Re: Using rr chaos mode to find intermittent bugs

2016-02-10 Thread Robert O'Callahan
rr should work fine with c-c xpcshell tests (and most other Linux programs).


Re: Using rr chaos mode to find intermittent bugs

2016-02-10 Thread ISHIKAWA,chiaki

On 2016/02/11 7:04, Robert O'Callahan wrote:

rr should work fine with c-c xpcshell tests (and most other Linux programs).

This sounds great!

CI
