#2848: ThreadDelay can wait forever, next time on January 22, 2009, around 20:00
UTC.
-----------------------------+----------------------------------------------
Reporter:  tomekz            |          Owner:                  
    Type:  bug               |         Status:  new             
Priority:  normal            |      Component:  Runtime System  
 Version:  6.10.1            |       Severity:  critical        
Keywords:                    |       Testcase:                  
      Os:  Unknown/Multiple  |   Architecture:  Unknown/Multiple
-----------------------------+----------------------------------------------
 We(*) have found a serious bug in the non-threaded RTS on 32-bit *nix
 systems.
 Calling threadDelay at certain clock-time intervals can make the thread
 wait forever. This happens regularly, approximately every 49 days 17
 hours,
 or, more accurately, every (2^32 / 1000) seconds. You can reproduce the
 bug with
 a simple program, if you set your machine's clock to a specially crafted
 time.

 In this description I assume a 32-bit *nix system, where the type
 unsigned long has 32 bits.

 If you want to see this happen, here is an instruction:

 Test.hs:

 {{{
 import Control.Concurrent

 main = threadDelay 5000000 -- 5 secs
 }}}

 Running on Linux, as root:

 {{{
 # date --set 'Thu Jan 22 20:20:10 UTC 2009' +%s; strace ./Delay
 ...
 gettimeofday({1232655610, 12122}, NULL) = 0
 gettimeofday({1232655610, 12171}, NULL) = 0
 gettimeofday({1232655610, 12195}, NULL) = 0
 select(0, [], [], NULL, {5, 20000})     = 0 (Timeout)
 gettimeofday({1232655615, 32125}, NULL) = 0
 select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
 gettimeofday({1232655615, 35777}, NULL) = 0
 select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
 gettimeofday({1232655615, 40039}, NULL) = 0
 select(0, [], [], NULL, {4294, 951296}
 }}}

 The runtime slept for more than 5 seconds, but it didn't wake up the
 thread, but instead started waiting for an evil looking amount of time.

 I've managed to find the exact location of the bug, in function
 getourtimeofday() in rts/posix/Itimer.c. Other parts of non-threaded
 RTS, and especially the function awaitEvent(rtsBool) in
 rts/posix/Select.c, assume that the results of getourtimeofday() use the
 full range of values in lnat. But this isn't so. Let's take a look at
 the current version of this function:

 {{{
 lnat
 getourtimeofday(void)
 {
   struct timeval tv;
   nat interval;
   interval = RtsFlags.MiscFlags.tickInterval;
   if (interval == 0) { interval = 50; }
   gettimeofday(&tv, (struct timezone *) NULL);
         // cast to lnat because nat may be 64 bit when int is only 32 bit
   return ((lnat)tv.tv_sec * 1000 / interval +
           (lnat)tv.tv_usec / (interval * 1000));
 }
 }}}

 Look at the first half of the expression in the return statement:

 {{{
 (lnat)tv.tv_sec * 1000 / interval
 }}}

 Here, tv.tv_sec is first converted to lnat, than multiplied by 1000, and
 finally divided by interval. lnat is defined as "unsigned long" - on a
 32-bit system this is usually a 32-bit type. The result of multiplying
 by 1000 will have its higher bits lost and it will be in the range [0 ..
 2^32 - 1]. Dividing it by interval will shrink the range to [0 .. (2^32
 - 1) / interval]. With default value of tickInterval - which is 50 -
 this will be [0 .. 85899345]. Adding the second part of the expression
 can only increase it by (999999 / (interval * 1000)), which is equal to
 19 with default settings.

 So, for default parameters, the result of getourtimeofday() will be in
 [0 .. 85899364]. Why is it a problem? It's quite simple: When a thread
 executes threadDelay the runtime calculates the tick at which it should
 be woken up. In our example, threadDelay is executed at tick 85899266,
 and the time of wake-up is probably calculated as 85899266 + 5 * (1000 /
 50) = 85899366. This is greater then the biggest value getourtimeofday()
 can return.

 At this point I've probably already given too much detail, so I'll
 stop. I think the fix is quite easy: just perform the problematic part
 of computation on a wider integral type and cast it to lnat at the end.

 There is also an easy workaround: compile your program with -threaded.
 The threaded runtime does not use this code, and AFAICT, the time is
 tracked using a 64-bit number of microseconds.

 When I was trying to understand the problem, for some time the culprit
 for me was the "clever trick" mentionad at the top of
 rts/posix/Select.c.  Now I understand it better and think it's innocent.

 We have found this bug in GHC 6.6, and it's still there in the HEAD.

 (*) "We" means Mariusz Gądarowski, Przemysław Kosiak, Jakub Bogusz and
 Tomasz Zielonka - employees of Gemius SA in Poland.
 Mariusz and Przemek encountered this bug in their haskell system for
 distributing computations on a cluster. Mariusz and Jakub provided data
 about the symptoms of the problem - that was invaluable
 because, as you can now see, the bug was very difficult to reproduce.
 Tomasz finally spent four hours in solitude to nail the bug, and another
 hour to write this report. The problem repeated itself for serveral
 months, and in desperation Mariusz started to think about rewriting the
 system in C++. Unfortunately, there is still such a posibility, because
 we have another similar problem to solve. We are more optimistic now, but
 this event raised some doubts about the development model of GHC. We are
 afraid that with the current emphasis on adding new features you may be
 adding
 bugs faster that you are removing them :-/

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/2848>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs

Reply via email to