Your report shows that tasks 41, 49, 53, and 74 crashed and never returned a response. Waiting for those tasks to finish puts qrun into a poll loop that never ends. The tasks were started, but for some reason crashed. This is the same failure I see here, except that it seems to happen more often on your machine. This should be enough to go on to figure out the problem and the fix.
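For anyone who wants to pull the same list out of their own console output, here is a minimal sketch in Python. It is not part of qrun or jcs. It assumes the log lines look like `start: 41 41` and `finish: 41 41`, and it assumes the first number is the job and the second the task; the function name `unfinished` is made up for illustration.

```python
import re

def unfinished(log_text):
    """Return (job, task) pairs that have a start line but no finish line.

    Assumption: each entry looks like 'start: <job> <task>' or
    'finish: <job> <task>'. Quote prefixes such as '>> ' are ignored.
    """
    started = {}      # job -> task, from 'start:' lines
    finished = set()  # jobs seen on 'finish:' lines
    pattern = r"\b(start|finish):\s+(\d+)\s+(\d+)"
    for kind, job, task in re.findall(pattern, log_text):
        if kind == "start":
            started[int(job)] = int(task)
        else:
            finished.add(int(job))
    return sorted((j, t) for j, t in started.items() if j not in finished)

if __name__ == "__main__":
    sample = "start: 0 0\nstart: 1 1\nstart: 2 2\nfinish: 0 0\nfinish: 2 2\n"
    print(unfinished(sample))
```

Pasting the whole console output into `log_text` should list the jobs that were started and never reported a finish, which are the ones qrun ends up polling for.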
On Fri, Oct 6, 2017 at 12:30 PM, Eric Iverson <[email protected]> wrote:

> Thanks for the report. I have fired up another windows machine (win10
> rather than win7) and have again managed to get a failure that looks the
> same as yours. This should make it easier to track down.
>
> On Fri, Oct 6, 2017 at 11:14 AM, 'Pascal Jasmin' via Programming <
> [email protected]> wrote:
>
>> the file is called jcs.log (attached, not sure if I'm allowed). I had 5
>> extra jconsole processes on this run in jqt. (jconsole can also have more
>> than 1 extra unclosed process sometimes when it fails)
>>
>> There's a continuation of the important previous pattern: all jobs were
>> started and finished, and it is the last kill command that results in hang.
>>
>> One idea (thought to be unnecessary with zmq) might be to track finished
>> status and PIDs, and clean up and terminate after all done.
>>
>> console output
>>
>> qrun 99 88 2
>> start: 0 0
>> start: 1 1
>> start: 2 2
>> start: 3 3
>> start: 4 4
>> start: 5 5
>> start: 6 6
>> start: 7 7
>> start: 8 8
>> start: 9 9
>> start: 10 10
>> start: 11 11
>> start: 12 12
>> start: 13 13
>> start: 14 14
>> start: 15 15
>> start: 16 16
>> start: 17 17
>> start: 18 18
>> start: 19 19
>> start: 20 20
>> start: 21 21
>> start: 22 22
>> start: 23 23
>> start: 24 24
>> start: 25 25
>> start: 26 26
>> start: 27 27
>> start: 28 28
>> start: 29 29
>> start: 30 30
>> start: 31 31
>> start: 32 32
>> start: 33 33
>> start: 34 34
>> start: 35 35
>> start: 36 36
>> start: 37 37
>> start: 38 38
>> start: 39 39
>> start: 40 40
>> start: 41 41
>> start: 42 42
>> start: 43 43
>> start: 44 44
>> start: 45 45
>> start: 46 46
>> start: 47 47
>> start: 48 48
>> start: 49 49
>> start: 50 50
>> start: 51 51
>> start: 52 52
>> start: 53 53
>> start: 54 54
>> start: 55 55
>> start: 56 56
>> start: 57 57
>> start: 58 58
>> start: 59 59
>> start: 60 60
>> start: 61 61
>> start: 62 62
>> start: 63 63
>> start: 64 64
>> start: 65 65
>> start: 66 66
>> start: 67 67
>> start: 68 68
>> start: 69 69
>> start: 70 70
>> start: 71 71
>> start: 72 72
>> start: 73 73
>> start: 74 74
>> start: 75 75
>> start: 76 76
>> start: 77 77
>> start: 78 78
>> start: 79 79
>> start: 80 80
>> start: 81 81
>> start: 82 82
>> start: 83 83
>> start: 84 84
>> start: 85 85
>> start: 86 86
>> start: 87 87
>> finish: 0 0
>> finish: 1 1
>> finish: 2 2
>> finish: 3 3
>> finish: 4 4
>> finish: 5 5
>> finish: 6 6
>> finish: 7 7
>> finish: 8 8
>> finish: 9 9
>> finish: 10 10
>> finish: 11 11
>> finish: 12 12
>> finish: 13 13
>> finish: 14 14
>> finish: 15 15
>> finish: 16 16
>> finish: 17 17
>> finish: 18 18
>> finish: 19 19
>> finish: 20 20
>> finish: 21 21
>> finish: 22 22
>> finish: 23 23
>> finish: 24 24
>> finish: 25 25
>> finish: 26 26
>> finish: 27 27
>> finish: 28 28
>> finish: 29 29
>> finish: 30 30
>> finish: 31 31
>> finish: 32 32
>> finish: 33 33
>> finish: 34 34
>> start: 88 0
>> start: 89 1
>> start: 90 2
>> start: 91 3
>> start: 92 4
>> start: 93 5
>> start: 94 6
>> start: 95 7
>> start: 96 8
>> start: 97 9
>> start: 98 10
>> kill: 11
>> kill: 12
>> kill: 13
>> kill: 14
>> kill: 15
>> kill: 16
>> kill: 17
>> kill: 18
>> kill: 19
>> kill: 20
>> kill: 21
>> kill: 22
>> kill: 23
>> kill: 24
>> kill: 25
>> kill: 26
>> kill: 27
>> kill: 28
>> kill: 29
>> kill: 30
>> kill: 31
>> kill: 32
>> kill: 33
>> kill: 34
>> finish: 35 35
>> finish: 36 36
>> finish: 37 37
>> finish: 38 38
>> finish: 39 39
>> finish: 40 40
>> finish: 44 44
>> finish: 46 46
>> finish: 50 50
>> finish: 51 51
>> finish: 86 86
>> kill: 35
>> kill: 36
>> kill: 37
>> kill: 38
>> kill: 39
>> kill: 40
>> kill: 44
>> kill: 46
>> kill: 50
>> kill: 51
>> kill: 86
>> finish: 88 0
>> finish: 89 1
>> finish: 90 2
>> finish: 91 3
>> finish: 92 4
>> finish: 93 5
>> finish: 94 6
>> finish: 95 7
>> finish: 96 8
>> finish: 97 9
>> finish: 98 10
>> finish: 42 42
>> finish: 43 43
>> finish: 45 45
>> finish: 48 48
>> finish: 54 54
>> finish: 55 55
>> finish: 56 56
>> finish: 57 57
>> finish: 58 58
>> finish: 59 59
>> finish: 60 60
>> finish: 61 61
>> finish: 62 62
>> finish: 63 63
>> finish: 64 64
>> finish: 65 65
>> finish: 66 66
>> finish: 67 67
>> finish: 68 68
>> finish: 69 69
>> finish: 70 70
>> finish: 71 71
>> finish: 72 72
>> finish: 76 76
>> finish: 78 78
>> finish: 79 79
>> finish: 80 80
>> finish: 82 82
>> finish: 83 83
>> finish: 84 84
>> kill: 0
>> kill: 1
>> kill: 2
>> kill: 3
>> kill: 4
>> kill: 5
>> kill: 6
>> kill: 7
>> kill: 8
>> kill: 9
>> kill: 10
>> kill: 42
>> kill: 43
>> kill: 45
>> kill: 48
>> kill: 54
>> kill: 55
>> kill: 56
>> kill: 57
>> kill: 58
>> kill: 59
>> kill: 60
>> kill: 61
>> kill: 62
>> kill: 63
>> kill: 64
>> kill: 65
>> kill: 66
>> kill: 67
>> kill: 68
>> kill: 69
>> kill: 70
>> kill: 71
>> kill: 72
>> kill: 76
>> kill: 78
>> kill: 79
>> kill: 80
>> kill: 82
>> kill: 83
>> kill: 84
>> finish: 73 73
>> finish: 75 75
>> finish: 77 77
>> finish: 85 85
>> finish: 87 87
>> kill: 73
>> kill: 75
>> kill: 77
>> kill: 85
>> kill: 87
>> finish: 47 47
>> finish: 52 52
>> kill: 47
>> kill: 52
>> finish: 81 81
>> kill: 81
>> poll 0:
>>
>> ________________________________
>> From: Eric Iverson <[email protected]>
>> To: Programming forum <[email protected]>
>> Sent: Friday, October 6, 2017 9:36 AM
>> Subject: Re: [Jprogramming] qrun - jcs - zmq
>>
>> Pascal,
>>
>> The logfile_jcs_ includes writes from started tasks that are interspersed
>> with the screen output. Both output are useful.
>>
>> Please get a simple failure and send me the text of the session as well as
>> the text of the logfile_jcs_.
>>
>> At that same time give me the output of windows command:
>> ...> tasklist /FI "imagename eq jconsole.exe"
>>
>> Thanks.
>>
>> On Thu, Oct 5, 2017 at 9:52 PM, 'Pascal Jasmin' via Programming <
>> [email protected]> wrote:
>>
>> > each failure leaves behind 1 stranded jconsole task
>> >
>> > ________________________________
>> > From: bill lam <[email protected]>
>> > To: Programming forum <[email protected]>
>> > Sent: Thursday, October 5, 2017 9:09 PM
>> > Subject: Re: [Jprogramming] qrun - jcs - zmq
>> >
>> > The mission of a stress test is to make it fail, and a large number of
>> > tasks is important. Try on jconsole
>> >
>> > qrun 99 99 1
>> > or
>> > 2 qrun 99 99 2
>> > and eventually
>> > qrun each 500#<99 99 1
>> >
>> > Any failure would mean it is unfit for serious production use.
>> >
>> > I don't think the number of cores would affect its stability.
>> >
>> > Did you check task manager for any stranded jconsole instances?
>> >
>> > On Oct 6, 2017 8:43 AM, "'Pascal Jasmin' via Programming" <
>> > [email protected]> wrote:
>> >
>> > > with a separate program running on 6 cores,
>> > >
>> > > I can run in jqt without problem,
>> > >
>> > > qrun each 10 # < 99 5 3
>> > >
>> > > However, most (many at least) runs with more tasks, fail
>> > >
>> > > btw, your suggestions to use jconsole with ctrl-c apply just fine with
>> > > jqt and jbreak.bat (and debug invoked at break)
>> > >
>> > > the logfile in ~temp, seems to just repeat the console output.
>> > >
>> > > There is a pattern to nearly all of the current failures:
>> > >
>> > > 1. It is hanging on terminating the last task "kill 98". All runs always
>> > > print "finished lastjob task", and hang on killing the task of the last
>> > > finish. (not always the last job to finish last)
>> > >
>> > > there is no noticeable effect on success from adding an x parameter.
>> > >
>> > > ________________________________
>> > > From: Eric Iverson <[email protected]>
>> > > To: Programming forum <[email protected]>
>> > > Sent: Thursday, October 5, 2017 4:43 PM
>> > > Subject: [Jprogramming] qrun - jcs - zmq
>> > >
>> > > Pascal (and others interested in the qrun problem),
>> > >
>> > > I was happy when I was able to repeat the hang on my windows system. And
>> > > then it went away. A race condition that depends on the weather?
>> > >
>> > > I have updated zmq/jcs addons with an improved qrun that logs more info.
>> > >
>> > > ctrl+c can be very useful in working with zmq. It is best to use jconsole
>> > > in tracking down this problem. Jqt and JHS introduce unnecessary
>> > > complications.
>> > >
>> > > Windows also complicates this as its support for ctrl+c has some problems
>> > > vs zmq and sockets.
>> > >
>> > > In going over all the reports it seems that the problem is that one of the
>> > > early tasks started never finishes its first request. The problem seems to
>> > > be a race between starting the task and the first request to it.
>> > >
>> > > The new versions should help track this down.
>> > >
>> > > Please try the following and give back the results:
>> > >
>> > > 1. start jconsole
>> > >    load'~addons/net/jcs/qrun.ijs'
>> > >    qrun 99 99 1
>> > >
>> > > Poll now has a timeout. If you see poll line repeated every 5 seconds, you
>> > > are likely hung waiting for something that isn't going to happen. The good
>> > > news is that your session should respond to ctrl+c within 5 seconds.
>> > >
>> > > qrun now writes a logfile that might have some hints as to the problem.
>> > >
>> > > After qrun has hung, and you have done ctrl+c, take a look at: fread
>> > > logfile_jcs_
>> > >
>> > > Please pass the contents of that file to me as it might help track this
>> > > down.
>> > >
>> > > ***
>> > >
>> > > if it is a race between starting a task and sending it the 1st request, the
>> > > problem might 'go away' if we add a sleep between starting all the tasks
>> > > and starting any jobs. This is not a fix, but provides more info.
>> > >
>> > > If you can get the hang repeatedly, please see if the following avoids
>> > > the hang.
>> > >
>> > > 2 qrun 99 99 2 NB. sleep 2 seconds before starting requests
>> > >
>> > > ***
>> > >
>> > > Has anyone seen this problem on Linux? Can we say it is possibly a
>> > > Windows-only problem?
>> > >
>> > > ----------------------------------------------------------------------
>> > > For information about J forums see http://www.jsoftware.com/forums.htm
