This really looks like the OOM killer. You can verify this by looking at dmesg on the node.
Kenneth

On 02/17/2012 03:46 PM, Lane Schwartz wrote:
> I do wonder whether the grid engine killed the job, but according to
> qmon, all of the hard and soft limits are set to INFINITY, so I don't
> know why it would have killed the job.
>
> I'm posting that question to the SGE mailing list, so maybe they'll
> have an idea.
>
> Thanks,
> Lane
>
>
> On Fri, Feb 17, 2012 at 3:39 PM, Barry Haddow
> <[email protected]> wrote:
>> Hi Lane
>>
>> According to your qacct, your job was running for over 2 days and hit 25G of
>> memory. Could the system have killed the job for exceeding some resource
>> limit?
>>
>> I think it's the shell that prints 'sh: line 1: 29188 Killed', and the 29188
>> is the pid.
>>
>> cheers - Barry
>>
>> On Friday 17 Feb 2012 19:41:35 Lane Schwartz wrote:
>>> Hi all,
>>>
>>> A number of my jobs keep dying during MERT, and I'm having trouble
>>> tracking down what's going on. I submit all of my jobs using SGE, so
>>> it's possible there's an interaction there.
>>>
>>> Can anyone help me understand what's going on below:
>>>
>>>
>>> sh: line 1: 29188 Killed
>>> /free/lane/slm-merging-trunk/moses-cmd/src/moses -config
>>> /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.
>>> dist05.synlm50.ini -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d
>>> 0.053571
>>> 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714
>>> 0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100
>>> -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
>>>
>>>> run1.out
>>>
>>> Exit code: 137
>>> The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
>>> 0.035714 0.035714 0.035714 0.035714 0.035714
>>>
>>>
>>>
>>> I've searched for the meaning of exit code 137, and what I've read
>>> says that's the exit code for a process that received kill signal 9.
>>>
>>> I'm especially puzzled by "sh: line 1: 29188 Killed".
>>>
>>> I'm pretty sure that the safesystem function in the moses-mert.pl
>>> script is printing "Exit code: 137", and I'm assuming that the moses
>>> command itself is being launched by the "system(@_)" command within
>>> that same safesystem function. But I don't know what is responsible
>>> for printing "sh: line 1: 29188 Killed", or what "line 1" and "29188"
>>> refer to.
>>>
>>> For what it's worth, I'm attaching the results of running qacct -j on
>>> the job after it died. I don't think it is relevant, but I guess it
>>> could be.
>>>
>>> Thanks,
>>> Lane
>>>
>
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
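
For anyone reading this thread later, here is a rough sketch of the two checks discussed above: decoding a shell-style exit status such as 137 into a signal number, and scanning saved kernel log text for signs of the OOM killer. The function names and the regular expression are illustrative assumptions, not part of Moses or SGE, and the exact wording of OOM messages differs between kernel versions, so treat the pattern as a starting point only.

```python
import re
import signal
from typing import List


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Shells report 128 + N when a child was terminated by signal N,
    so 137 corresponds to signal 9.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


# Assumed phrases; real kernel messages vary by version.
OOM_HINTS = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_lines(kernel_log: str) -> List[str]:
    """Return the lines of a kernel log dump that mention the OOM killer."""
    return [line for line in kernel_log.splitlines() if OOM_HINTS.search(line)]


if __name__ == "__main__":
    print(describe_exit_status(137))  # on Linux: killed by signal 9 (SIGKILL)
```

One way to use this would be to save the output of dmesg on the compute node to a file and pass the file contents to find_oom_lines; any matching line that also mentions the pid from the "Killed" message (29188 here) would support the OOM killer explanation.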
