This really looks like the Linux OOM (out-of-memory) killer. You can
verify this by checking the dmesg output on the node where the job ran.
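
A minimal sketch of that check, assuming the node's dmesg output has been saved to a text file. The function names and the regular expression are illustrative only, and kernel message wording varies between versions, so treat the pattern as a starting point rather than a definitive list. The sketch also decodes a shell-style exit status such as the 137 reported in this thread.

```python
import re
import signal

# Assumed wording: kernels typically log "Out of memory", "oom-killer" or
# "Killed process <pid>" when the OOM killer fires; exact text varies by version.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def signal_from_exit_code(code):
    """Return the signal name for a shell-style exit code (128 + signal number), else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # 128 + N where N is not a signal number known on this platform
        return None


def oom_lines(dmesg_text, pid=None):
    """Return lines that look like OOM-killer messages, optionally only those mentioning pid."""
    hits = [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]
    if pid is not None:
        pid_pattern = re.compile(rf"\b{int(pid)}\b")
        hits = [line for line in hits if pid_pattern.search(line)]
    return hits
```

On Linux, signal_from_exit_code(137) maps to SIGKILL, which is consistent with the "Killed" message from the shell. Calling oom_lines(text, pid=29188) on the saved dmesg text (the file name and how you load it are up to you) would then show whether the kernel logged killing that pid.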

Kenneth

On 02/17/2012 03:46 PM, Lane Schwartz wrote:
> I do wonder whether the grid engine killed the job, but according to
> qmon, all of the hard and soft limits are set to INFINITY, so I don't
> know why it would have killed the job.
>
> I'm posting that question to the SGE mailing list, so maybe they'll
> have an idea.
>
> Thanks,
> Lane
>
>
> On Fri, Feb 17, 2012 at 3:39 PM, Barry Haddow
> <[email protected]>  wrote:
>> Hi Lane
>>
>> According to your qacct, your job was running for over 2 days and hit 25G of
>> memory. Could the system have killed the job for exceeding some resource
>> limit?
>>
>> I think it's the shell that prints 'sh: line 1: 29188 Killed', and the 29188
>> is the pid.
>>
>> cheers - Barry
>>
>> On Friday 17 Feb 2012 19:41:35 Lane Schwartz wrote:
>>> Hi all,
>>>
>>> A number of my jobs keep dying during MERT, and I'm having trouble
>>> tracking down what's going on. I submit all of my jobs using SGE, so
>>> it's possible there's an interaction there.
>>>
>>> Can anyone help me understand what's going on below:
>>>
>>>
>>> sh: line 1: 29188 Killed
>>> /free/lane/slm-merging-trunk/moses-cmd/src/moses -config
>>> /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
>>> -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286
>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571
>>> -tm 0.035714 0.035714 0.035714 0.035714 0.035714
>>> -n-best-list run1.best100.out 100
>>> -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
>>> > run1.out
>>>
>>> Exit code: 137
>>> The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
>>> 0.035714 0.035714 0.035714 0.035714 0.035714
>>>
>>>
>>>
>>> I've searched for the meaning of exit code 137, and what I've read
>>> says that's the exit code for a process that received signal 9
>>> (SIGKILL), since 137 = 128 + 9.
>>>
>>> I'm especially puzzled by "sh: line 1: 29188 Killed".
>>>
>>> I'm pretty sure that the safesystem function in the mert-moses.pl
>>> script is printing "Exit code: 137", and I'm assuming that the moses
>>> command itself is being launched by the "system(@_)" command within
>>> that same safesystem function. But I don't know what is responsible
>>> for printing "sh: line 1: 29188 Killed", or what "line 1" and "29188"
>>> refer to.
>>>
>>> For what it's worth, I'm attaching the results of running qacct -j on
>>> the job after it died. I don't think it is relevant, but I guess it
>>> could be.
>>>
>>> Thanks,
>>> Lane
>>>
>
>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
