Good call, Ken. I looked in dmesg on the node:

Out of memory: kill process 26224 (bash) score 1277831 or a child
Killed process 26493 (moses) vsz:20220460kB, anon-rss:16643824kB, file-rss:404kb


That node has 32 GB of RAM installed. So I guess the OS must have
decided that moses was using too much memory and killed it. Mystery
solved.
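
For anyone hitting the same symptom, here is a minimal Python sketch of the two checks from this thread: decoding the exit code and reading the "Killed process" line from dmesg. The function names are hypothetical, and the regular expression assumes the dmesg format quoted above; other kernel versions word this line differently, so treat the pattern as something to adapt.

```python
import re

def signal_from_exit_code(code):
    """Return the signal number implied by a shell exit code, or None.

    Shells report a process killed by signal N as exit code 128 + N,
    so 137 corresponds to signal 9 (SIGKILL).
    """
    if code > 128:
        return code - 128
    return None

# Assumption: matches the "Killed process" line format shown in this thread.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"vsz:(?P<vsz_kb>\d+)kB, anon-rss:(?P<anon_kb>\d+)kB"
)

def parse_oom_kill(line):
    """Extract pid, process name, and memory sizes (GiB) from a dmesg line."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    kb_per_gib = 1024 ** 2
    return {
        "pid": int(match["pid"]),
        "name": match["name"],
        "vsz_gib": int(match["vsz_kb"]) / kb_per_gib,
        "anon_rss_gib": int(match["anon_kb"]) / kb_per_gib,
    }

if __name__ == "__main__":
    line = ("Killed process 26493 (moses) vsz:20220460kB, "
            "anon-rss:16643824kB, file-rss:404kb")
    info = parse_oom_kill(line)
    print("signal:", signal_from_exit_code(137))
    print("{name} (pid {pid}): {anon_rss_gib:.1f} GiB resident".format(**info))
```

On the line quoted above, this reports moses holding roughly 15.9 GiB of anonymous memory at the moment it was killed.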

Thanks,
Lane


On Fri, Feb 17, 2012 at 4:30 PM, Lane Schwartz <[email protected]> wrote:
> Kenneth,
>
> What is the OOM killer?
>
> Thanks,
> Lane
>
> On Fri, Feb 17, 2012 at 4:08 PM, Kenneth Heafield <[email protected]> wrote:
>> This really looks like the OOM killer.  You can verify this by looking
>> at dmesg on the node.
>>
>> Kenneth
>>
>> On 02/17/2012 03:46 PM, Lane Schwartz wrote:
>>> I do wonder whether the grid engine killed the job, but according to
>>> qmon, all of the hard and soft limits are set to INFINITY, so I don't
>>> know why it would have killed the job.
>>>
>>> I'm posting that question to the SGE mailing list, so maybe they'll
>>> have an idea.
>>>
>>> Thanks,
>>> Lane
>>>
>>>
>>> On Fri, Feb 17, 2012 at 3:39 PM, Barry Haddow
>>> <[email protected]>  wrote:
>>>> Hi Lane
>>>>
>>>> According to your qacct, your job was running for over 2 days and hit 25G
>>>> of memory. Could the system have killed the job for exceeding some resource
>>>> limit?
>>>>
>>>> I think it's the shell that prints 'sh: line 1: 29188 Killed', and the
>>>> 29188 is the pid.
>>>>
>>>> cheers - Barry
>>>>
>>>> On Friday 17 Feb 2012 19:41:35 Lane Schwartz wrote:
>>>>> Hi all,
>>>>>
>>>>> A number of my jobs keep dying during MERT, and I'm having trouble
>>>>> tracking down what's going on. I submit all of my jobs using SGE, so
>>>>> it's possible there's an interaction there.
>>>>>
>>>>> Can anyone help me understand what's going on below:
>>>>>
>>>>>
>>>>> sh: line 1: 29188 Killed
>>>>> /free/lane/slm-merging-trunk/moses-cmd/src/moses -config
>>>>> /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
>>>>> -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286
>>>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571
>>>>> -tm 0.035714 0.035714 0.035714 0.035714 0.035714
>>>>> -n-best-list run1.best100.out 100
>>>>> -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
>>>>> > run1.out
>>>>>
>>>>> Exit code: 137
>>>>> The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
>>>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
>>>>> 0.035714 0.035714 0.035714 0.035714 0.035714
>>>>>
>>>>>
>>>>>
>>>>> I've searched for the meaning of exit code 137, and what I've read
>>>>> says that's the exit code for a process that received kill signal 9.
>>>>>
>>>>> I'm especially puzzled by "sh: line 1: 29188 Killed".
>>>>>
>>>>> I'm pretty sure that the safesystem function in the mert-moses.pl
>>>>> script is printing "Exit code: 137", and I'm assuming that the moses
>>>>> command itself is being launched by the "system(@_)" command within
>>>>> that same safesystem function. But I don't know what is responsible
>>>>> for printing "sh: line 1: 29188 Killed", or what "line 1" and "29188"
>>>>> refer to.
>>>>>
>>>>> For what it's worth, I'm attaching the results of running qacct -j on
>>>>> the job after it died. I don't think it is relevant, but I guess it
>>>>> could be.
>>>>>
>>>>> Thanks,
>>>>> Lane
>>>>>
>>>
>>>
>>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"
