Hi all,

A number of my jobs keep dying during MERT, and I'm having trouble
tracking down what's going on. I submit all of my jobs using SGE, so
it's possible there's an interaction there.

Can anyone help me understand what's going on below:


sh: line 1: 29188 Killed
/free/lane/slm-merging-trunk/moses-cmd/src/moses -config
/scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
-inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571
0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714
0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100
-input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
> run1.out
Exit code: 137
The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
-d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
0.035714 0.035714 0.035714 0.035714 0.035714



I've searched for the meaning of exit code 137, and what I've read
says that's the exit code for a process that received kill signal 9
(SIGKILL), since 137 = 128 + 9.
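
To double-check that reading of the exit code, here is a minimal Python
sketch. It assumes the usual shell convention that a status above 128
means "terminated by signal (status - 128)". The function name
describe_exit_code is my own and is not part of Moses or SGE. Note that
a program can also return a status above 128 on its own, so this is
only a likely interpretation.

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status such as the one after 'Exit code:'."""
    # ASSUMPTION: statuses above 128 follow the shell's 128 + signal convention.
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited normally with status {code}"


print(describe_exit_code(137))
```

On Linux this should report signal 9, which is SIGKILL.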

I'm especially puzzled by "sh: line 1: 29188 Killed".

I'm pretty sure that the safesystem function in the mert-moses.pl
script is printing "Exit code: 137", and I'm assuming that the moses
command itself is being launched by the "system(@_)" command within
that same safesystem function. But I don't know what is responsible
for printing "sh: line 1: 29188 Killed", or what "line 1" and "29188"
refer to.

For what it's worth, I'm attaching the results of running qacct -j on
the job after it died. I don't think it is relevant, but I guess it
could be.

Thanks,
Lane
==============================================================
qname        all.q               
hostname     quad19.scream.lab   
group        scream              
owner        lane                
project      NONE                
department   defaultdepartment   
jobname      de-en.mert          
jobnumber    20337               
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Mon Feb 13 14:08:54 2012
start_time   Mon Feb 13 14:09:05 2012
end_time     Wed Feb 15 14:54:52 2012
granted_pe   NONE                
slots        1                   
failed       0    
exit_status  2                   
ru_wallclock 175547       
ru_utime     175460.360   
ru_stime     21.147       
ru_maxrss    23910412            
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    6545996             
ru_majflt    7568                
ru_nswap     0                   
ru_inblock   3067192             
ru_oublock   22064               
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     9545                
ru_nivcsw    256918              
cpu          175481.507   
mem          2516411.448       
io           4.733             
iow          0.000             
maxvmem      25.026G
arid         undefined
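
To make the raw numbers above easier to read, here is a rough Python
sketch that parses "key value" lines like the ones qacct prints and
converts two of them. The helper names parse_qacct and format_duration
are my own. I am assuming that ru_wallclock is in seconds and that
ru_maxrss is in kilobytes; I have not confirmed the ru_maxrss unit for
this SGE installation.

```python
def parse_qacct(text):
    """Parse 'key value' lines from qacct -j style output into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip separator lines and lines without a value.
        if len(parts) == 2 and not parts[0].startswith("="):
            fields[parts[0]] = parts[1].strip()
    return fields


def format_duration(seconds):
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Three lines copied from the qacct output above.
sample = """\
ru_wallclock 175547
ru_maxrss    23910412
maxvmem      25.026G
"""

fields = parse_qacct(sample)
print("wallclock:", format_duration(fields["ru_wallclock"]))

# ASSUMPTION: ru_maxrss is reported in kilobytes, so divide by 1024**2 for GiB.
maxrss_gib = int(fields["ru_maxrss"]) / 1024**2
print("max RSS (GiB, if the unit is KB):", round(maxrss_gib, 2))
print("max vmem:", fields["maxvmem"])
```

If the kilobyte assumption holds, the resident memory peak would be in
the same range as the maxvmem figure reported by qacct.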
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
