J C Read wrote:
> According to wikipedia http://en.wikipedia.org/wiki/SIGSEGV signal 11 indicates
> an invalid memory reference.
Yes, definitely, what we also call a "coredump" under AIX.
> I eventually figured out that this was because of the data I was using.
That's often the case: an unfortunate data condition that nobody expected
and that the error recovery does not account for. Such problems are usually
hard to track down, though...
> Things to check:
> Is the data sentence aligned?
Yes, europarl.lowercased.0-0.fr has 73835 lines:
reprise de la session
je déclare reprise la session du parlement européen qui avait (...)
(...)
des paroles , pas d' action .
en attendant , deux mille personnes ont perdu la vie inutilement , (...)
and europarl.lowercased.0-0.en has 73835 lines:
resumption of the session
i declare resumed the session of the european parliament adjourned
on (...)
(...)
more talk . no action .
meanwhile , two thousand people in the last year have needlessly (...)
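For what it's worth, matching line counts alone would not catch an empty line on one side, so a slightly stricter check may be worth running. Below is a minimal sketch, assuming both files are UTF-8 text with one sentence per line; the function name `check_alignment` is made up for illustration and is not part of Moses or GIZA.

```python
from itertools import zip_longest


def check_alignment(src_path, tgt_path):
    """Return (pair_count, problems) for two one-sentence-per-line files.

    Hypothetical helper: it only checks that both files have the same
    number of lines and that no line is empty on either side.
    """
    problems = []
    pairs = 0
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for lineno, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                # One file ran out of lines before the other.
                problems.append((lineno, "one file is shorter than the other"))
                break
            if not s.strip() or not t.strip():
                problems.append((lineno, "empty sentence on one side"))
            pairs += 1
    return pairs, problems
```

Calling `check_alignment("europarl.lowercased.0-0.fr", "europarl.lowercased.0-0.en")` and getting back an empty problem list would suggest the two files are at least superficially aligned; it says nothing about whether the sentences are actually translations of each other.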
> Has the data been cleaned with the clean script? (try using sentences of min 1
> and max 100)
Yes, it went through the script, with the recommended parameters:
    bin/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl \
        working-dir/corpus/europarl.tok fr en working-dir/corpus/europarl.clean \
        1 40
which reduced the number of sentences from the initial 100K to 73835.
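To make explicit what I understand the two trailing arguments (1 and 40) to mean, here is a rough sketch of a token-count filter. It is not the Perl script's actual code, which may apply further checks; the names `keep_pair` and `clean_pairs` are hypothetical, and tokens are assumed to be whitespace-separated.

```python
def keep_pair(src, tgt, min_len=1, max_len=40):
    """Keep a sentence pair only if both sides have between
    min_len and max_len whitespace-separated tokens (inclusive)."""
    src_len = len(src.split())
    tgt_len = len(tgt.split())
    return (min_len <= src_len <= max_len
            and min_len <= tgt_len <= max_len)


def clean_pairs(pairs, min_len=1, max_len=40):
    """Filter an iterable of (source, target) sentence pairs."""
    return [(s, t) for s, t in pairs if keep_pair(s, t, min_len, max_len)]
```

Under this reading, a pair is dropped as soon as either side is empty or longer than 40 tokens, which would be consistent with the corpus shrinking after cleaning.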
Any other suggestions?
Say, could it be that the very smallness of my training data (only
73K sentences) is causing unexpected underflows or the like in GIZA?
Doesn't it make sense to try running the whole process on a small
dataset to start with? (I don't have access to powerful machines at the
moment and am running this on my personal laptop.)
Thanks for your support, much appreciated.
--
Hubert Crépy
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support