You're right about the null byte, and given the time I've been spending on this training I'm definitely interested in any shortcut that would avoid my having to start from scratch!
The data I'm training on is not Chinese UN data but a pretty large dump of Microsoft software strings in English and French.

Thanks a bunch,
Mirko

-----Original Message-----
From: John Burger [mailto:[email protected]]
Sent: Tuesday, May 12, 2009 2:04 PM
To: Mirko Plitt
Cc: [email protected]
Subject: Re: [Moses-support] PhraseScore dies with signal 11

Mirko Plitt wrote:

> Loading lexical translation table from ./model/lex.f2eline 2 in ./model/lex.f2e
> has wrong number of tokens, skipping:
> 0 ERROR: Execution of: /usr/bin/training/phrase-extract/score ./model/extract.sorted ./model/lex.f2e ./model/phrase-table.half.f2e
> died with signal 11, without coredump

In my experience this means you have a null byte in your data. Did you look at line 2 of model/lex.f2e? I suspect you will find what looks like garbage, depending on what you view it with.

Try this to find lines with null bytes in your original data:

    grep -Pc '[\000]' <files ...>

(If your grep doesn't support Perl-style regexp syntax (-P), you'll have to express that a different way.)

If this turns out to be the problem, and you don't want to run GIZA again from scratch, let me know and I can tell you how I've hacked up the files in ./model/ to restart the Moses training script from step 5.

By the way, do you happen to be using the Chinese UN data? I've found that two years of this data are pretty screwed up, including null bytes. These files obviously got corrupted at some point. I find the UN data to be very frustrating, since it's odd and messy in many different ways. But such large portions!

- John Burger
  MITRE

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
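
For anyone whose grep lacks -P, the null-byte check suggested in the quoted message could also be expressed as a short Python script. This is only a sketch, not part of Moses or of the original thread: the function name lines_with_null_bytes is made up for illustration, and it reports line numbers where the grep -c command prints a count per file.

```python
import sys


def lines_with_null_bytes(path):
    """Return the 1-based numbers of the lines in `path` that contain a null byte.

    Hypothetical helper for illustration; it is not part of Moses.
    """
    hits = []
    # Read in binary mode so that no text decoding can hide or choke on the bytes.
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if b"\x00" in line:
                hits.append(number)
    return hits


if __name__ == "__main__":
    # Usage: python find_nulls.py <files ...>
    for path in sys.argv[1:]:
        hits = lines_with_null_bytes(path)
        # Show the count per file plus the first few offending line numbers.
        print(f"{path}: {len(hits)} line(s) with null bytes {hits[:10]}")
```

Run against the original training corpus files, this should point at the same lines the grep command would count, which makes it easier to inspect or remove them before re-running training.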
