MADA can create tokens that are bar characters (i.e. | ); you need to rename them to something like BAR. Moses treats the bar as a factor delimiter, so a token containing one is read as having two factors, hence the message you are seeing.
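For illustration, here is a minimal sketch of that renaming step in Python. The function names, the BAR placeholder and the sample lines are all made up for this example; nothing here is part of MADA or Moses. It assumes the MADA+TOKAN output is whitespace-tokenized, one sentence per line.

```python
# Hypothetical helper: replace every "|" in tokenized text with a placeholder
# so that Moses does not read it as a factor delimiter.

def escape_bars(line, placeholder="BAR"):
    """Return the line with each "|" replaced, tokens re-joined by single spaces."""
    tokens = line.split()
    return " ".join(tok.replace("|", placeholder) for tok in tokens)


def escape_corpus(lines, placeholder="BAR"):
    """Apply escape_bars to every line, keeping the line count unchanged."""
    return [escape_bars(line, placeholder) for line in lines]


# Made-up sample lines, only to show the effect.
sample = ["w+ ktb | Al+ ktAb", "foo|bar baz", ""]
for out in escape_corpus(sample):
    print(out)
```

Empty lines stay empty, so the file remains aligned line by line with the English side. Whatever placeholder you pick, the same replacement would need to be applied to the training, tuning and test input alike.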
(i've been using MADA+TOKAN for Arabic, using the D2 setting)

Miles

On 7 May 2010 23:26, David Edelstein <[email protected]> wrote:
> Hello,
>
> I'm using Moses to do some SMT on Arabic, experimenting with
> diacritized vs. undiacritized Arabic training corpora. (I am using
> MADA+TOKAN to perform automatic diacritization.) So, if anyone happens
> to be specifically interested in Arabic, has some tips on using Moses
> for Arabic (right now I am just trying to get a baseline system
> running, so I haven't even begun exploring which parameters I need to
> tweak from the defaults), or can give me any other insights, I'd be
> very pleased to talk to you about it off-list; please email me.
>
> Now, I have a specific question and a specific problem, to which I
> have not found a solution by searching the archives.
>
> 1. There are two scripts referenced in scripts/released-files (read by
> the scripts Makefile):
> training/train-factored-phrase-model.perl
> training/filter-and-binarize-model-given-input.pl
>
> These scripts do not exist in the most recent SVN release so 'make
> release' reports an error since obviously it cannot install them.
>
> The tutorials alternately reference train-factored-phrase-model.perl
> and train-model.perl; reading the latter, it seems to do factored
> training. Is this just an error (and something that should be updated
> in the online docs and released-files), and I should only be using
> train-model.perl? Or is there a difference between the two scripts?
> And is the same true of
> training/filter-and-binarize-model-given-input.pl vs.
> filter-model-given-input.pl?
>
> 2. I went through the entire tutorial using the French-English
> Europarl data sets, and got it working. Now I'm going through the same
> process with my Arabic-English parallel corpora. I've gotten as far as
> tuning. I've been trying to use train-model.perl, and it gets to this
> part:
>
> "<my-moses-dir>/moses-cmd/src/moses -v 0 -config
> <my-model-dir>/moses.ini -inputtype 0 -w 0.000000 -lm 0.333333 -d
> 0.333333 -tm 0.100000 0.066667 0.100000 0.066667 0.000000
> -n-best-list run1.best100.out 100 -i <my-arabic-input-file> > run1.out
>
> It generates run1.best100.out and run1.out, but then chokes with this
> error message:
>
> Translation took 0.060 seconds
> Finished translating
> [ERROR] Malformed input at
> Expected input to have words composed of 1 factor(s) (form FAC1|FAC2|...)
> but instead received input with 2 factor(s).
> Aborted
>
> So I gather somewhere I have a setting wrong, but I cannot figure out
> where it is. I basically followed the exact same steps with my
> Arabic-English corpora as in the tutorial, just substituting my own
> training data. I'm not trying to do factored training at this time.
>
> Any advice appreciated. Thanks!
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
