Hi Dingyuan

I have your script and model running, but so far it has not reported any errors. It's at iteration 27, and I'm using the latest Moses from git.
How long should I expect it to run before it hits an error? Could it be affected by the locale setting? Have you tried running without xml-input to see if you still have the problem?

cheers - Barry

On 19/01/16 05:43, Dingyuan Wang wrote:
> Hi Barry,
>
> I've uploaded the model:
> https://mega.nz/#!UsVSBCBJ!e5IATFvLqrCb5zhmDekLn8NOGw4PSD9RRQLGQeKEvNY
>
> To test the model, I included a script 'repeatnbest.sh' which runs moses
> repeatedly until an encoding error occurs.
>
> The files run7.best100.out and run7.out in the archive are from the last
> run that produced the error.
>
> It seems that it is WordTranslationFeature that causes the problem.
>
> On 2016-01-19 00:03, Barry Haddow wrote:
>> Hi Dingyuan
>>
>> Something is going wrong with the construction or outputting of feature
>> names, and it looks like it's WordTranslationFeature that's the problem.
>> Does the problem go away if you do not use word translation features?
>>
>> If you could make available a model that reproduces the nbest list
>> construction then I would have a chance to debug it,
>>
>> cheers - Barry
>>
>> On 18/01/16 15:32, Dingyuan Wang wrote:
>>> Hi Barry,
>>>
>>> I've checked all the models and corpora with the script, without finding
>>> any encoding problem.
>>>
>>> I also find that all such errors in the nbest list occur only in the
>>> feature list (3 different samples), without affecting the translation
>>> result. Therefore, the phrase table or training corpus may not be the
>>> problem.
>>>
>>> On 2016-01-18 23:04, Barry Haddow wrote:
>>>> Hi Dingyuan
>>>>
>>>> Are these encoding errors present in your phrase table? Are they present
>>>> in your training corpus? Since they appear in the word translation
>>>> features, and you are using a shortlist, are they in the shortlist files
>>>> in the model directory? (These have names with "topn" in them afaik).
>>>>
>>>> File-system errors are unlikely, and for the most part Moses treats text
>>>> as byte strings so encoding errors usually trace back to the source
>>>> text.
>>>>
>>>> cheers - Barry
>>>>
>>>> On 18/01/16 14:56, Dingyuan Wang wrote:
>>>>> Hi Barry,
>>>>>
>>>>> "The ones starting with the "@"" are due to corrupted bytes in the
>>>>> nbest list.
>>>>>
>>>>> This kind of corruption occurs from time to time. I wonder if it comes
>>>>> from memory errors or filesystem failure or some kind of
>>>>> pointer/encoding problem in moses.
>>>>>
>>>>> I've written a script to find such corrupted lines:
>>>>>
>>>>> https://gist.github.com/gumblex/0d9d0848b435e4f9818f
>>>>>
>>>>> On 2016-01-18 20:42, Barry Haddow wrote:
>>>>>> Hi Dingyuan
>>>>>>
>>>>>> The extractor expects feature names to contain an underscore (not sure
>>>>>> exactly why) but some of yours don't, and Moses skips them,
>>>>>> interpreting their values as extra dense features.
>>>>>>
>>>>>> The attached screenshot shows my view of the offending names. The ones
>>>>>> starting with the "@" are the problem. So it does look like the nbest
>>>>>> list is corrupted. Can you run the decoder on just that sentence, to
>>>>>> create an uncompressed version of the nbest list?
>>>>>>
>>>>>> cheers - Barry
>>>>>>
>>>>>> On 18/01/16 12:02, Dingyuan Wang wrote:
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> Attached is the zgrep result.
>>>>>>> I found that in the middle of line 61 a few bytes are corrupted. Is
>>>>>>> that a moses problem, or does my memory have a problem?
>>>>>>>
>>>>>>> I also checked other files using iconv; they are all OK in UTF-8.
>>>>>>>
>>>>>>> On 2016-01-18 19:32, Barry Haddow wrote:
>>>>>>>> Hi Dingyuan
>>>>>>>>
>>>>>>>> Yes, that's very possible. The error could be in extracting
>>>>>>>> features.dat from the nbest list. Are you able to post the nbest
>>>>>>>> list? Or at least the entries for sentence 16?
>>>>>>>>
>>>>>>>> Run something like
>>>>>>>>
>>>>>>>> zgrep "^16 " tuning/tmp.1/run7.best100.out.gz
>>>>>>>>
>>>>>>>> cheers - Barry
>>>>>>>>
>>>>>>>> On 18/01/16 11:24, Dingyuan Wang wrote:
>>>>>>>>> Hi Barry,
>>>>>>>>>
>>>>>>>>> I have rerun the EMS after the first email, and then posted the
>>>>>>>>> recent results, so the line changed.
>>>>>>>>>
>>>>>>>>> I just use the latest code and the EMS script, with pretty much
>>>>>>>>> default settings. The EMS setting is:
>>>>>>>>>
>>>>>>>>> sparse-features = "target-word-insertion top 50,
>>>>>>>>> source-word-deletion top 50, word-translation top 50 50,
>>>>>>>>> phrase-length"
>>>>>>>>>
>>>>>>>>> I suspect there is something unexpected in the extractor.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2016-01-18 19:03, Barry Haddow wrote:
>>>>>>>>>> Hi Dingyuan
>>>>>>>>>>
>>>>>>>>>> In fact it is not the sparse features nor the Asian characters
>>>>>>>>>> that are the problem. The offending line has 17 dense features,
>>>>>>>>>> yet your model has 14 dense features.
>>>>>>>>>>
>>>>>>>>>> The string "1 1 1" appears directly after the language model
>>>>>>>>>> feature in line 1694, in your attachment, adding the extra 3
>>>>>>>>>> features. Note that this is not the line you mentioned in your
>>>>>>>>>> earlier email.
>>>>>>>>>>
>>>>>>>>>> I have no idea why there are extra features. Have you made
>>>>>>>>>> changes to any of the core Moses features?
>>>>>>>>>>
>>>>>>>>>> best wishes
>>>>>>>>>> Barry
>>>>>>>>>>
>>>>>>>>>> The offending line:
>>>>>>>>>> what(): Error in line "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39 18 -26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 WT_“~“=1 WT_”~”=1 WT_曰~说=1 PL_s3=5 PL_3,2=2 PL_3,3=3 PL_2,3=4 PL_t3=7 PL_s1=5 PL_1,2=2 PL_1,1=3 PL_t1=4 PL_2,2=3 PL_t2=7 PL_s2=8 PL_2,1=1 WT_有~有=1 WT_!~!=1 WT_其~的=1 WT_其~他=1 WT_不~也=1 WT_不~没=1 WT_而~而=1 WT_而~却=1 WT_祖逖~逖=1 WT_祖逖~祖=1 WT_逖~祖=1 WT_逖~逖=1 WT_大~大江=1 WT_者~的=1 WT_者~人=1 WT_江~大江=1 WT_渡~渡过=1 WT_复~又=1 WT_余~有=1 WT_誓~发誓=1 WT_楫~木=1 WT_江~长江=1 WT_击~击=1 WT_将~带领=1 WT_济~成功=1 WT_中原~中原=1 WT_清~廓清=1 WT_如~像=1 WT_楫~戢=1 WT_能~能=1 WT_中~中流=1 WT_流~中流=1 WT_部曲~部下=1 " of ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 18/01/16 10:37, Dingyuan Wang wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I've attached that. The line number is 1694.
>>>>>>>>>>>
>>>>>>>>>>> On 2016-01-18 16:43, Barry Haddow wrote:
>>>>>>>>>>>> Hi Dingyuan
>>>>>>>>>>>>
>>>>>>>>>>>> Is it possible to attach the features.dat file that is
>>>>>>>>>>>> causing the error?
>>>>>>>>>>>> Almost certainly Moses is failing to parse the line because of
>>>>>>>>>>>> the Asian characters in the feature names,
>>>>>>>>>>>>
>>>>>>>>>>>> cheers - Barry
>>>>>>>>>>>>
>>>>>>>>>>>> On 16/01/16 15:58, Dingyuan Wang wrote:
>>>>>>>>>>>>> I ran
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~/software/moses/bin/kbmira -J 75 --dense-init run7.dense
>>>>>>>>>>>>> --sparse-init run7.sparse-weights
>>>>>>>>>>>>> --ffile run1.features.dat --ffile run2.features.dat
>>>>>>>>>>>>> --ffile run3.features.dat --ffile run4.features.dat
>>>>>>>>>>>>> --ffile run5.features.dat --ffile run6.features.dat
>>>>>>>>>>>>> --ffile run7.features.dat
>>>>>>>>>>>>> --scfile run1.scores.dat --scfile run2.scores.dat
>>>>>>>>>>>>> --scfile run3.scores.dat --scfile run4.scores.dat
>>>>>>>>>>>>> --scfile run5.scores.dat --scfile run6.scores.dat
>>>>>>>>>>>>> --scfile run7.scores.dat -o /tmp/mert.out
>>>>>>>>>>>>>
>>>>>>>>>>>>> in the tuning/tmp.1 directory, which will certainly replicate
>>>>>>>>>>>>> the error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2016-01-16 23:42, Hieu Hoang wrote:
>>>>>>>>>>>>>> The mert script prints out every command it runs. You should
>>>>>>>>>>>>>> be able to replicate the error by running the last command
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 16 Jan 2016 14:18, "Dingyuan Wang" <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry, but I can't reliably replicate the same problem when
>>>>>>>>>>>>>> running TUNING_tune.1 alone. There is no character '_' in
>>>>>>>>>>>>>> the test set or top50 list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm using sparse-features = "target-word-insertion top 50,
>>>>>>>>>>>>>> source-word-deletion top 50, word-translation top 50 50,
>>>>>>>>>>>>>> phrase-length"
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've attached some related files from EMS and the EMS config.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://mega.nz/#!xs0SFKxL!M_RTBp1JGX24-b4xlYYLP-bLXKiC_Sl-p96x55avAB4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-01-16 02:45, Hieu Hoang wrote:
>>>>>>>>>>>>>> > could you make your model files available for download so I
>>>>>>>>>>>>>> > can replicate this problem.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > it seems like you're using a feature function with sparse
>>>>>>>>>>>>>> > scores. I think the character '_' must be escaped.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On 12/01/16 04:00, Dingyuan Wang wrote:
>>>>>>>>>>>>>> >> Hi all,
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I'm using EMS for doing experiments. Every time, kbmira
>>>>>>>>>>>>>> >> died with SIGABRT when tuning in one direction, while
>>>>>>>>>>>>>> >> tuning in the opposite direction (same config and test set)
>>>>>>>>>>>>>> >> was successful.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> The mert.log (stderr) shows the following:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> kbmira with c=0.01 decay=0.999 no_shuffle=0
>>>>>>>>>>>>>> >> Initialising random seed from system clock
>>>>>>>>>>>>>> >> Found 15323 initial sparse features
>>>>>>>>>>>>>> >> ....terminate called after throwing an instance of
>>>>>>>>>>>>>> >> 'MosesTuning::FileFormatException'
>>>>>>>>>>>>>> >> what(): Error in line "-4.51933 0 0 -6.09733 0 0 0 -121.556 2 -20 12 -31.6201 -38.5211 -26.5112 -60.6166 WT_,~,=2 WT_?~?=1 PL_s1=4 PL_s3=1 PL_3,3=1 PL_2,2=3 PL_1,2=1 PL_2,1=3 PL_t1=6 PL_t2=4 PL_t3=2 PL_2,3=1 PL_s2=7 PL_1,1=3 WT_未~没有=1 WT_何~怎么=1 WT_何~能=1 WT_方~正在=1 WT_又~还=1 WT_君~您=2 WT_趣~向=1 WT_趣~奔=1 WT_有~没有=1 WT_往~去=1 WT_官~官员=1 WT_假~借=1 WT_檄~檄文=1 WT_文~文告=1 WT_上~上级=1 WT_为~呢=1 WT_在~正在=1 " of run7.features.dat
>>>>>>>>>>>>>> >> Aborted
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> I think since run7.features.dat is generated by some scripts,
>>>>>>>>>>>>>> >> I wouldn't be responsible for making the bad format. Last
>>>>>>>>>>>>>> >> time it also died, I removed the likely offending line in
>>>>>>>>>>>>>> >> the test set, but this time another line appears.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> --
>>>>>>>>>>>>>> >> Dingyuan Wang
>>>>>>>>>>>>>> >> _______________________________________________
>>>>>>>>>>>>>> >> Moses-support mailing list
>>>>>>>>>>>>>> >> [email protected]
>>>>>>>>>>>>>> >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Dingyuan Wang (gumblex)
>>>>>>>>>>>>>>
>>

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
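
The dense-feature mismatch discussed in this thread (17 values on the offending line against the 14 the model has) can be checked across a whole features.dat file before kbmira runs. The Python below is only a rough sketch under stated assumptions, not part of Moses: the function names are hypothetical, it assumes dense values are bare numbers at the start of a line and sparse features are name=value tokens (as in the error messages quoted above), and it assumes any line that does not start with a number is a header or footer to be skipped.

```python
from typing import Iterable, Iterator, Tuple


def count_dense(line: str) -> int:
    """Count the leading tokens that are bare numbers (assumed dense values)."""
    count = 0
    for token in line.split():
        if "=" in token:  # assumed sparse feature token, e.g. WT_a~b=1
            break
        try:
            float(token)
        except ValueError:  # not a number: stop counting
            break
        count += 1
    return count


def find_mismatches(
    lines: Iterable[str], expected_dense: int
) -> Iterator[Tuple[int, int]]:
    """Yield (1-based line number, dense count) where the count is unexpected."""
    for number, line in enumerate(lines, start=1):
        found = count_dense(line)
        # Assumption: lines with no leading number are headers or footers.
        if found and found != expected_dense:
            yield number, found


# Example with a shortened copy of the offending line quoted in this thread.
sample = [
    "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39 18 "
    "-26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 PL_s3=5",
]
for number, found in find_mismatches(sample, expected_dense=14):
    print(f"line {number}: {found} dense values, expected 14")
```

To scan a real file, pass an open file handle (for example one opened with encoding="utf-8", errors="replace") to find_mismatches in place of the sample list. A line flagged this way points back to the matching nbest entry, which is where the corrupted feature names would need to be looked for.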
