Hi Dingyuan

Are these encoding errors present in your phrase table? Are they present in your training corpus? Since they appear in the word translation features, and you are using a shortlist, are they in the shortlist files in the model directory? (These have names with "topn" in them, afaik.)
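One way to check those files is a small scan for lines that are not valid UTF-8. This is only a rough sketch, not something shipped with Moses and not the script from the gist mentioned further down the thread; the function names `open_bytes` and `find_bad_utf8` are made up for illustration, and it assumes the files are plain text or gzip-compressed text.

```python
import gzip
import sys


def open_bytes(path):
    """Open a plain or gzip-compressed file for binary reading (hypothetical helper)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def find_bad_utf8(lines):
    """Yield (line_number, byte_offset, raw_line) for each line that is not valid UTF-8."""
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in the line
            yield number, err.start, raw


if __name__ == "__main__":
    # Pass any mix of files: phrase table, training corpus, shortlist files
    for path in sys.argv[1:]:
        with open_bytes(path) as handle:
            for number, offset, raw in find_bad_utf8(handle):
                print(f"{path}:{number}: invalid UTF-8 at byte {offset}: {raw[:60]!r}")
```

Running it over the phrase table, the training corpus and the shortlist files should show whether the bad bytes are already present before decoding, or only appear in the nbest list.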
File-system errors are unlikely, and for the most part Moses treats text as byte strings, so encoding errors usually trace back to the source text.

cheers - Barry

On 18/01/16 14:56, Dingyuan Wang wrote:
> Hi Barry,
>
> The names starting with "@" are due to corrupted bytes in the nbest
> list.
>
> This kind of corruption occurs from time to time. I wonder if it comes
> from memory errors or filesystem failure or some kind of
> pointer/encoding problem in moses.
>
> I've written a script to find such corrupted lines:
>
> https://gist.github.com/gumblex/0d9d0848b435e4f9818f
>
> On 2016-01-18 20:42, Barry Haddow wrote:
>> Hi Dingyuan
>>
>> The extractor expects feature names to contain an underscore (not sure
>> exactly why) but some of yours don't, and Moses skips them, interpreting
>> their values as extra dense features.
>>
>> The attached screenshot shows my view of the offending names. The ones
>> starting with the "@" are the problem. So it does look like the nbest
>> list is corrupted. Can you run the decoder on just that sentence, to
>> create an uncompressed version of the nbest list?
>>
>> cheers - Barry
>>
>> On 18/01/16 12:02, Dingyuan Wang wrote:
>>> Hi Barry,
>>>
>>> Attached is the zgrep result.
>>> I found that in the middle of line 61 a few bytes are corrupted. Is that
>>> a moses problem, or is there a problem with my memory?
>>>
>>> I also checked the other files using iconv; they are all OK in UTF-8.
>>>
>>> On 2016-01-18 19:32, Barry Haddow wrote:
>>>> Hi Dingyuan
>>>>
>>>> Yes, that's very possible. The error could be in extracting features.dat
>>>> from the nbest list. Are you able to post the nbest list? Or at least
>>>> the entries for sentence 16?
>>>>
>>>> Run something like
>>>>
>>>> zgrep "^16 " tuning/tmp.1/run7.best100.out.gz
>>>>
>>>> cheers - Barry
>>>>
>>>> On 18/01/16 11:24, Dingyuan Wang wrote:
>>>>> Hi Barry,
>>>>>
>>>>> I reran the EMS after the first email and then posted the most recent
>>>>> results, so the line changed.
>>>>>
>>>>> I just use the latest code and the EMS script. The settings are pretty
>>>>> much the defaults. The EMS setting is:
>>>>>
>>>>> sparse-features = "target-word-insertion top 50, source-word-deletion
>>>>> top 50, word-translation top 50 50, phrase-length"
>>>>>
>>>>> I suspect there is something unexpected in the extractor.
>>>>>
>>>>> On 2016-01-18 19:03, Barry Haddow wrote:
>>>>>> Hi Dingyuan
>>>>>>
>>>>>> In fact it is not the sparse features nor the Asian characters that are
>>>>>> the problem. The offending line has 17 dense features, yet your model
>>>>>> has 14 dense features.
>>>>>>
>>>>>> The string "1 1 1" appears directly after the language model feature in
>>>>>> line 1694, in your attachment, adding the extra 3 features. Note that
>>>>>> this is not the line you mentioned in your earlier email.
>>>>>>
>>>>>> I have no idea why there are extra features. Have you made changes to
>>>>>> any of the core Moses features?
>>>>>>
>>>>>> best wishes
>>>>>> Barry
>>>>>>
>>>>>> The offending line:
>>>>>> what(): Error in line "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39
>>>>>> 18 -26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 WT_“~“=1
>>>>>> WT_”~”=1 WT_曰~说=1 PL_s3=5 PL_3,2=2 PL_3,3=3 PL_2,3=4 PL_t3=7 PL_s1=5
>>>>>> PL_1,2=2 PL_1,1=3 PL_t1=4 PL_2,2=3 PL_t2=7 PL_s2=8 PL_2,1=1 WT_有~有=1
>>>>>> WT_!~!=1 WT_其~的=1 WT_其~他=1 WT_不~也=1 WT_不~没=1 WT_而~而=1
>>>>>> WT_而~却=1 WT_祖逖~逖=1 WT_祖逖~祖=1 WT_逖~祖=1 WT_逖~逖=1 WT_大~大江=1
>>>>>> WT_者~的=1 WT_者~人=1 WT_江~大江=1 WT_渡~渡过=1 WT_复~又=1 WT_余~有=1
>>>>>> WT_誓~发誓=1 WT_楫~木=1 WT_江~长江=1 WT_击~击=1 WT_将~带领=1
>>>>>> WT_济~成功=1 WT_中原~中原=1 WT_清~廓清=1 WT_如~像=1 WT_楫~戢=1
>>>>>> WT_能~能=1 WT_中~中流=1 WT_流~中流=1 WT_部曲~部下=1 " of ...
>>>>>>
>>>>>> On 18/01/16 10:37, Dingyuan Wang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've attached that. The line number is 1694.
>>>>>>>
>>>>>>> On 2016-01-18 16:43, Barry Haddow wrote:
>>>>>>>> Hi Dingyuan
>>>>>>>>
>>>>>>>> Is it possible to attach the features.dat file that is causing the
>>>>>>>> error? Almost certainly Moses is failing to parse the line because of
>>>>>>>> the Asian characters in the feature names.
>>>>>>>>
>>>>>>>> cheers - Barry
>>>>>>>>
>>>>>>>> On 16/01/16 15:58, Dingyuan Wang wrote:
>>>>>>>>> I ran
>>>>>>>>>
>>>>>>>>> ~/software/moses/bin/kbmira -J 75 --dense-init run7.dense
>>>>>>>>> --sparse-init run7.sparse-weights
>>>>>>>>> --ffile run1.features.dat --ffile run2.features.dat
>>>>>>>>> --ffile run3.features.dat --ffile run4.features.dat
>>>>>>>>> --ffile run5.features.dat --ffile run6.features.dat
>>>>>>>>> --ffile run7.features.dat
>>>>>>>>> --scfile run1.scores.dat --scfile run2.scores.dat
>>>>>>>>> --scfile run3.scores.dat --scfile run4.scores.dat
>>>>>>>>> --scfile run5.scores.dat --scfile run6.scores.dat
>>>>>>>>> --scfile run7.scores.dat -o /tmp/mert.out
>>>>>>>>>
>>>>>>>>> in the tuning/tmp.1 directory, which will certainly replicate the
>>>>>>>>> error.
>>>>>>>>>
>>>>>>>>> On 2016-01-16 23:42, Hieu Hoang wrote:
>>>>>>>>>> The mert script prints out every command it runs. You should be
>>>>>>>>>> able to replicate the error by running the last command.
>>>>>>>>>>
>>>>>>>>>> On 16 Jan 2016 14:18, "Dingyuan Wang" <abcdoyle...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>     Sorry, but I can't reliably replicate the same problem when
>>>>>>>>>>     running TUNING_tune.1 alone. There is no character '_' in the
>>>>>>>>>>     test set or top50 list.
>>>>>>>>>>
>>>>>>>>>>     I'm using sparse-features = "target-word-insertion top 50,
>>>>>>>>>>     source-word-deletion top 50, word-translation top 50 50,
>>>>>>>>>>     phrase-length"
>>>>>>>>>>
>>>>>>>>>>     I've attached some related files from EMS and the EMS config.
>>>>>>>>>>
>>>>>>>>>>     https://mega.nz/#!xs0SFKxL!M_RTBp1JGX24-b4xlYYLP-bLXKiC_Sl-p96x55avAB4
>>>>>>>>>>
>>>>>>>>>>     On 2016-01-16 02:45, Hieu Hoang wrote:
>>>>>>>>>>     > could you make your model files available for download so I
>>>>>>>>>>     > can replicate this problem.
>>>>>>>>>>     >
>>>>>>>>>>     > it seems like you're using a feature function with sparse
>>>>>>>>>>     > scores. I think the character '_' must be escaped.
>>>>>>>>>>     >
>>>>>>>>>>     > On 12/01/16 04:00, Dingyuan Wang wrote:
>>>>>>>>>>     >> Hi all,
>>>>>>>>>>     >>
>>>>>>>>>>     >> I'm using EMS for doing experiments. Every time, kbmira
>>>>>>>>>>     >> died with SIGABRT when tuning in one direction, while tuning
>>>>>>>>>>     >> in the opposite direction (same config and test set) was
>>>>>>>>>>     >> successful.
>>>>>>>>>>     >>
>>>>>>>>>>     >> The mert.log (stderr) shows the following:
>>>>>>>>>>     >>
>>>>>>>>>>     >> kbmira with c=0.01 decay=0.999 no_shuffle=0
>>>>>>>>>>     >> Initialising random seed from system clock
>>>>>>>>>>     >> Found 15323 initial sparse features
>>>>>>>>>>     >> ....terminate called after throwing an instance of
>>>>>>>>>>     >> 'MosesTuning::FileFormatException'
>>>>>>>>>>     >> what(): Error in line "-4.51933 0 0 -6.09733 0 0 0 -121.556 2
>>>>>>>>>>     >> -20 12 -31.6201 -38.5211 -26.5112 -60.6166 WT_,~,=2 WT_?~?=1
>>>>>>>>>>     >> PL_s1=4 PL_s3=1 PL_3,3=1 PL_2,2=3 PL_1,2=1 PL_2,1=3 PL_t1=6
>>>>>>>>>>     >> PL_t2=4 PL_t3=2 PL_2,3=1 PL_s2=7 PL_1,1=3 WT_未~没有=1
>>>>>>>>>>     >> WT_何~怎么=1 WT_何~能=1 WT_方~正在=1 WT_又~还=1 WT_君~您=2
>>>>>>>>>>     >> WT_趣~向=1 WT_趣~奔=1 WT_有~没有=1 WT_往~去=1 WT_官~官员=1
>>>>>>>>>>     >> WT_假~借=1 WT_檄~檄文=1 WT_文~文告=1 WT_上~上级=1 WT_为~呢=1
>>>>>>>>>>     >> WT_在~正在=1 " of run7.features.dat
>>>>>>>>>>     >> Aborted
>>>>>>>>>>     >>
>>>>>>>>>>     >> I think that since run7.scores.dat is generated by scripts,
>>>>>>>>>>     >> I am not responsible for the bad format. Last time it also
>>>>>>>>>>     >> died; I removed the likely offending line from the test set,
>>>>>>>>>>     >> but this time another line appears.
>>>>>>>>>>     >>
>>>>>>>>>>     >> --
>>>>>>>>>>     >> Dingyuan Wang
>>>>>>>>>>     >> _______________________________________________
>>>>>>>>>>     >> Moses-support mailing list
>>>>>>>>>>     >> Moses-support@mit.edu
>>>>>>>>>>     >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>     >
>>>>>>>>>>
>>>>>>>>>>     --
>>>>>>>>>>     Dingyuan Wang (gumblex)
>>>>>>>>>>

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.