Hi Barry,

On my laptop it usually hits an error within about 1 to 10 iterations. I don't know what triggers it, so the failure appears to be nondeterministic.
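Because the corruption only shows up in some runs, each n-best list has to be checked after the run. Here is a minimal sketch of such a check in Python. It is not the script from the gist linked further down the thread, and the function names `find_invalid_utf8` and `open_bytes` are made up for illustration. It assumes the usual " ||| "-separated n-best layout with the sentence id in the first field, and it only reports lines that are not valid UTF-8:

```python
import gzip
import sys


def find_invalid_utf8(lines):
    """Yield (line_number, sentence_id, byte_offset) for lines that are not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in binary mode.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # Assumption: the first " ||| "-separated field is the sentence id.
            sentence_id = raw.split(b" ||| ", 1)[0].decode("ascii", "replace")
            yield number, sentence_id, err.start


def open_bytes(path):
    """Open a plain or gzip-compressed file for reading raw bytes."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def main(paths):
    for path in paths:
        with open_bytes(path) as handle:
            for number, sentence_id, offset in find_invalid_utf8(handle):
                print(f"{path}:{number}: sentence {sentence_id}: "
                      f"invalid UTF-8 at byte {offset}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Reading the file as bytes rather than text matters here: a text-mode reader would either raise on the first bad line or silently replace the damaged bytes, hiding exactly the lines of interest. The script takes one or more n-best files as arguments, plain or gzipped.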
Disabling xml-input won't help. I think I should use verbose output.

My locale settings are:

LANG=zh_CN.UTF-8
LANGUAGE=zh_CN.UTF-8:zh_TW.UTF-8:zh_HK.utf8:en_US.utf8
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

On 2016-01-19 19:20, Barry Haddow wrote:
> Hi Dingyuan
>
> I have your script and model running, but so far it has not reported any
> errors. It's at iteration 27, and I'm using the latest Moses from git.
>
> How long should I expect it to run before it hits an error? Could it be
> affected by the locale setting?
>
> Have you tried running without xml-input to see if you still have the
> problem?
>
> cheers - Barry
>
> On 19/01/16 05:43, Dingyuan Wang wrote:
>> Hi Barry,
>>
>> I've uploaded the model:
>> https://mega.nz/#!UsVSBCBJ!e5IATFvLqrCb5zhmDekLn8NOGw4PSD9RRQLGQeKEvNY
>>
>> To test the model, I included a script 'repeatnbest.sh' which runs moses
>> repeatedly until an encoding error occurs.
>>
>> The files run7.best100.out and run7.out in the archive are from the last
>> run that produced the error.
>>
>> It seems that it is WordTranslationFeature that causes the problem.
>>
>> On 2016-01-19 00:03, Barry Haddow wrote:
>>> Hi Dingyuan
>>>
>>> Something is going wrong with the construction or outputting of feature
>>> names, and it looks like it's WordTranslationFeature that's the problem.
>>> Does the problem go away if you do not use word translation features?
>>>
>>> If you could make available a model that reproduces the nbest list
>>> construction then I would have a chance to debug it,
>>>
>>> cheers - Barry
>>>
>>> On 18/01/16 15:32, Dingyuan Wang wrote:
>>>> Hi Barry,
>>>>
>>>> I've checked all the models and corpora with the script, without finding
>>>> any encoding problem.
>>>>
>>>> I also find that all such errors in the nbest list occur only in the
>>>> feature list (3 different samples), without affecting the translation
>>>> result. Therefore, the phrase table or training corpus may not be the
>>>> problem.
>>>>
>>>> On 2016-01-18 23:04, Barry Haddow wrote:
>>>>> Hi Dingyuan
>>>>>
>>>>> Are these encoding errors present in your phrase table? Are they present
>>>>> in your training corpus? Since they appear in the word translation
>>>>> features, and you are using a shortlist, are they in the shortlist files
>>>>> in the model directory? (These have names with "topn" in them afaik).
>>>>>
>>>>> File-system errors are unlikely, and for the most part Moses treats text
>>>>> as byte strings so encoding errors usually trace back to the source
>>>>> text.
>>>>>
>>>>> cheers - Barry
>>>>>
>>>>> On 18/01/16 14:56, Dingyuan Wang wrote:
>>>>>> Hi Barry,
>>>>>>
>>>>>> "The ones starting with the "@"" are due to corrupted bytes in the nbest
>>>>>> list.
>>>>>>
>>>>>> This kind of corruption occurs from time to time. I wonder if it comes
>>>>>> from memory errors or filesystem failure or some kind of
>>>>>> pointer/encoding problem in moses.
>>>>>>
>>>>>> I've written a script to find such corrupted lines:
>>>>>>
>>>>>> https://gist.github.com/gumblex/0d9d0848b435e4f9818f
>>>>>>
>>>>>> On 2016-01-18 20:42, Barry Haddow wrote:
>>>>>>> Hi Dingyuan
>>>>>>>
>>>>>>> The extractor expects feature names to contain an underscore (not sure
>>>>>>> exactly why) but some of yours don't, and Moses skips them, interpreting
>>>>>>> their values as extra dense features.
>>>>>>>
>>>>>>> The attached screenshot shows my view of the offending names. The ones
>>>>>>> starting with the "@" are the problem. So it does look like the nbest
>>>>>>> list is corrupted. Can you run the decoder on just that sentence, to
>>>>>>> create an uncompressed version of the nbest list?
>>>>>>>
>>>>>>> cheers - Barry
>>>>>>>
>>>>>>> On 18/01/16 12:02, Dingyuan Wang wrote:
>>>>>>>> Hi Barry,
>>>>>>>>
>>>>>>>> Attached is the zgrep result.
>>>>>>>> I found that in the middle of line 61 a few bytes are corrupted. Is that
>>>>>>>> a Moses problem, or does my memory have a problem?
>>>>>>>>
>>>>>>>> I also checked other files using iconv; they are all OK in UTF-8.
>>>>>>>>
>>>>>>>> On 2016-01-18 19:32, Barry Haddow wrote:
>>>>>>>>> Hi Dingyuan
>>>>>>>>>
>>>>>>>>> Yes, that's very possible. The error could be in extracting features.dat
>>>>>>>>> from the nbest list. Are you able to post the nbest list? Or at least
>>>>>>>>> the entries for sentence 16?
>>>>>>>>>
>>>>>>>>> Run something like
>>>>>>>>>
>>>>>>>>> zgrep "^16 " tuning/tmp.1/run7.best100.out.gz
>>>>>>>>>
>>>>>>>>> cheers - Barry
>>>>>>>>>
>>>>>>>>> On 18/01/16 11:24, Dingyuan Wang wrote:
>>>>>>>>>> Hi Barry,
>>>>>>>>>>
>>>>>>>>>> I have rerun the EMS after the first email, and then posted the recent
>>>>>>>>>> results, so the line changed.
>>>>>>>>>>
>>>>>>>>>> I just use the latest code and the EMS script. Pretty much everything
>>>>>>>>>> is at default settings. The EMS setting is:
>>>>>>>>>>
>>>>>>>>>> sparse-features = "target-word-insertion top 50, source-word-deletion
>>>>>>>>>> top 50, word-translation top 50 50, phrase-length"
>>>>>>>>>>
>>>>>>>>>> I suspect there is something unexpected in the extractor.
>>>>>>>>>>
>>>>>>>>>> On 2016-01-18 19:03, Barry Haddow wrote:
>>>>>>>>>>> Hi Dingyuan
>>>>>>>>>>>
>>>>>>>>>>> In fact it is not the sparse features nor the Asian characters that are
>>>>>>>>>>> the problem. The offending line has 17 dense features, yet your model
>>>>>>>>>>> has 14 dense features.
>>>>>>>>>>>
>>>>>>>>>>> The string "1 1 1" appears directly after the language model feature in
>>>>>>>>>>> line 1694, in your attachment, adding the extra 3 features.
>>>>>>>>>>> Note that this is not the line you mentioned in your earlier email.
>>>>>>>>>>>
>>>>>>>>>>> I have no idea why there are extra features. Have you made changes to
>>>>>>>>>>> any of the core Moses features?
>>>>>>>>>>>
>>>>>>>>>>> best wishes
>>>>>>>>>>> Barry
>>>>>>>>>>>
>>>>>>>>>>> The offending line:
>>>>>>>>>>> what(): Error in line "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39
>>>>>>>>>>> 18 -26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 WT_“~“=1
>>>>>>>>>>> WT_”~”=1 WT_曰~说=1 PL_s3=5 PL_3,2=2 PL_3,3=3 PL_2,3=4 PL_t3=7 PL_s1=5
>>>>>>>>>>> PL_1,2=2 PL_1,1=3 PL_t1=4 PL_2,2=3 PL_t2=7 PL_s2=8 PL_2,1=1 WT_有~有=1
>>>>>>>>>>> WT_!~!=1 WT_其~的=1 WT_其~他=1 WT_不~也=1 WT_不~没=1 WT_而~而=1
>>>>>>>>>>> WT_而~却=1 WT_祖逖~逖=1 WT_祖逖~祖=1 WT_逖~祖=1 WT_逖~逖=1 WT_大~大江=1
>>>>>>>>>>> WT_者~的=1 WT_者~人=1 WT_江~大江=1 WT_渡~渡过=1 WT_复~又=1 WT_余~有=1
>>>>>>>>>>> WT_誓~发誓=1 WT_楫~木=1 WT_江~长江=1 WT_击~击=1 WT_将~带领=1
>>>>>>>>>>> WT_济~成功=1 WT_中原~中原=1 WT_清~廓清=1 WT_如~像=1 WT_楫~戢=1
>>>>>>>>>>> WT_能~能=1 WT_中~中流=1 WT_流~中流=1 WT_部曲~部下=1 " of ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 18/01/16 10:37, Dingyuan Wang wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've attached that. The line number is 1694.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-01-18 16:43, Barry Haddow wrote:
>>>>>>>>>>>>> Hi Dingyuan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is it possible to attach the features.dat file that is causing the
>>>>>>>>>>>>> error?
>>>>>>>>>>>>> Almost certainly Moses is failing to parse the line because of
>>>>>>>>>>>>> the Asian characters in the feature names,
>>>>>>>>>>>>>
>>>>>>>>>>>>> cheers - Barry
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 16/01/16 15:58, Dingyuan Wang wrote:
>>>>>>>>>>>>>> I ran
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ~/software/moses/bin/kbmira -J 75 --dense-init run7.dense
>>>>>>>>>>>>>> --sparse-init run7.sparse-weights
>>>>>>>>>>>>>> --ffile run1.features.dat --ffile run2.features.dat
>>>>>>>>>>>>>> --ffile run3.features.dat --ffile run4.features.dat
>>>>>>>>>>>>>> --ffile run5.features.dat --ffile run6.features.dat
>>>>>>>>>>>>>> --ffile run7.features.dat
>>>>>>>>>>>>>> --scfile run1.scores.dat --scfile run2.scores.dat
>>>>>>>>>>>>>> --scfile run3.scores.dat --scfile run4.scores.dat
>>>>>>>>>>>>>> --scfile run5.scores.dat --scfile run6.scores.dat
>>>>>>>>>>>>>> --scfile run7.scores.dat -o /tmp/mert.out
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> in the tuning/tmp.1 directory, which will certainly replicate the
>>>>>>>>>>>>>> error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-01-16 23:42, Hieu Hoang wrote:
>>>>>>>>>>>>>>> The mert script prints out every command it runs. You should be
>>>>>>>>>>>>>>> able to replicate the error by running the last command
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 16 Jan 2016 14:18, "Dingyuan Wang" wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     Sorry, but I can't reliably replicate the same problem when
>>>>>>>>>>>>>>>     running TUNING_tune.1 alone. There is no character '_' in the
>>>>>>>>>>>>>>>     test set or top50 list.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     I'm using sparse-features = "target-word-insertion top 50,
>>>>>>>>>>>>>>>     source-word-deletion top 50, word-translation top 50 50,
>>>>>>>>>>>>>>>     phrase-length"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     I've attached some related files from EMS and the EMS config.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     https://mega.nz/#!xs0SFKxL!M_RTBp1JGX24-b4xlYYLP-bLXKiC_Sl-p96x55avAB4
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     On 2016-01-16 02:45, Hieu Hoang wrote:
>>>>>>>>>>>>>>>     > could you make your model files available for download so I
>>>>>>>>>>>>>>>     > can replicate this problem.
>>>>>>>>>>>>>>>     >
>>>>>>>>>>>>>>>     > it seems like you're using a feature function with sparse
>>>>>>>>>>>>>>>     > scores. I think the character '_' must be escaped.
>>>>>>>>>>>>>>>     >
>>>>>>>>>>>>>>>     >
>>>>>>>>>>>>>>>     > On 12/01/16 04:00, Dingyuan Wang wrote:
>>>>>>>>>>>>>>>     >> Hi all,
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> I'm using EMS for doing experiments. Every time the kbmira
>>>>>>>>>>>>>>>     >> died with SIGABRT when tuning on one direction, while tuning
>>>>>>>>>>>>>>>     >> on the opposite direction (same config and test set) was
>>>>>>>>>>>>>>>     >> successful.
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> The mert.log (stderr) shows the following:
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> kbmira with c=0.01 decay=0.999 no_shuffle=0
>>>>>>>>>>>>>>>     >> Initialising random seed from system clock
>>>>>>>>>>>>>>>     >> Found 15323 initial sparse features
>>>>>>>>>>>>>>>     >> ....terminate called after throwing an instance of
>>>>>>>>>>>>>>>     >> 'MosesTuning::FileFormatException'
>>>>>>>>>>>>>>>     >> what(): Error in line "-4.51933 0 0 -6.09733 0 0 0 -121.556 2
>>>>>>>>>>>>>>>     >> -20 12 -31.6201 -38.5211 -26.5112 -60.6166 WT_,~,=2 WT_?~?=1
>>>>>>>>>>>>>>>     >> PL_s1=4 PL_s3=1 PL_3,3=1 PL_2,2=3 PL_1,2=1 PL_2,1=3 PL_t1=6
>>>>>>>>>>>>>>>     >> PL_t2=4 PL_t3=2 PL_2,3=1 PL_s2=7 PL_1,1=3 WT_未~没有=1
>>>>>>>>>>>>>>>     >> WT_何~怎么=1 WT_何~能=1 WT_方~正在=1 WT_又~还=1 WT_君~您=2
>>>>>>>>>>>>>>>     >> WT_趣~向=1 WT_趣~奔=1 WT_有~没有=1 WT_往~去=1 WT_官~官员=1
>>>>>>>>>>>>>>>     >> WT_假~借=1 WT_檄~檄文=1 WT_文~文告=1 WT_上~上级=1 WT_为~呢=1
>>>>>>>>>>>>>>>     >> WT_在~正在=1 " of run7.features.dat
>>>>>>>>>>>>>>>     >> Aborted
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> I think since run7.scores.dat is generated by some scripts, I
>>>>>>>>>>>>>>>     >> wouldn't be responsible for making the bad format. Last time
>>>>>>>>>>>>>>>     >> it also died, I removed the likely offending line in the test
>>>>>>>>>>>>>>>     >> set, but this time another line appears.
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> --
>>>>>>>>>>>>>>>     >> Dingyuan Wang
>>>>>>>>>>>>>>>     >>
>>>>>>>>>>>>>>>     >> _______________________________________________
>>>>>>>>>>>>>>>     >> Moses-support mailing list
>>>>>>>>>>>>>>>     >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>>>>>>     >
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     --
>>>>>>>>>>>>>>>     Dingyuan Wang (gumblex)
>>>>>>>>>>>>>>>

--
Dingyuan Wang (gumblex)
