Hi Barry,

I've uploaded the model:
https://mega.nz/#!UsVSBCBJ!e5IATFvLqrCb5zhmDekLn8NOGw4PSD9RRQLGQeKEvNY

To test the model, I included a script, 'repeatnbest.sh', which runs moses
repeatedly until an encoding error occurs.

The files run7.best100.out and run7.out in the archive are from the last
run, the one that produced the error.
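
A minimal sketch of one way to check an output file such as
run7.best100.out for lines that are not valid UTF-8. It is not the
repeatnbest.sh script from the archive; the function name
find_invalid_utf8_lines is made up for illustration.

```python
import sys


def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset) pairs for lines that fail UTF-8 decoding."""
    bad = []
    # Read raw bytes so that a decoding failure can be pinned to a single line.
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset, within this line, where decoding failed.
                bad.append((lineno, err.start))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, offset in find_invalid_utf8_lines(sys.argv[1]):
        print(f"line {lineno}: invalid UTF-8 at byte {offset}")
```

Saved under any file name, it takes the path of the file to check as its
only argument.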

It seems that it is WordTranslationFeature that causes the problem.

On 2016-01-19 00:03, Barry Haddow wrote:
> Hi Dingyuan
> 
> Something is going wrong with the construction or outputting of feature
> names, and it looks like it's WordTranslationFeature that's the problem.
> Does the problem go away if you do not use word translation features?
> 
> If you could make available a model that reproduces the nbest list
> construction then I would have a chance to debug it,
> 
> cheers - Barry
> 
> On 18/01/16 15:32, Dingyuan Wang wrote:
>> Hi Barry,
>>
>> I've checked all the models and corpora with the script, without finding
>> any encoding problem.
>>
>> I also found that all such errors in the nbest list occur only in the
>> feature list (3 different samples) and do not affect the translation
>> result. Therefore, the phrase table or training corpus may not be the
>> problem.
>>
>> On 2016-01-18 23:04, Barry Haddow wrote:
>>> Hi Dingyuan
>>>
>>> Are these encoding errors present in your phrase table? Are they present
>>> in your training corpus? Since they appear in the word translation
>>> features, and you are using a shortlist, are they in the shortlist files
>>> in the model directory? (These have names with "topn" in them afaik).
>>>
>>> File-system errors are unlikely, and for the most part Moses treats text
>>> as byte strings so encoding errors usually trace back to the source
>>> text.
>>>
>>> cheers - Barry
>>>
>>> On 18/01/16 14:56, Dingyuan Wang wrote:
>>>> Hi Barry,
>>>>
>>>> The names starting with the "@" are due to corrupted bytes in the nbest
>>>> list.
>>>>
>>>> This kind of corruption occurs from time to time. I wonder if it comes
>>>> from memory errors or filesystem failure or some kind of
>>>> pointer/encoding problem in moses.
>>>>
>>>> I've written a script to find such corrupted lines:
>>>>
>>>> https://gist.github.com/gumblex/0d9d0848b435e4f9818f
>>>>
>>>> On 2016-01-18 20:42, Barry Haddow wrote:
>>>>> Hi Dingyuan
>>>>>
>>>>> The extractor expects feature names to contain an underscore (not sure
>>>>> exactly why) but some of yours don't, and Moses skips them,
>>>>> interpreting their values as extra dense features.
>>>>>
>>>>> The attached screenshot shows my view of the offending names. The ones
>>>>> starting with the "@" are the problem. So it does look like the nbest
>>>>> list is corrupted. Can you run the decoder on just that sentence, to
>>>>> create an uncompressed version of the nbest list?
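
As a rough illustration of the underscore check described above, here is a
small sketch. It is not the extractor's actual code; the function name
names_without_underscore and the corrupted sample name "@~说" are
hypothetical, while the other two names come from the error output in this
thread.

```python
def names_without_underscore(names):
    """Return the feature names that contain no underscore."""
    return [name for name in names if "_" not in name]


# Two well-formed names from the thread plus one invented corrupted name.
names = ["WT_曰~说", "PL_s3", "@~说"]
print(names_without_underscore(names))  # prints ['@~说']
```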
>>>>>
>>>>> cheers - Barry
>>>>>
>>>>> On 18/01/16 12:02, Dingyuan Wang wrote:
>>>>>> Hi Barry,
>>>>>>
>>>>>> Attached is the zgrep result.
>>>>>> I found that a few bytes in the middle of line 61 are corrupted. Is
>>>>>> that a Moses problem, or is my memory faulty?
>>>>>>
>>>>>> I also checked the other files using iconv; they are all valid UTF-8.
>>>>>>
>>>>>> On 2016-01-18 19:32, Barry Haddow wrote:
>>>>>>> Hi Dingyuan
>>>>>>>
>>>>>>> Yes, that's very possible. The error could be in extracting
>>>>>>> features.dat from the nbest list. Are you able to post the nbest
>>>>>>> list? Or at least the entries for sentence 16?
>>>>>>>
>>>>>>> Run something like
>>>>>>>
>>>>>>> zgrep "^16 " tuning/tmp.1/run7.best100.out.gz
>>>>>>>
>>>>>>> cheers - Barry
>>>>>>>
>>>>>>> On 18/01/16 11:24, Dingyuan Wang wrote:
>>>>>>>> Hi Barry,
>>>>>>>>
>>>>>>>> I reran the EMS after the first email and then posted the most
>>>>>>>> recent results, so the line changed.
>>>>>>>>
>>>>>>>> I just use the latest code and the EMS script, with pretty much
>>>>>>>> default settings. The EMS setting is:
>>>>>>>>
>>>>>>>> sparse-features = "target-word-insertion top 50, source-word-deletion top 50, word-translation top 50 50, phrase-length"
>>>>>>>>
>>>>>>>> I suspect there is something unexpected in the extractor.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2016-01-18 19:03, Barry Haddow wrote:
>>>>>>>>> Hi Dingyuan
>>>>>>>>>
>>>>>>>>> In fact it is neither the sparse features nor the Asian characters
>>>>>>>>> that are the problem. The offending line has 17 dense features, yet
>>>>>>>>> your model has 14 dense features.
>>>>>>>>>
>>>>>>>>> The string "1 1 1" appears directly after the language model
>>>>>>>>> feature in line 1694, in your attachment, adding the extra 3
>>>>>>>>> features. Note that this is not the line you mentioned in your
>>>>>>>>> earlier email.
>>>>>>>>>
>>>>>>>>> I have no idea why there are extra features. Have you made changes
>>>>>>>>> to any of the core Moses features?
>>>>>>>>>
>>>>>>>>> best wishes
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>>> The offending line:
>>>>>>>>> what():  Error in line "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39 18
>>>>>>>>> -26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 WT_“~“=1 WT_”~”=1
>>>>>>>>> WT_曰~说=1 PL_s3=5 PL_3,2=2 PL_3,3=3 PL_2,3=4 PL_t3=7 PL_s1=5 PL_1,2=2
>>>>>>>>> PL_1,1=3 PL_t1=4 PL_2,2=3 PL_t2=7 PL_s2=8 PL_2,1=1 WT_有~有=1 WT_!~!=1
>>>>>>>>> WT_其~的=1 WT_其~他=1 WT_不~也=1 WT_不~没=1 WT_而~而=1 WT_而~却=1
>>>>>>>>> WT_祖逖~逖=1 WT_祖逖~祖=1 WT_逖~祖=1 WT_逖~逖=1 WT_大~大江=1 WT_者~的=1
>>>>>>>>> WT_者~人=1 WT_江~大江=1 WT_渡~渡过=1 WT_复~又=1 WT_余~有=1 WT_誓~发誓=1
>>>>>>>>> WT_楫~木=1 WT_江~长江=1 WT_击~击=1 WT_将~带领=1 WT_济~成功=1 WT_中原~中原=1
>>>>>>>>> WT_清~廓清=1 WT_如~像=1 WT_楫~戢=1 WT_能~能=1 WT_中~中流=1 WT_流~中流=1
>>>>>>>>> WT_部曲~部下=1 " of ...
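
A small sketch of how the dense values in such a line can be counted,
assuming the layout seen in the error message: whitespace-separated dense
values followed by NAME=VALUE sparse entries. This is not kbmira's parser;
split_features and EXPECTED_DENSE are names invented here, and the sample
is a shortened copy of the offending line.

```python
EXPECTED_DENSE = 14  # dense feature count reported for this model in the thread


def split_features(line):
    """Split a features.dat-style line into dense values, sparse pairs and leftovers."""
    dense, sparse, malformed = [], [], []
    for token in line.split():
        # Split on the last '=' so a name that itself contains '=' stays whole.
        name, sep, value = token.rpartition("=")
        if sep:
            sparse.append((name, value))
            continue
        try:
            dense.append(float(token))
        except ValueError:
            # Neither NAME=VALUE nor a number, for example a corrupted fragment.
            malformed.append(token)
    return dense, sparse, malformed


line = ("-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39 18 "
        "-26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 PL_s3=5")
dense, sparse, malformed = split_features(line)
print(len(dense), len(sparse), malformed)  # prints: 17 3 []
```

The count of 17 against the expected 14 matches the three extra values
noted above.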
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 18/01/16 10:37, Dingyuan Wang wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've attached that. The line number is 1694.
>>>>>>>>>>
>>>>>>>>>> On 2016-01-18 16:43, Barry Haddow wrote:
>>>>>>>>>>> Hi Dingyuan
>>>>>>>>>>>
>>>>>>>>>>> Is it possible to attach the features.dat file that is causing the
>>>>>>>>>>> error? Almost certainly Moses is failing to parse the line because
>>>>>>>>>>> of the Asian characters in the feature names,
>>>>>>>>>>>
>>>>>>>>>>> cheers - Barry
>>>>>>>>>>>
>>>>>>>>>>> On 16/01/16 15:58, Dingyuan Wang wrote:
>>>>>>>>>>>> I ran
>>>>>>>>>>>>
>>>>>>>>>>>> ~/software/moses/bin/kbmira -J 75 --dense-init run7.dense \
>>>>>>>>>>>>   --sparse-init run7.sparse-weights \
>>>>>>>>>>>>   --ffile run1.features.dat --ffile run2.features.dat \
>>>>>>>>>>>>   --ffile run3.features.dat --ffile run4.features.dat \
>>>>>>>>>>>>   --ffile run5.features.dat --ffile run6.features.dat \
>>>>>>>>>>>>   --ffile run7.features.dat \
>>>>>>>>>>>>   --scfile run1.scores.dat --scfile run2.scores.dat \
>>>>>>>>>>>>   --scfile run3.scores.dat --scfile run4.scores.dat \
>>>>>>>>>>>>   --scfile run5.scores.dat --scfile run6.scores.dat \
>>>>>>>>>>>>   --scfile run7.scores.dat -o /tmp/mert.out
>>>>>>>>>>>>
>>>>>>>>>>>> in the tuning/tmp.1 directory, which will certainly
>>>>>>>>>>>> replicate the
>>>>>>>>>>>> error.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-01-16 23:42, Hieu Hoang wrote:
>>>>>>>>>>>>> The mert script prints out every command it runs. You should be
>>>>>>>>>>>>> able to replicate the error by running the last command.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 16 Jan 2016 14:18, "Dingyuan Wang" <abcdoyle...@gmail.com
>>>>>>>>>>>>> <mailto:abcdoyle...@gmail.com>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>           Sorry, but I can't reliably replicate the same problem when
>>>>>>>>>>>>>           running TUNING_tune.1 alone. There is no character '_' in
>>>>>>>>>>>>>           the test set or top50 list.
>>>>>>>>>>>>>
>>>>>>>>>>>>>           I'm using sparse-features = "target-word-insertion top 50,
>>>>>>>>>>>>>           source-word-deletion top 50, word-translation top 50 50,
>>>>>>>>>>>>>           phrase-length"
>>>>>>>>>>>>>
>>>>>>>>>>>>>           I've attached some related files from EMS and the EMS
>>>>>>>>>>>>>           config.
>>>>>>>>>>>>>
>>>>>>>>>>>>>           https://mega.nz/#!xs0SFKxL!M_RTBp1JGX24-b4xlYYLP-bLXKiC_Sl-p96x55avAB4
>>>>>>>>>>>>>
>>>>>>>>>>>>>           On 2016-01-16 02:45, Hieu Hoang wrote:
>>>>>>>>>>>>>           > Could you make your model files available for download so
>>>>>>>>>>>>>           > I can replicate this problem?
>>>>>>>>>>>>>           >
>>>>>>>>>>>>>           > It seems like you're using a feature function with sparse
>>>>>>>>>>>>>           > scores. I think the character '_' must be escaped.
>>>>>>>>>>>>>           >
>>>>>>>>>>>>>           >
>>>>>>>>>>>>>           > On 12/01/16 04:00, Dingyuan Wang wrote:
>>>>>>>>>>>>>           >> Hi all,
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >> I'm using EMS for doing experiments. Every time, kbmira
>>>>>>>>>>>>>           >> died with SIGABRT when tuning in one direction, while
>>>>>>>>>>>>>           >> tuning in the opposite direction (same config and test
>>>>>>>>>>>>>           >> set) was successful.
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >> The mert.log (stderr) shows the following:
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >> kbmira with c=0.01 decay=0.999 no_shuffle=0
>>>>>>>>>>>>>           >> Initialising random seed from system clock
>>>>>>>>>>>>>           >> Found 15323 initial sparse features
>>>>>>>>>>>>>           >> ....terminate called after throwing an instance of
>>>>>>>>>>>>>           >> 'MosesTuning::FileFormatException'
>>>>>>>>>>>>>           >>    what():  Error in line "-4.51933 0 0 -6.09733 0 0 0 -121.556 2 -20 12
>>>>>>>>>>>>>           >> -31.6201 -38.5211 -26.5112 -60.6166 WT_,~,=2 WT_?~?=1 PL_s1=4 PL_s3=1
>>>>>>>>>>>>>           >> PL_3,3=1 PL_2,2=3 PL_1,2=1 PL_2,1=3 PL_t1=6 PL_t2=4 PL_t3=2 PL_2,3=1
>>>>>>>>>>>>>           >> PL_s2=7 PL_1,1=3 WT_未~没有=1 WT_何~怎么=1 WT_何~能=1 WT_方~正在=1
>>>>>>>>>>>>>           >> WT_又~还=1 WT_君~您=2 WT_趣~向=1 WT_趣~奔=1 WT_有~没有=1 WT_往~去=1
>>>>>>>>>>>>>           >> WT_官~官员=1 WT_假~借=1 WT_檄~檄文=1 WT_文~文告=1 WT_上~上级=1
>>>>>>>>>>>>>           >> WT_为~呢=1 WT_在~正在=1 " of run7.features.dat
>>>>>>>>>>>>>           >> Aborted
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >> I think that since run7.scores.dat is generated by some
>>>>>>>>>>>>>           >> scripts, I am not responsible for the bad format. The
>>>>>>>>>>>>>           >> last time it died, I removed the likely offending line
>>>>>>>>>>>>>           >> from the test set, but this time another line appears.
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>>           >> --
>>>>>>>>>>>>>           >> Dingyuan Wang
>>>>>>>>>>>>>           >> _______________________________________________
>>>>>>>>>>>>>           >> Moses-support mailing list
>>>>>>>>>>>>>           >> Moses-support@mit.edu
>>>>>>>>>>>>> <mailto:Moses-support@mit.edu>
>>>>>>>>>>>>>           >>
>>>>>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>>>>           >
>>>>>>>>>>>>>
>>>>>>>>>>>>>           --
>>>>>>>>>>>>>           Dingyuan Wang (gumblex)
>>>>>>>>>>>>>
>>>
> 
> 

-- 
Dingyuan Wang (gumblex)
_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
