Hi Dingyuan

Are these encoding errors present in your phrase table? Are they present 
in your training corpus? Since they appear in the word translation 
features, and you are using a shortlist, are they in the shortlist files 
in the model directory? (These have names with "topn" in them afaik).

File-system errors are unlikely, and for the most part Moses treats text 
as byte strings so encoding errors usually trace back to the source text.
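
As a rough way to run that check, here is a minimal Python sketch that reads
each file as raw bytes and reports the lines that fail to decode as UTF-8.
It is not the gist linked further down in this thread. The function names
(open_bytes, find_bad_utf8, scan_files) are made up for illustration, and
the only assumption about the data is that it is line-oriented text,
optionally gzipped.

```python
import gzip
import sys


def open_bytes(path):
    """Open a plain or gzip-compressed file for reading raw bytes."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def find_bad_utf8(lines):
    """Yield (line_number, byte_offset, raw_line) for lines that are not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened in binary mode.
    """
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte in the line.
            yield number, err.start, raw


def scan_files(paths):
    """Print one report line per invalid line found in each file."""
    for path in paths:
        with open_bytes(path) as handle:
            for number, offset, raw in find_bad_utf8(handle):
                print(f"{path}:{number}: invalid byte at offset {offset}: {raw[:80]!r}")


if __name__ == "__main__":
    # File paths are taken from the command line; nothing is scanned by default.
    scan_files(sys.argv[1:])
```

Saved under a name of your choice, it takes the files to check as
command-line arguments (phrase table, training corpus, the "topn" files).
If all of those come out clean but the nbest list does not, the corruption
is being introduced later in the pipeline.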

cheers - Barry

On 18/01/16 14:56, Dingyuan Wang wrote:
> Hi Barry,
>
> "The ones starting with the "@"" are due to corrupted bytes in the nbest
> list.
>
> This kind of corruption occurs from time to time. I wonder if it comes
> from memory errors or filesystem failure or some kind of
> pointer/encoding problem in moses.
>
> I've written a script to find such corrupted lines:
>
> https://gist.github.com/gumblex/0d9d0848b435e4f9818f
>
> On 2016-01-18 20:42, Barry Haddow wrote:
>> Hi Dingyuan
>>
>> The extractor expects feature names to contain an underscore (not sure
>> exactly why) but some of yours don't, and Moses skips them, interpreting
>> their values as extra dense features.
>>
>> The attached screenshot shows my view of the offending names. The ones
>> starting with the "@" are the problem. So it does look like the nbest
>> list is corrupted. Can you run the decoder on just that sentence, to
>> create an uncompressed version of the nbest list?
>>
>> cheers - Barry
>>
>> On 18/01/16 12:02, Dingyuan Wang wrote:
>>> Hi Barry,
>>>
>>> Attached is the zgrep result.
>>> I found that a few bytes in the middle of line 61 are corrupted. Is that
>>> a Moses problem, or does my memory have a problem?
>>>
>>> I also checked the other files using iconv; they are all valid UTF-8.
>>>
>>> On 2016-01-18 19:32, Barry Haddow wrote:
>>>> Hi Dingyuan
>>>>
>>>> Yes, that's very possible. The error could be in extracting features.dat
>>>> from the nbest list. Are you able to post the nbest list? Or at least
>>>> the entries for sentence 16?
>>>>
>>>> Run something like
>>>>
>>>> zgrep "^16 " tuning/tmp.1/run7.best100.out.gz
>>>>
>>>> cheers - Barry
>>>>
>>>> On 18/01/16 11:24, Dingyuan Wang wrote:
>>>>> Hi Barry,
>>>>>
>>>>> I reran EMS after the first email and then posted the more recent
>>>>> results, so the line number changed.
>>>>>
>>>>> I'm just using the latest code and the EMS script, with pretty much the
>>>>> default settings. The EMS setting is:
>>>>>
>>>>> sparse-features = "target-word-insertion top 50, source-word-deletion
>>>>> top 50, word-translation top 50 50, phrase-length"
>>>>>
>>>>> I suspect there is something unexpected in the extractor.
>>>>>
>>>>>
>>>>> On 2016-01-18 19:03, Barry Haddow wrote:
>>>>>> Hi Dingyuan
>>>>>>
>>>>>> In fact it is not the sparse features nor the Asian characters that are
>>>>>> the problem. The offending line has 17 dense features, yet your model
>>>>>> has 14 dense features.
>>>>>>
>>>>>> The string "1 1 1" appears directly after the language model feature in
>>>>>> line 1694, in your attachment, adding the extra 3 features. Note that
>>>>>> this is not the line you mentioned in your earlier email.
>>>>>>
>>>>>> I have no idea why there are extra features. Have you made changes to
>>>>>> any of the core Moses features?
>>>>>>
>>>>>> best wishes
>>>>>> Barry
>>>>>>
>>>>>> The offending line:
>>>>>> what():  Error in line "-5.44027 0 0 -5.34901 0 0 0 -224.872 1 1 1 -39
>>>>>> 18 -26.2331 -40.6736 -44.3698 -82.5072 WT_,~,=3 WT_:~:=1 WT_“~“=1
>>>>>> WT_”~”=1 WT_曰~说=1 PL_s3=5 PL_3,2=2 PL_3,3=3 PL_2,3=4 PL_t3=7 PL_s1=5
>>>>>> PL_1,2=2 PL_1,1=3 PL_t1=4 PL_2,2=3 PL_t2=7 PL_s2=8 PL_2,1=1 WT_有~有=1
>>>>>> WT_!~!=1 WT_其~的=1 WT_其~他=1 WT_不~也=1 WT_不~没=1 WT_而~而=1
>>>>>> WT_而~却=1 WT_祖逖~逖=1 WT_祖逖~祖=1 WT_逖~祖=1 WT_逖~逖=1 WT_大~大江=1
>>>>>> WT_者~的=1 WT_者~人=1 WT_江~大江=1 WT_渡~渡过=1 WT_复~又=1 WT_余~有=1
>>>>>> WT_誓~发誓=1 WT_楫~木=1 WT_江~长江=1 WT_击~击=1 WT_将~带领=1
>>>>>> WT_济~成功=1 WT_中原~中原=1 WT_清~廓清=1 WT_如~像=1 WT_楫~戢=1
>>>>>> WT_能~能=1 WT_中~中流=1 WT_流~中流=1 WT_部曲~部下=1 " of ...
>>>>>>
>>>>>>
>>>>>> On 18/01/16 10:37, Dingyuan Wang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've attached that. The line number is 1694.
>>>>>>>
>>>>>>> On 2016-01-18 16:43, Barry Haddow wrote:
>>>>>>>> Hi Dingyuan
>>>>>>>>
>>>>>>>> Is it possible to attach the features.dat file that is causing the
>>>>>>>> error? Almost certainly Moses is failing to parse the line because of
>>>>>>>> the Asian characters in the feature names.
>>>>>>>>
>>>>>>>> cheers - Barry
>>>>>>>>
>>>>>>>> On 16/01/16 15:58, Dingyuan Wang wrote:
>>>>>>>>> I ran
>>>>>>>>>
>>>>>>>>> ~/software/moses/bin/kbmira -J 75 --dense-init run7.dense \
>>>>>>>>>   --sparse-init run7.sparse-weights \
>>>>>>>>>   --ffile run1.features.dat --ffile run2.features.dat \
>>>>>>>>>   --ffile run3.features.dat --ffile run4.features.dat \
>>>>>>>>>   --ffile run5.features.dat --ffile run6.features.dat \
>>>>>>>>>   --ffile run7.features.dat \
>>>>>>>>>   --scfile run1.scores.dat --scfile run2.scores.dat \
>>>>>>>>>   --scfile run3.scores.dat --scfile run4.scores.dat \
>>>>>>>>>   --scfile run5.scores.dat --scfile run6.scores.dat \
>>>>>>>>>   --scfile run7.scores.dat -o /tmp/mert.out
>>>>>>>>>
>>>>>>>>> in the tuning/tmp.1 directory, which will certainly replicate the
>>>>>>>>> error.
>>>>>>>>>
>>>>>>>>> On 2016-01-16 23:42, Hieu Hoang wrote:
>>>>>>>>>> The mert script prints out every command it runs. You should be able to
>>>>>>>>>> replicate the error by running the last command.
>>>>>>>>>>
>>>>>>>>>> On 16 Jan 2016 14:18, "Dingyuan Wang" <abcdoyle...@gmail.com
>>>>>>>>>> <mailto:abcdoyle...@gmail.com>> wrote:
>>>>>>>>>>
>>>>>>>>>>          Sorry, but I can't reliably replicate the same problem when running
>>>>>>>>>>          TUNING_tune.1 alone. There is no character '_' in the test set or
>>>>>>>>>>          top50 list.
>>>>>>>>>>
>>>>>>>>>>          I'm using sparse-features = "target-word-insertion top 50,
>>>>>>>>>>          source-word-deletion top 50, word-translation top 50 50, phrase-length"
>>>>>>>>>>
>>>>>>>>>>          I've attached some related files from EMS and the EMS config.
>>>>>>>>>>
>>>>>>>>>>      
>>>>>>>>>> https://mega.nz/#!xs0SFKxL!M_RTBp1JGX24-b4xlYYLP-bLXKiC_Sl-p96x55avAB4
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>          On 2016-01-16 02:45, Hieu Hoang wrote:
>>>>>>>>>>          > Could you make your model files available for download so I can
>>>>>>>>>>          > replicate this problem?
>>>>>>>>>>          >
>>>>>>>>>>          > It seems like you're using a feature function with sparse scores. I
>>>>>>>>>>          > think the character '_' must be escaped.
>>>>>>>>>>          >
>>>>>>>>>>          >
>>>>>>>>>>          > On 12/01/16 04:00, Dingyuan Wang wrote:
>>>>>>>>>>          >> Hi all,
>>>>>>>>>>          >>
>>>>>>>>>>          >> I'm using EMS to run experiments. Every time, kbmira dies with
>>>>>>>>>>          >> SIGABRT when tuning in one direction, while tuning in the opposite
>>>>>>>>>>          >> direction (same config and test set) succeeds.
>>>>>>>>>>          >>
>>>>>>>>>>          >> The mert.log (stderr) shows the following:
>>>>>>>>>>          >>
>>>>>>>>>>          >>
>>>>>>>>>>          >> kbmira with c=0.01 decay=0.999 no_shuffle=0
>>>>>>>>>>          >> Initialising random seed from system clock
>>>>>>>>>>          >> Found 15323 initial sparse features
>>>>>>>>>>          >> ....terminate called after throwing an instance of
>>>>>>>>>>          >> 'MosesTuning::FileFormatException'
>>>>>>>>>>          >>    what():  Error in line "-4.51933 0 0 -6.09733 0 0 0 -121.556 2 -20 12
>>>>>>>>>>          >> -31.6201 -38.5211 -26.5112 -60.6166 WT_,~,=2 WT_?~?=1 PL_s1=4
>>>>>>>>>>          >> PL_s3=1 PL_3,3=1 PL_2,2=3 PL_1,2=1 PL_2,1=3 PL_t1=6 PL_t2=4 PL_t3=2
>>>>>>>>>>          >> PL_2,3=1 PL_s2=7 PL_1,1=3 WT_未~没有=1 WT_何~怎么=1 WT_何~能=1
>>>>>>>>>>          >> WT_方~正在=1 WT_又~还=1 WT_君~您=2 WT_趣~向=1 WT_趣~奔=1
>>>>>>>>>>          >> WT_有~没有=1 WT_往~去=1 WT_官~官员=1 WT_假~借=1 WT_檄~檄文=1
>>>>>>>>>>          >> WT_文~文告=1 WT_上~上级=1 WT_为~呢=1 WT_在~正在=1 " of run7.features.dat
>>>>>>>>>>          >> Aborted
>>>>>>>>>>          >>
>>>>>>>>>>          >>
>>>>>>>>>>          >> Since run7.scores.dat is generated by scripts, I don't think I am
>>>>>>>>>>          >> responsible for the bad format. The last time it died, I removed the
>>>>>>>>>>          >> likely offending line from the test set, but this time another line
>>>>>>>>>>          >> appears.
>>>>>>>>>>          >>
>>>>>>>>>>          >> --
>>>>>>>>>>          >> Dingyuan Wang
>>>>>>>>>>          >> _______________________________________________
>>>>>>>>>>          >> Moses-support mailing list
>>>>>>>>>>          >> Moses-support@mit.edu <mailto:Moses-support@mit.edu>
>>>>>>>>>>          >> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>          >
>>>>>>>>>>
>>>>>>>>>>          --
>>>>>>>>>>          Dingyuan Wang (gumblex)
>>>>>>>>>>
>>


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support