OK, I think the mystery is solved. The text version does not contain
alignment information. The standard algorithm for the compact phrase
table requires alignment information to work properly.
If alignments are not present, you should use the "-encoding None
-no-alignment-info" options (bigger, but still quite compact). This is even
mentioned in the documentation, but I think I should add a test to the
binarization tool that croaks if alignment data is missing. A test with
your phrase table and "-encoding None -no-alignment-info" works fine and
produces the correct translation now. This also explains why the
compression was so painfully slow, and the result is even smaller than the
incorrectly built one. You wrote that you used a version from July 2012 for
training; with a recent Moses version this issue would not have arisen.
Including alignments is now standard in the training scripts, so you
can use the standard procedure for compact binarization, which should
save roughly an additional 30%.
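The kind of check described above (failing early when alignment data is missing) could be sketched roughly as follows. This is only an illustration, not the binarization tool's actual code: it assumes the text phrase table uses "|||" as field separator and that alignments, when present, appear as a field consisting solely of "i-j" pairs such as "0-0 1-1". The function names are hypothetical.

```python
import re

# Assumption: an alignment field consists only of tokens like "0-0" or "2-1".
ALIGN_POINT = re.compile(r"^\d+-\d+$")


def has_alignment(line):
    """Return True if any field after source and target looks like alignments."""
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    for field in fields[2:]:
        tokens = field.split()
        if tokens and all(ALIGN_POINT.match(t) for t in tokens):
            return True
    return False


def count_missing_alignment(lines, limit=1000):
    """Count lines without an alignment field among the first `limit` lines."""
    missing = 0
    for i, line in enumerate(lines):
        if i >= limit:
            break
        if not has_alignment(line):
            missing += 1
    return missing


if __name__ == "__main__":
    sample = [
        "un ||| a ||| 0.5 0.3 0.2 0.1 ||| 0-0 ||| 10 12 3",
        "un ||| one ||| 0.1 0.2 0.3 0.4",
    ]
    print(count_missing_alignment(sample), "of", len(sample), "lines lack alignments")
```

For a real phrase table one would pass an open file handle (for example from gzip.open in text mode) instead of the sample list. Phrase pairs whose alignment field is legitimately empty would also be counted as missing, so a sketch like this is better used as a warning over the first lines than as a hard error.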
BTW: New moses is very verbose, is this on purpose?
Best,
Marcin
On 04.07.2013 22:01, Marcin Junczys-Dowmunt wrote:
The binary format in the main branch actually never changed from the
moment I released it. So it should not be an issue of binary
incompatibility. I am planning to add version numbers with the first
change in the binary format other than versioning itself :)
On 04.07.2013 21:56, Hieu Hoang wrote:
Do your binary files have version numbers embedded in them? I would
highly recommend they do.
kenlm has it, it's even human readable by doing
head -1
on any kenlm binary file. The decoder throws an error if run with an
incompatible version.
If
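A human-readable version header of the kind recommended above might look like the following sketch. The magic string and the function names are made up for illustration; this is not the format used by kenlm or by the compact phrase table.

```python
import io

# Hypothetical magic string; a real format would define its own.
MAGIC = b"example-pt format version "


def write_header(stream, version):
    """Write a one-line header that 'head -1' would display as plain text."""
    stream.write(MAGIC + str(version).encode("ascii") + b"\n")


def read_version(stream):
    """Read the first line and return the version number stored in it."""
    first_line = stream.readline()
    if not first_line.startswith(MAGIC):
        raise ValueError("not a recognised binary file")
    return int(first_line[len(MAGIC):].decode("ascii").strip())


def check_version(stream, supported):
    """Raise an error if the file's version differs from the supported one."""
    version = read_version(stream)
    if version != supported:
        raise ValueError(
            f"file has format version {version}, expected {supported}"
        )
    return version


if __name__ == "__main__":
    buf = io.BytesIO()
    write_header(buf, 1)
    buf.write(b"\x00\x01binary payload")
    buf.seek(0)
    print("version:", check_version(buf, supported=1))
```

Because the header is a single text line ahead of the binary payload, a loader can refuse an incompatible file before reading any of the data that follows.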
On 4 July 2013 20:52, Marcin Junczys-Dowmunt <[email protected]> wrote:
I had a similar issue a few days ago with a quite old
moses version; recompiling and rebuilding the phrase table seemed
to solve it, so I did not investigate. However, I am not quite
sure what I actually did to fix it. Currently I am building the
binary phrase table from the text version to compare. This will
take a while, more fun tomorrow.
On 04.07.2013 21:46, Hieu Hoang wrote:
It's a bit strange. Many words are unknown in the compact-pt
version, e.g. this one-word sentence is unknown:
un
could it be encoding issues? or the wrong phrase table was
binarized?
On 4 July 2013 18:14, Hieu Hoang <[email protected]> wrote:
You can download my version
http://statmt.org/~s0565741/download/alex/
I've also filtered the text phrase table so that it can run
On 4 July 2013 17:47, Marcin Junczys-Dowmunt <[email protected]> wrote:
Hi Alexander,
I am able to log in, but then it hangs indefinitely while
trying to retrieve the directory list.
Best,
Marcin
On 04.07.2013 16:59, Fishkov, Alexander wrote:
Hi Hieu and Marcin!
>> If either of you has a model (no matter how big)
that reproduces the problem and that I can download, I
will look into it
I have set up an FTP server to share the model, so I am sending this
message in private (not to the mailing list).
ftp://hoang:[email protected]/
The folder structure is as follows:
/lm – contains binary language model (just in case)
/model.fr-en – contains translation model in text
format with moses.ini file
/compact-model.fr-en – contains compact model produced
from the previous one with moses.ini
P.S. I will be out of office until 16 July.
Best regards, Alexander.
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support