Re: [Moses-support] regenerate-makefiles.sh

2011-04-14 Thread John Burger
Barry Haddow wrote:

 You need a more recent version of autoconf

 http://comments.gmane.org/gmane.comp.nlp.moses.user/3978

Ha, just spent an hour on this myself.  Can someone add a note to that  
effect at the top of regenerate-makefiles.sh?  It currently only says  
this:

# NOTE:
# Versions 1.9 (or higher) of aclocal and automake are required.

Adding the following line would be great:

# Version 2.60 (or higher) of autoconf is required

- JB

 On Thursday 14 April 2011 00:06, Javier Murillo wrote:
 Hi all,

 I'm trying to build Moses and get the following errors from
 regenerate-makefiles.sh. I end up with a huge configure file (20,000+
 lines) that seems to be garbled. I would appreciate it if anybody who
 has run across the same type of errors could help with ideas on how to
 fix them. Thank you and regards,
 Javier

 configure.in:130: warning: AC_PROG_GREP is m4_require'd but is not m4_defun'd
 configure.in:130: AC_PROG_GREP is required by...
 m4/boost.m4:215: BOOST_REQUIRE is expanded from...
 configure.in:130: the top level
 configure.in:130: warning: AC_PROG_SED is m4_require'd but is not m4_defun'd
 configure.in:130: AC_PROG_SED is required by...
 autoconf/general.m4:1799: AC_CACHE_VAL is expanded from...
 autoconf/general.m4:1808: AC_CACHE_CHECK is expanded from...
 Calling /usr/bin/autoconf...
 configure.in:130: warning: AC_PROG_GREP is m4_require'd but is not m4_defun'd
 configure.in:130: AC_PROG_GREP is required by...
 m4/boost.m4:215: BOOST_REQUIRE is expanded from...
 configure.in:130: the top level
 configure.in:130: warning: AC_PROG_SED is m4_require'd but is not m4_defun'd
 configure.in:130: AC_PROG_SED is required by...
 autoconf/general.m4:1799: AC_CACHE_VAL is expanded from...
 autoconf/general.m4:1808: AC_CACHE_CHECK is expanded from...
 configure:466: error: possibly undefined macro: BOOST_THREAD_LDFLAGS
  If this token and others are legitimate, please use m4_pattern_allow.
  See the Autoconf documentation.
 configure:466: error: possibly undefined macro: BOOST_CPPFLAGS
 configure:466: error: possibly undefined macro: BOOST_ROOT
 configure:466: error: possibly undefined macro: BOOST_THREAD_LIBS
 configure:19224: error: possibly undefined macro: AC_PROG_GREP
 configure:19226: error: possibly undefined macro: AC_PROG_SED
 configure:20069: error: possibly undefined macro: _AS_ECHO_LOG
 configure:20070: error: possibly undefined macro: _AC_DO_STDERR
 autoconf failed

 Javier Murillo Lopez
 Weather Decision Technologies, Inc.
 201 David L. Boren Blvd, Ste 270
 Norman, OK 73072
 Ph: (405) 579-7675 Ext 243

 http://www.wdtinc.com/
 2011 American Meteorological Society Award for
 Outstanding Services to Meteorology by a Corporation

 -- 
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.

 ___
 Moses-support mailing list
 Moses-support@mit.edu
 http://mailman.mit.edu/mailman/listinfo/moses-support

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support


Re: [Moses-support] regenerate-makefiles.sh

2011-04-14 Thread John Burger
Javier Murillo wrote:

 Thanks for your feedback. While I'm trying to get my IT support to
 update to the latest version, I wonder why 2.59 doesn't work.

 $ autoconf --version
 autoconf (GNU Autoconf) 2.59
 Written by David J. MacKenzie and Akim Demaille.

I'm in the same boat.  AC_PROG_GREP was apparently added in 2.60.  I  
tried just copying the AC_PROG_GREP and AC_PROG_SED defs from a 2.60  
autoconf installation (on a different machine) to the m4 subdir in  
moses, but it still doesn't work.  Now it complains about a bunch of  
BOOST stuff.  I think the Boost checks use deeper features of  
autoconf, which are not present in 2.59.

- JB
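
For anyone who wants to check this before running regenerate-makefiles.sh, here is a minimal sketch that compares the first line of "autoconf --version" against 2.60. The 2.60 threshold is taken from the remark above about AC_PROG_GREP; the function names are made up for illustration and nothing like this exists in the Moses scripts.

```python
import re

# Assumption: 2.60 is the minimum, based on the AC_PROG_GREP remark above.
MIN_AUTOCONF = (2, 60)


def parse_autoconf_version(first_line):
    """Extract (major, minor) from a line like 'autoconf (GNU Autoconf) 2.59'."""
    match = re.search(r"(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("no version number in: %r" % first_line)
    return int(match.group(1)), int(match.group(2))


def autoconf_is_recent_enough(first_line, minimum=MIN_AUTOCONF):
    # Tuples compare element by element, so (2, 59) < (2, 60) < (2, 68).
    return parse_autoconf_version(first_line) >= minimum


print(autoconf_is_recent_enough("autoconf (GNU Autoconf) 2.59"))
```

Comparing the version as a tuple of integers avoids the trap of comparing "2.59" and "2.6" as strings or floats.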


Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists

2011-03-24 Thread John Burger
Lane Schwartz wrote:

 I've examined the n-best lists, and it seems there are at least a  
 couple of interesting cases. In the simplest case, several  
 translations of a given sentence produce the exact same score, and  
 these tied translations appear in different order during different  
 runs. This is a bit odd, but [not] terribly worrisome. The stranger  
 case is when there are two different decoding runs, and for a given  
 sentence, there are translations that appear only in run A, and  
 different translations that only appear in run B.

Both these cases are relevant to something we've occasionally seen,  
which is non-determinism during =tuning=.  This is not surprising  
given the above, since tuning of course involves decoding.  It's hard  
to reproduce, but we have sometimes seen very different weights coming  
out of MERT for the exact same system configurations.  The problem  
here is that even very small differences in tuning can result in  
substantial differences in test results, because of how twitchy BLEU is.

Like many folks, we typically run MERT on a cluster.  This brings up  
another source of non-determinism we've theorized about.  Some of our  
clusters are heterogeneous, and we've wondered if there might be minor  
differences in floating point behavior from machine to machine.  The  
assignment of different chunks of the tuning data to different  
machines is typically non-deterministic, so this might carry over to  
the actual weights that come out of MERT.

Does anyone know how robust the floating point usage in the decoder is  
under these circumstances?

Thanks.

- John Burger
   MITRE
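
To make the floating point worry concrete, here is a small sketch showing that the same numbers summed in a different order can give a different result, plus one way to make the ordering of tied n-best entries reproducible. It illustrates general floating point behaviour only; it says nothing about what the decoder or MERT actually do, and the function names are hypothetical.

```python
def sum_in_order(values):
    """Add the values left to right, as a simple loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


def rank_nbest(hyps):
    """Sort (score, text) pairs by descending score, breaking ties on the text."""
    return sorted(hyps, key=lambda h: (-h[0], h[1]))


scores = [0.1, 0.2, 0.3]
forward = sum_in_order(scores)
backward = sum_in_order(list(reversed(scores)))
# Floating point addition is not associative, so these two sums differ
# in the last bit even though the inputs are identical.
print(forward == backward)
```

If partial scores were combined in an order that depends on how chunks are assigned to machines, differences of this size could be enough to flip the order of near-tied hypotheses.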


Re: [Moses-support] multi-bleu.pl max

2011-02-22 Thread John Burger
John Morgan wrote:

 One more question:
 To get the individual 1-, 2-, 3-, and 4-gram scores you divide by the
 total number of corresponding n-grams.
 From reading the multi-bleu.perl code, the total comes from the n-grams
 in the hypothesis (I think).
 Do you want the total to come from the references or the hypothesis?

BLEU is a precision score, so it comes from the hypothesis.

- John Burger
   MITRE
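
A minimal sketch of the counting discussed in this thread: each hypothesis n-gram count is clipped to the largest count seen in any reference (the "min"), and the denominator is the number of n-grams in the hypothesis. This is an illustration written for this note, not the multi-bleu.perl code, and the function names are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, refs, n):
    """Return (clipped matches, total hypothesis n-grams) for order n."""
    hyp_counts = Counter(ngrams(hyp, n))
    # For each n-gram, the largest count found in any single reference.
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Taking the min clips the hypothesis count to what a reference supports.
    matches = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matches, sum(hyp_counts.values())


print(clipped_precision("the".split(), ["the the the the".split()], 1))
```

With the REF/HYP pair quoted in this thread, the min gives 1 match out of 1 hypothesis unigram; taking the max would claim 4 matches for a one-word hypothesis.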

 On 2/22/11, Loïc BARRAULT loic.barra...@lium.univ-lemans.fr wrote:
 Hi John,

 yes, this is what we want. Consider the following:
 REF : the the the the
 HYP : the
 Choosing the max would give 4 unigram matches instead of only 1.

 Cheers,

 Loïc

 On 22/02/11 01:44, John Morgan wrote:
 Sorry for the empty message.
 The attached file has a segment of code that I think is choosing the
 minimum n-gram match count.
 Is this what you want for BLEU?
 Don't you want the max?
 Thanks,
 John





 --
 Loïc BARRAULT
 LIUM - Equipe LST
 Université du Maine




 -- 
 Regards,
 John J Morgan



Re: [Moses-support] Use of qsub array in moses-parallel.pl

2010-12-16 Thread John Burger
Chris Dyer wrote:

 Would it be possible to have some kind of flag that turns this on or  
 off?

+1, please.

- John D. Burger
   MITRE



Re: [Moses-support] Use of qsub array in moses-parallel.pl

2010-12-16 Thread John Burger
Lane Schwartz wrote:

 John,

 I assume you are saying that you like the current qsub submission  
 mechanism used by moses-parallel.pl, and would like any changes to  
 allow the script to keep working exactly how it is now. Is that  
 correct?

Yes - apologies for my new media terseness. (:

- JB



Re: [Moses-support] Use of qsub array in moses-parallel.pl

2010-12-16 Thread John Burger
Lane Schwartz wrote:

 If you don't mind my asking, I'm curious as to why.

Fear of change. (:

Basically, I haven't used array jobs at all, and I'm not sure our  
installation is set up for them.  However, if everyone thinks that's  
highly unlikely, and array jobs are a very standard thing in SGE, then  
there's no need for a proliferation of config switches, and I withdraw  
my +1 cents.

- JB

 In the current script, for a job split N ways, moses-parallel.pl  
 creates N temporary bash scripts (each of which will call Moses on  
 one part of the data), then launches each of these scripts via a  
 separate invocation of qsub. This results in N unique qsub jobs,  
 each with its own job id.

 In my proposed change, moses-parallel.pl would create 1 temporary  
 bash script, then would launch this one script via one call to qsub.  
 The call to qsub would use the flag -t 1-N. This would result in N  
 qsub tasks, each of which would share a common parent job ID. (You  
 can still identify child tasks, since each array child task also has  
 its own task ID, ranging from 1 to N.)

 Everything else would stay exactly as it is now. If there's a  
 legitimate reason to maintain both, then I'm open to doing so, but I  
 don't know any reason why the current method would be preferable to  
 the proposed method.

 Lane
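
The difference between the two submission schemes described above can be sketched as the command lines each would produce. This only builds argument lists and runs nothing; the script names are hypothetical and this is not the moses-parallel.pl code. Under SGE, a script launched with -t can tell which part it is from the SGE_TASK_ID environment variable.

```python
def per_part_commands(n_parts, script_prefix="part"):
    """Current scheme as described: one script and one qsub call per part."""
    # Script names such as part1.sh are hypothetical placeholders.
    return [["qsub", "%s%d.sh" % (script_prefix, i)] for i in range(1, n_parts + 1)]


def array_command(n_parts, script="all_parts.sh"):
    """Proposed scheme: one script, one qsub call with a task range 1-N."""
    return ["qsub", "-t", "1-%d" % n_parts, script]


print(per_part_commands(3))
print(array_command(3))
```

Either way N pieces of work are queued; the array form does it with a single submission.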



Re: [Moses-support] Proposal to replace vertical bar as factor delimiter

2010-11-16 Thread John Burger
We have yet to use multiple factors, and long ago made our pipeline,  
err, pipe-proof.  I vote for Ondrej's amendment:



- default is non-factored input

- surely keep the --factorDelimiter (but make it clear that it
  does/does not apply also to the phrase, generation and reordering
  tables)

- keep the regular ASCII '|' as the default



- John D. Burger
  MITRE





Re: [Moses-support] Different scores with SRILM and IRSTLM

2010-10-29 Thread John Burger
Kenneth Heafield wrote:

 kenlm's query tool implicitly places <s> at the beginning. It doesn't
 appear in the output, but you can see the effect because the n-gram
 length after the is 2, not 1.

Does this happen when kenlm is called from Moses as well?

There seem to me to be many reasons not to do this:  How do you know  
whether full sentences are being translated?  What if the translation  
model already includes sentence boundary tokens?  (See my recent  
message about why this might be desirable)

But most importantly: How do you know whether the language model was  
trained that way?

- John Burger
   MITRE



Re: [Moses-support] Different scores with SRILM and IRSTLM

2010-10-28 Thread John Burger
Felipe Sánchez Martínez wrote:

 * Does SRILM introduce begin-of-sentence and end-of-sentence tokens
 during training?

Yes, by default I believe - see the -no-sos and -no-eos switches.

 * and, during scoring (or decoding)?

I don't think Moses adds them - it can't know how you trained the LM.   
We add them ourselves, and tell SRILM not to add them.  (We get some  
small gain in BLEU by doing this, by the way.)

 * Does IRSTLM introduce begin-of-sentence and end-of-sentence tokens
 during scoring (or decoding)?

No, unless this has recently changed.

 if I introduce <s> and </s> when scoring with IRSTLM I get a log
 prob of -55.3099 (very similar to that of SRILM).

This makes sense, given the above.

Some of the remaining discrepancy might be explained by the fact that  
you trained the SRILM model with Kneser-Ney discounting, while IRSTLM  
uses Witten-Bell by default.  This doesn't seem sufficient to  
completely explain the discrepancy, though.

- John D. Burger
   MITRE
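
A sketch of the "add them ourselves" step mentioned above: wrap each line in explicit boundary tokens before it reaches the LM toolkit, so that training and scoring see the same thing. It assumes whitespace-tokenized text and that the toolkit has been told not to add its own boundaries; the function name is made up.

```python
def add_boundaries(line, bos="<s>", eos="</s>"):
    """Wrap a tokenized sentence in boundary tokens, without doubling them."""
    tokens = line.split()
    if not tokens:
        return ""  # leave empty lines empty
    if tokens[0] != bos:
        tokens.insert(0, bos)
    if tokens[-1] != eos:
        tokens.append(eos)
    return " ".join(tokens)


print(add_boundaries("this is a test"))
```

Checking for existing tokens keeps the step safe to run twice, which matters when the same data passes through more than one preparation script.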




Re: [Moses-support] wrong alignment

2010-09-24 Thread John Burger
musa ghurab wrote:

 I trained a Chinese-Arabic system, but many alignments are wrong.
 The same goes for the lexical model, where many words are wrongly
 aligned.
 Here is an example from the lexical model (lex.e2f):

The point of Moses is not to get good alignments, but to get good  
translation output.  The target language model will help the decoder  
to pick good translations, even if the translation probabilities that  
come out of the alignment do not appear to be ideal.  A great deal of  
research effort has been wasted (in my opinion) on getting better  
alignments, without actually achieving better translation.

Have you run the resulting models on a test set?  What was the score?   
How big is your language model?  More LM data is probably the easiest  
way to make up for what might appear to be poor alignments.

- John D. Burger
   MITRE



Re: [Moses-support] Problem in Decoding step !

2010-09-20 Thread John Burger
Somayeh Bakhshaei wrote:

 How can it be that Moses translates one sentence into two
 sentences?!
 This is what is happening in my test set.

Moses doesn't know what a sentence is.  Do you mean that your output  
has a period in the middle of the output sequence?  There's nothing  
special about the period as a token, and nothing to prevent Moses from  
emitting it somewhere other than the end of the output (except that  
the LM might make it unlikely to be followed by anything else).  You  
might find that your score goes up if you filter these out.

- John D. Burger
   MITRE



Re: [Moses-support] Handling unknown words in Moses

2010-08-09 Thread John Burger
Philipp Koehn wrote:

 this is not correct - LM cost is in the future cost estimate.
 Obviously, this is a rather low probability, depending
 on if the language model was trained with open or
 closed vocabulary.

And also whether the word is unknown to the LM or not, yes?  Typically  
there are many more words in the language model's vocabulary than in  
the phrase table.

 Unknown words often do cause some strange reordering, due to the
 fact that an unknown word creates an unknown context for following
 words, and some words may prefer more than others to appear in such
 an unknown context.

These issues suggest to me that there might be some gain in dividing  
unknown words into a number of different classes.  (I don't mean Moses  
would do this, but that it would be some sort of pre- and post- 
processing steps that swap real words for a few placeholder tokens.)   
This could be quite simple (UNK_NUM vs. UNK_ALPHA vs. UNK_MIXED) or a  
more sophisticated unsupervised statistical model.

Has anyone tried anything like this, specifically with Moses systems?

Thanks.

- John D. Burger
   MITRE
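
A minimal sketch of the simple version of this idea, as a pre-processing step: tokens outside a known vocabulary are swapped for one of the three placeholder classes named above, and the originals are kept so a post-processing step could put them back. The function names and the character-based rules are assumptions made for illustration, not something Moses provides.

```python
def unk_class(token):
    """Pick a placeholder class from the characters in the token."""
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    if has_digit and has_alpha:
        return "UNK_MIXED"
    if has_digit:
        return "UNK_NUM"
    # Catch-all: purely alphabetic tokens, and anything with no digits at all.
    return "UNK_ALPHA"


def mask_unknowns(tokens, vocabulary):
    """Replace out-of-vocabulary tokens; return the masked list and the originals."""
    masked, originals = [], []
    for token in tokens:
        if token in vocabulary:
            masked.append(token)
        else:
            masked.append(unk_class(token))
            originals.append(token)
    return masked, originals


print(mask_unknowns(["flight", "KL1234", "costs", "1,200", "zlotys"], {"flight", "costs"}))
```

The same masking would have to be applied to the training data for the placeholders to mean anything to the phrase table and the LM.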



Re: [Moses-support] forcing a translation with -xml-input flag

2010-07-06 Thread John Burger
OK, so just to be painfully clear, the five (by default) weights for  
the translation model are not used at all for a phrase from the XML  
markup, correct?  What about the distortion weights?

Thanks.

- John D. Burger
   MITRE

On Jul 6, 2010, at 12:23, Philipp Koehn wrote:

 Hi,

 by default, the translation model probabilities are set to 1,
 but you can specify a different value with prob, i.e.:
 <xml translation="big shoe" prob="0.5"> Riesenstiefel </xml>

 See moses/src/XmlOption.cpp for the code and
 http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc4
 for documentation.

 -phi

 On Tue, Jul 6, 2010 at 4:55 PM, John Burger j...@mitre.org wrote:
 Philipp Koehn wrote:

 there are different modes for treating the XML markup: either
 inclusive or exclusive. In both cases, the specified XML translation
 is added to the set of translation options that can be used by the
 decoder. In the exclusive case, all other translation options that
 cover the same input words are thrown out, so the decoder is forced
 to use the specified translation.

 The specified translation option is treated just like any other
 translation option: it is scored with the language model, etc.

 But where do all the other feature values come from, e.g., the ones
 usually found in the phrase table?  The XML markup allows for only a
 single probability - how is this combined with any LM scores?

 - John D. Burger
   MITRE



Re: [Moses-support] What is the use of the lm parameter in the model training stage?

2010-05-21 Thread John Burger
 the LM is used only to create a formally correct configuration file.
 You can simply set any NON EMPTY file, to complete the training  
 successfully.
 Of course you have to modify the configfile with your good LM before  
 translating

Or you could simply do something like this:

% echo FAKE > factored-corpus/surface.lm
% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

Then you don't have to change the config file later, and you can build  
the lm in parallel with the model.

- John D. Burger
   MITRE



Re: [Moses-support] Combine Berkeley Aligner and GIZA++ in training ?

2010-04-09 Thread John Burger
haithem afli wrote:

 I would like to combine multiple word alignment strategies, in
 order to combine the output of Berkeley Aligner and GIZA++ in
 training.
 Can anyone explain what I can do?

I think a common approach is to run them both, then simply append the  
two versions of the aligned corpora before phrase extraction.

- John D. Burger
   MITRE
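
The "append the two versions" approach can be sketched as follows: the source and target sides are each repeated once, and the two alignment files are concatenated in the same order, so every sentence pair appears twice, once with each aligner's links. The function name and the "0-0 1-1" style alignment lines are assumptions for illustration; check what format your phrase extraction step expects.

```python
def combine_alignments(src_lines, tgt_lines, align_a, align_b):
    """Stack two alignments of the same corpus into one doubled corpus."""
    if not (len(src_lines) == len(tgt_lines) == len(align_a) == len(align_b)):
        raise ValueError("all inputs must have one entry per sentence pair")
    # Repeating the text keeps line i of every output file describing the same pair.
    return src_lines + src_lines, tgt_lines + tgt_lines, align_a + align_b


src = ["das haus", "ein buch"]
tgt = ["the house", "a book"]
aligner_one = ["0-0 1-1", "0-0 1-1"]   # hypothetical output of one aligner
aligner_two = ["0-0 1-1", "0-0"]       # hypothetical output of the other
new_src, new_tgt, new_align = combine_alignments(src, tgt, aligner_one, aligner_two)
print(len(new_src), len(new_tgt), len(new_align))
```

Phrase pairs that both aligners support are then extracted twice, which effectively gives them more weight in the resulting counts.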



Re: [Moses-support] training fails on 1.4million fr-en sentence pairs

2010-03-29 Thread John Burger
 C:\cygwin\home\moses\tools\bin\snt2cooc.out: *** fatal error - cmalloc
 would have returned NULL
 =

 Am I running short of RAM?

Yes - malloc is failing to get more memory.  FWIW, I run phrase  
extraction on a machine with 66G, but that's probably more than is  
necessary.  You could try extracting shorter phrases - I think the  
default is 7, so you could try this:

   train-factored-phrase-model.perl ... --max-phrase-length 4

- John D. Burger
   MITRE



Re: [Moses-support] alignment problem

2009-06-18 Thread John Burger
Catharine Oertel wrote:

 I have a huge problem aligning my source and target language and I
 would appreciate your advice very much.

 The sentence length ratio of my source and target language is on
 average about 9:1. So I have many more words in my source language
 than I have in my target language. I found that the intersect
 alignment method is working much better for me than
 grow-diag-final. However, I do not get satisfactory results, which I
 assume also has to do with the occurrence of ERROR 2.

That is a fairly large ratio - if you tell us your language pair, we  
might have suggestions for different ways to cast the problem.

By ERROR 2, do you mean type II errors, that is, false negatives?

- John D. Burger
   MITRE



Re: [Moses-support] FW: lexical weighting

2009-05-26 Thread John Burger
Bertoldi and Federico (2009) tackle a related problem when combining  
multiple phrase tables:

http://www.aclweb.org/anthology/W/W09/W09-0432.pdf

They have to come up with phrase scores for entries that aren't in all  
of the base tables.  They infer smoothed estimates using lexical  
probabilities.  This may or may not be useful to you.

- John D Burger
   MITRE

On May 22, 2009, at 13:00, Sanne Korzec wrote:

 Hi,

 Thanks for the previous replies.

 I am re-estimating the phrase pair table and enriching it with new
 phrases. The newly added phrases need values for prob, lw, inverse
 prob and inverse lw.

 Sometimes phrases are added in my system whose lexical weights are
 unknown. For some the lw can be calculated, but for others it cannot,
 for reasons I won't explain. I need to make a decision about what to
 do with these unknown values.

 I have considered setting them to a fixed number, e.g. 0.1 or 0.01 or
 even 0.0001. However, I have no clue what the impact of these values
 is.

 I was hoping someone could point me in the right direction. I would
 like to make an educated guess on what this value should be, but I do
 not have enough experience with MT to do this.

 I know from the previous replies that the values from the score
 vector are all multiplied together, after applying an exponential
 weight. But it would also help if someone could give me or point me
 towards the exact formula.

 Thanks in advance,
 Sanne


 -Original Message-
 From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
 Sent: Friday, 8 May 2009 15:29
 To: sa...@kortec.nl
 Subject: Re: [Moses-support] lexical weighting and inverse probabilities

 Hi,

 there should not be any zeros in this table, because that will,
 as you write, lead to an overall zero probability.

 -phi

 On Fri, May 8, 2009 at 11:54 AM, Sanne Korzec sa...@kortec.nl wrote:
 Ok thanks.

 Does this mean that if one of these values is zero in the table,  
 one can
 leave the entry out? Multiplication gives a result of zero. Or does  
 the
 exponential weight compensate for this?

 Sanne

 -Original Message-
 From: phko...@gmail.com [mailto:phko...@gmail.com] On Behalf Of Philipp Koehn
 Sent: Thursday, 7 May 2009 20:07
 To: sa...@kortec.nl
 Cc: moses-support@mit.edu
 Subject: Re: [Moses-support] lexical weighting and inverse probabilities

 Hi,

 they are all multiplied together, after applying an exponential  
 weight.

 -phi

 On Thu, May 7, 2009 at 4:51 PM, Sanne Korzec sa...@kortec.nl wrote:
 Hi,

 The final phrase pair table usually has a score vector of length 5.

 The components are: probability, lexical weights, inverse probability,
 inverse lexical weights and a constant.

 How and why exactly are the lexical weights, the inverse probabilities
 and the inverse lexical weighting used during decoding?

 Sanne
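
The rule quoted in this thread (all components multiplied together, after applying an exponential weight) can be written out as a small sketch. The feature values and weights below are invented for illustration; real weights come from tuning, and this is not the decoder's code.

```python
import math


def phrase_score(features, weights):
    """Product of the feature values, each raised to its weight."""
    if len(features) != len(weights):
        raise ValueError("need one weight per feature")
    score = 1.0
    for value, weight in zip(features, weights):
        score *= value ** weight
    return score


def log_phrase_score(features, weights):
    """The same quantity in log space: a weighted sum of log feature values."""
    return sum(weight * math.log(value) for value, weight in zip(features, weights))


example = phrase_score([0.5, 0.25], [1.0, 0.5])   # 0.5 * sqrt(0.25)
with_zero = phrase_score([0.5, 0.0], [1.0, 0.5])  # one zero wipes out the product
print(example, with_zero)
```

This also shows why the table should not contain zeros: a positive weight cannot compensate for a zero feature value, and in log space the logarithm of zero is not even defined. A small fixed value only shifts the log score by a constant times the weight.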


Re: [Moses-support] processLexicalTable throws std::bad_alloc error

2009-05-19 Thread John Burger
Mirko Plitt wrote:

 To close the loop on this one, in case anyone else runs into this.

 Turns out the reordering table contained a handful of offending lines
 which triggered the abort:

 ^K ||| ^K ||| 0.818182 0.0909091 0.0909091 0.818182 0.0909091 0.0909091
 ^K ||| désactivés ||| 0.6 0.2 0.2 0.6 0.2 0.2
 ^K ||| en ||| 0.2 0.2 0.6 0.2 0.2 0.6
 ^K ||| la ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857

Based on recent experiences with corrupted data in the UN
Chinese-English corpus, I now have something in my data prep pipeline
that strips out any lines, on either side, with any ASCII control
characters.  I do this in Python, but something like the following
would work with Perl:

   perl -ne 'print m/[\000-\010\013\016-\037\177]/ ? "\n" : $_;'

(Control-K is \013.)  This replaces any lines containing such
characters with an empty line.  I run the Python equivalent of this on
both sides of my parallel data, separately.  Later, the
clean-corpus-n.perl script in the Moses training pipeline strips out
the entire pair, since one side has zero tokens.

Note that this works for ASCII or UTF8 data, but something else may be  
appropriate for other character encodings.

- John D. Burger
   MITRE
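
For reference, here is one way the Perl one-liner above could look in Python. This is a sketch written to mirror the same character class, not the actual script mentioned in the message; the names are made up.

```python
import re

# Same ranges as the Perl class: NUL to BS, VT (Control-K), SO to US, and DEL.
# Tab, newline, form feed and carriage return are deliberately left alone.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f]")


def blank_if_control(line):
    """Return an empty line if the line contains an ASCII control character."""
    return "\n" if CONTROL_CHARS.search(line) else line


print(repr(blank_if_control("bad\x0bline\n")))
print(repr(blank_if_control("clean\tline\n")))
```

Replacing the line with an empty one, rather than dropping it, keeps both sides of the parallel corpus the same length so the pairs stay lined up.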




Re: [Moses-support] another lw question

2009-05-11 Thread John Burger
Sanne Korzec wrote:

 I have a question on lexical weighting from the paper: Koehn, Och and
 Marcu, Statistical Phrase-Based Translation. On page 5, subsection
 4.4 Lexical weighting, an example is given of how to compute lexical
 weights.

...

 But then, how can source word f2 be mapped to two target words?  
 Viterbi alignments only allow each source word to be mapped to one  
 target word. What’s going on here?

I haven't refreshed my memory of that paper, but I suspect these
alignments are after symmetrization, where the Viterbi alignments from
both directions are (heuristically) merged.  This often produces
many-to-many alignments.

- John D. Burger
   MITRE
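
A small sketch of how merging two one-directional alignments can yield many-to-many links. It shows only the two simplest merges, intersection and union; the heuristics used in practice sit somewhere between the two. The link sets are invented and the function name is hypothetical.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Merge two directional alignments.

    src_to_tgt holds (source index, target index) links, one per source word.
    tgt_to_src holds (target index, source index) links, one per target word.
    Returns (intersection, union) as sets of (source, target) links.
    """
    forward = set(src_to_tgt)
    backward = {(src, tgt) for (tgt, src) in tgt_to_src}
    return forward & backward, forward | backward


# Each direction on its own is at most one link per word on its own side.
forward_links = {(0, 0), (1, 1), (2, 2)}
backward_links = {(0, 0), (1, 1), (2, 1)}
intersection, union = symmetrize(forward_links, backward_links)
print(sorted(intersection))
print(sorted(union))
```

In the union, source word 1 is linked to target words 1 and 2, and target word 2 is linked to source words 1 and 2, which neither direction could express alone.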




Re: [Moses-support] another lw question

2009-05-11 Thread John Burger
I can't say, depends on what you're doing, I suppose.  I think running  
GIZA in both directions and then merging the alignments in some  
fashion is now widely accepted as The Right Thing To Do, at least in  
terms of translation performance.  Your mileage may vary for other  
pursuits.

- John D. Burger
   MITRE

On May 11, 2009, at 10:27, Sanne Korzec wrote:

 Ok thanks.

 I only have access to the GIZA-produced Viterbi alignments. Will it
 distort my experiments much if I use these instead?

 Regards,
 Sanne




Re: [Moses-support] Giza++ input tokens (templates)

2009-02-27 Thread John Burger
James Read wrote:

 Forgive me for my ignorance but what exactly is the problem with using
 Giza++ for n-gram alignment? A single word is just a string of
 letters. An n-gram is a string of letters with some spaces in between.
 Why should using Giza for aligning strings of letters with spaces in
 between be any different to aligning strings of letters? Is this just
 a problem of computation time and limited computational resources?

Ngrams are not simply words with spaces in them - ngrams =overlap=,  
while words do not.

- John D. Burger
   MITRE
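
To illustrate the overlap point: the n-grams of a sentence share words with their neighbours, so they do not partition the sentence the way words do. A minimal sketch, with a made-up helper name:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "the cat sat".split()
bigrams = ngrams(words, 2)
# "cat" is the second word of one bigram and the first word of the next.
print(bigrams)
```

Three words give two bigrams covering four word positions in total, so treating each bigram as an independent token would count the middle word twice.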



Re: [Moses-support] Future costs calculation in MOSES

2009-02-18 Thread John Burger
Hieu Hoang wrote:

 i think you're asking why the unigram and bigram LM scores of the  
 1st two words are used to calculate future scores when the LM is a  
 trigram.

Just an aside - you're only talking about the LM used for the future  
score, correct?  The order of the main LM is whatever we build with  
SRILM or IRSTLM, etc.  I presume Moses doesn't even have to know many  
of the details of this LM, it just hands a partially generated output  
sequence to the LM library.

Another aside - assuming I'm correct above, where does the future  
score trigram LM come from?

Thanks.

- John D. Burger
   MITRE



Re: [Moses-support] Language model

2008-11-18 Thread John Burger
Michael Zuckerman wrote:

 Could you please explain about the format of .lm file generated by  
 the script ngram-count.

http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html

- John D. Burger
   MITRE



Re: [Moses-support] mert stop-continue error

2008-09-26 Thread John Burger
musa ghurab wrote:

 I was running mert-moses.pl and it was working fine. Then I stopped
 it after 30 hours of tuning and continued using the option
 --continue. After 50 hours I stopped it again, but this time I
 couldn't continue; I got the following error:
 Failed to find the step number, failed to read finished_step.txt at
 training/mert-moses.pl line 436.

I don't have anything too concrete to say, but I have had a similar  
issue where MERT couldn't restart because it couldn't find the weights  
file.  I suspect that if you kill MERT just right, it can't recover,  
presumably because it is not doing atomic updates.  This only happened  
to me once, however.

- John Burger
   MITRE
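
For what an atomic update of a small state file might look like, here is a minimal sketch: write the new contents to a temporary file in the same directory, then rename it over the old one, so a reader sees either the old file or the new one and never a half-written file. This is a general technique, not what mert-moses.pl does; the function name is made up, and finished_step.txt is used only because it appears in the error message above.

```python
import os
import tempfile


def write_atomically(path, text):
    """Replace the file at path with text, never leaving a partial file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Killing the process at any point then leaves either the previous step number or the new one on disk, which is what a restart needs.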