Re: [Moses-support] regenerate-makefiles.sh
Barry Haddow wrote:

  You need a more recent version of autoconf:
  http://comments.gmane.org/gmane.comp.nlp.moses.user/3978

Ha, just spent an hour on this myself. Can someone add a note to that effect at the top of regenerate-makefiles.sh? It currently only says this:

  # NOTE:
  # Versions 1.9 (or higher) of aclocal and automake are required.

Adding the following line would be great:

  # Version 2.60 (or higher) of autoconf is required

- JB

On Thursday 14 April 2011 00:06, Javier Murillo wrote:

  Hi all,

  I'm trying to build Moses and get the following errors from regenerate-makefiles.sh. I end up with a huge configure file (20,000+ lines) that seems to be garbled. I would appreciate it if anybody who has run across the same type of errors can help with ideas on how to fix them.

  Thank you and regards,
  Javier

    configure.in:130: warning: AC_PROG_GREP is m4_require'd but is not m4_defun'd
    configure.in:130: AC_PROG_GREP is required by...
    m4/boost.m4:215: BOOST_REQUIRE is expanded from...
    configure.in:130: the top level
    configure.in:130: warning: AC_PROG_SED is m4_require'd but is not m4_defun'd
    configure.in:130: AC_PROG_SED is required by...
    autoconf/general.m4:1799: AC_CACHE_VAL is expanded from...
    autoconf/general.m4:1808: AC_CACHE_CHECK is expanded from...
    Calling /usr/bin/autoconf...
    configure.in:130: warning: AC_PROG_GREP is m4_require'd but is not m4_defun'd
    configure.in:130: AC_PROG_GREP is required by...
    m4/boost.m4:215: BOOST_REQUIRE is expanded from...
    configure.in:130: the top level
    configure.in:130: warning: AC_PROG_SED is m4_require'd but is not m4_defun'd
    configure.in:130: AC_PROG_SED is required by...
    autoconf/general.m4:1799: AC_CACHE_VAL is expanded from...
    autoconf/general.m4:1808: AC_CACHE_CHECK is expanded from...
    configure:466: error: possibly undefined macro: BOOST_THREAD_LDFLAGS
          If this token and others are legitimate, please use m4_pattern_allow.
          See the Autoconf documentation.
    configure:466: error: possibly undefined macro: BOOST_CPPFLAGS
    configure:466: error: possibly undefined macro: BOOST_ROOT
    configure:466: error: possibly undefined macro: BOOST_THREAD_LIBS
    configure:19224: error: possibly undefined macro: AC_PROG_GREP
    configure:19226: error: possibly undefined macro: AC_PROG_SED
    configure:20069: error: possibly undefined macro: _AS_ECHO_LOG
    configure:20070: error: possibly undefined macro: _AC_DO_STDERR
    autoconf failed

  Javier Murillo Lopez
  Weather Decision Technologies, Inc.
  201 David L. Boren Blvd, Ste 270
  Norman, OK 73072
  Ph: (405) 579-7675 Ext 243
  http://www.wdtinc.com/
  2011 American Meteorological Society Award for Outstanding Services to Meteorology by a Corporation

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

___
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
Re: [Moses-support] regenerate-makefiles.sh
Javier Murillo wrote:

  Thanks for your feedback. While I'm trying to have my IT support update to the latest version, though, I wonder why 2.59 doesn't work.

    $ autoconf --version
    autoconf (GNU Autoconf) 2.59
    Written by David J. MacKenzie and Akim Demaille.

I'm in the same boat. AC_PROG_GREP was apparently added in 2.60. I tried just copying the AC_PROG_GREP and AC_PROG_SED definitions from a 2.60 autoconf installation (on a different machine) into the m4 subdir in Moses, but it still doesn't work: now it complains about a bunch of BOOST stuff. I think the Boost checks use deeper features of autoconf that are not present in 2.59.

- JB

-----Original Message-----
From: John Burger
Sent: Thursday, April 14, 2011 9:55 AM
To: Moses-support
Subject: Re: [Moses-support] regenerate-makefiles.sh

  Barry Haddow wrote:

    You need a more recent version of autoconf:
    http://comments.gmane.org/gmane.comp.nlp.moses.user/3978

  Ha, just spent an hour on this myself. Can someone add a note to that effect at the top of regenerate-makefiles.sh?

  - JB
Re: [Moses-support] Nondeterminism during decoding: same config, different n-best lists
Lane Schwartz wrote:

  I've examined the n-best lists, and it seems there are at least a couple of interesting cases. In the simplest case, several translations of a given sentence produce the exact same score, and these tied translations appear in different order during different runs. This is a bit odd, but not terribly worrisome. The stranger case is when there are two different decoding runs, and for a given sentence, there are translations that appear only in run A, and different translations that appear only in run B.

Both these cases are relevant to something we've occasionally seen, which is non-determinism during =tuning=. This is not surprising given the above, since tuning of course involves decoding. It's hard to reproduce, but we have sometimes seen very different weights coming out of MERT for the exact same system configuration. The problem here is that even very small differences in tuning can result in substantial differences in test results, because of how twitchy BLEU is.

Like many folks, we typically run MERT on a cluster. This brings up another source of non-determinism we've theorized about. Some of our clusters are heterogeneous, and we've wondered if there might be minor differences in floating point behavior from machine to machine. The assignment of different chunks of the tuning data to different machines is typically non-deterministic, so this might carry over to the actual weights that come out of MERT. Does anyone know how robust the floating point usage in the decoder is under these circumstances?

Thanks.

- John Burger
  MITRE
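The floating-point concern above is easy to demonstrate: IEEE-754 addition is not associative, so merely combining per-machine partial results in a different order can change the outcome. A minimal illustration (not Moses code):

```python
# IEEE-754 double addition is not associative: regrouping the same
# three constants changes the low bits of the sum.  If cluster
# scheduling changes the order in which per-chunk statistics are
# merged, scores can differ between runs even on identical hardware.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)      # False
print(abs(a - b))  # tiny but nonzero
```

The difference is on the order of one unit in the last place, but MERT's line search can amplify it into visibly different weight vectors.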
Re: [Moses-support] multi-bleu.pl max
John Morgan wrote:

  One more question: to get the individual 1-, 2-, 3-, and 4-gram scores you divide by a total number of corresponding ngrams. From reading the multi-bleu.perl code, the total comes from ngrams in the hypothesis (I think). Do you want the total to come from the references or the hypothesis?

BLEU is a precision score, so it comes from the hypothesis.

- John Burger
  MITRE

On 2/22/11, Loïc BARRAULT loic.barra...@lium.univ-lemans.fr wrote:

  Hi John,

  yes, this is what we want. Consider the following:

    REF: the the the the
    HYP: the

  Choosing the max would give 4 unigram matches instead of only 1.

  Cheers,
  Loïc

  On 22/02/11 01:44, John Morgan wrote:

    Sorry for the empty message. The attached file has a segment of code that I think is choosing the minimum ngram match count. Is this what you want for BLEU? Don't you want the max?

    Thanks,
    John

  --
  Loïc BARRAULT
  LIUM - Equipe LST
  Université du Maine

--
Regards,
John J Morgan
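Loïc's example can be made concrete. BLEU uses *clipped* counts: each hypothesis n-gram is credited at most as many times as it appears in the most generous reference (hence the min), and the denominator comes from the hypothesis. A minimal sketch, not the actual multi-bleu.perl code:

```python
from collections import Counter

def clipped_precision(hyp, refs, n=1):
    """Clipped n-gram precision: each hypothesis n-gram counts only up
    to its maximum count in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    hyp_counts = ngrams(hyp)
    max_ref = Counter()
    for ref in refs:
        for gram, c in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], c)
    matches = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return matches / max(1, sum(hyp_counts.values()))

# Loïc's example: REF "the the the the", HYP "the".
# Taking max instead of min would credit 4 matches for 1 hypothesis token.
print(clipped_precision("the".split(), ["the the the the".split()]))  # 1.0
```

Note the converse case: HYP "the the the the" against REF "the cat" gets only 1 clipped match out of 4 hypothesis unigrams, precision 0.25, which is exactly the degenerate output clipping is meant to penalize.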
Re: [Moses-support] Use of qsub array in moses-parallel.pl
Chris Dyer wrote:

  Would it be possible to have some kind of flag that turns this on or off?

+1, please.

- John D. Burger
  MITRE
Re: [Moses-support] Use of qsub array in moses-parallel.pl
Lane Schwartz wrote:

  John, I assume you are saying that you like the current qsub submission mechanism used by moses-parallel.pl, and would like any changes to allow the script to keep working exactly how it is now. Is that correct?

Yes - apologies for my new-media terseness. (:

- JB

On Thu, Dec 16, 2010 at 10:30 AM, John Burger j...@mitre.org wrote:

  Chris Dyer wrote:

    Would it be possible to have some kind of flag that turns this on or off?

  +1, please.

  - John D. Burger
    MITRE

--
"When a place gets crowded enough to require ID's, social collapse is not far away. It is time to go elsewhere. The best thing about space travel is that it made it possible to go elsewhere."
-- R.A. Heinlein, Time Enough For Love
Re: [Moses-support] Use of qsub array in moses-parallel.pl
Lane Schwartz wrote:

  If you don't mind my asking, I'm curious as to why.

Fear of change. (: Basically, I haven't used array jobs at all, and I'm not sure our installation is set up for them. However, if everyone thinks that's highly unlikely, and array jobs are a very standard thing in SGE, then there's no need for a proliferation of config switches, and I withdraw my +1 cents.

- JB

  In the current script, for a job split N ways, moses-parallel.pl creates N temporary bash scripts (each of which will call Moses on one part of the data), then launches each of these scripts via a separate invocation of qsub. This results in N unique qsub jobs, each with its own job id.

  In my proposed change, moses-parallel.pl would create 1 temporary bash script, then would launch this one script via one call to qsub. The call to qsub would use the flag -t 1-N. This would result in N qsub jobs, each of which would share a common parent task ID. (You can still identify child jobs, since each array child task also has its own task index, ranging from 1 to N.) Everything else would stay exactly as it is now.

  If there's a legitimate reason to maintain both, then I'm open to doing so, but I don't know any reason why the current method would be preferable to the proposed method.

  Lane
Re: [Moses-support] Proposal to replace vertical bar as factor delimiter
We have yet to use multiple factors, and long ago made our pipeline, err, pipe-proof. I vote for Ondrej's amendment:

- default is non-factored input
- surely keep the --factorDelimiter (but make it clear whether or not it also applies to the phrase, generation and reordering tables)
- keep the regular ASCII '|' as the default

- John D. Burger
  MITRE
Re: [Moses-support] Different scores with SRILM and IRSTLM
Kenneth Heafield wrote:

  kenlm's query tool implicitly places <s> at the beginning. It doesn't appear in the output, but you can see the effect because the n-gram length after the <s> is 2, not 1.

Does this happen when kenlm is called from Moses as well? There seem to me to be many reasons not to do this:

How do you know whether full sentences are being translated?

What if the translation model already includes sentence boundary tokens? (See my recent message about why this might be desirable.)

But most importantly: how do you know whether the language model was trained that way?

- John Burger
  MITRE
Re: [Moses-support] Different scores with SRILM and IRSTLM
Felipe Sánchez Martínez wrote:

  * Does SRILM introduce begin-of-sentence and end-of-sentence tokens during training?

Yes, by default I believe - see the -no-sos and -no-eos switches.

  * and, during scoring (or decoding)?

I don't think Moses adds them - it can't know how you trained the LM. We add them ourselves, and tell SRILM not to add them. (We get some small gain in BLEU by doing this, by the way.)

  * Does IRSTLM introduce begin-of-sentence and end-of-sentence tokens during scoring (or decoding)?

No, unless this has recently changed.

  If I introduce <s> and </s> when scoring with IRSTLM, I get a log prob of -55.3099 (very similar to that of SRILM).

This makes sense, given the above. Some of the remaining discrepancy might be explained by the fact that you trained the SRILM model with Kneser-Ney discounting, while IRSTLM uses Witten-Bell by default. This doesn't seem sufficient to completely explain the discrepancy, though.

- John D. Burger
  MITRE
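The convention described above ("we add them ourselves, and tell SRILM not to") amounts to a trivial preprocessing step; a minimal sketch, assuming plain tokenized one-sentence-per-line input:

```python
def add_boundaries(line):
    # Wrap each training sentence with explicit boundary tokens, then
    # pass -no-sos and -no-eos to SRILM so the tokens are not added a
    # second time at training or scoring.
    return "<s> " + line.strip() + " </s>"

add_boundaries("this is a test")  # '<s> this is a test </s>'
```

Scoring must then be done the same way, which is exactly why the IRSTLM log prob only matched SRILM's once <s> and </s> were inserted by hand.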
Re: [Moses-support] wrong alignment
musa ghurab wrote:

  I trained a Chinese-Arabic system, but many alignments are wrong. The same goes for the lexical model, where many words are wrongly aligned. Here is an example from the lexical model (lex.e2f):

The point of Moses is not to get good alignments, but to get good translation output. The target language model will help the decoder to pick good translations, even if the translation probabilities that come out of the alignment do not appear to be ideal. A great deal of research effort has been wasted (in my opinion) on getting better alignments without actually achieving better translation.

Have you run the resulting models on a test set? What was the score? How big is your language model? More LM data is probably the easiest way to make up for what might appear to be poor alignments.

- John D. Burger
  MITRE
Re: [Moses-support] Problem in Decoding step!
Somayeh Bakhshaei wrote:

  How can it be that Moses translates one sentence into two sentences?! This is what is happening in my test set.

Moses doesn't know what a sentence is. Do you mean that your output has a period in the middle of the output sequence? There's nothing special about the period as a token, and nothing to prevent Moses from emitting it somewhere other than the end of the output (except that the LM might make it unlikely to be followed by anything else). You might find that your score goes up if you filter these out.

- John D. Burger
  MITRE
Re: [Moses-support] Handling unknown words in Moses
Philipp Koehn wrote:

  this is not correct - LM cost is in the future cost estimate. Obviously, this is a rather low probability, depending on whether the language model was trained with open or closed vocabulary.

And also whether the word is unknown to the LM or not, yes? Typically there are many more words in the language model's vocabulary than in the phrase table.

  The reordering of unknown words does often cause some strange reordering, due to the fact that an unknown word creates an unknown context for following words, and some words may prefer more than others to appear in such an unknown context.

These issues suggest to me that there might be some gain in dividing unknown words into a number of different classes. (I don't mean Moses would do this, but that it would be some sort of pre- and post-processing steps that swap real words for a few placeholder tokens.) This could be quite simple (UNK_NUM vs. UNK_ALPHA vs. UNK_MIXED) or a more sophisticated unsupervised statistical model. Has anyone tried anything like this, specifically with Moses systems?

Thanks.

- John D. Burger
  MITRE
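To make the placeholder idea concrete, here is a minimal sketch of the kind of pre-processing step suggested above. The class names and rules are purely hypothetical, not anything in Moses:

```python
import re

def unk_class(word):
    # Illustrative three-class inventory (UNK_NUM / UNK_ALPHA / UNK_MIXED);
    # a real system might use finer classes or an unsupervised model.
    if re.fullmatch(r"[0-9]+([.,][0-9]+)*", word):
        return "UNK_NUM"
    if word.isalpha():
        return "UNK_ALPHA"
    return "UNK_MIXED"

def mask_oov(tokens, vocab):
    # Replace out-of-vocabulary tokens with their placeholder class
    # before decoding; a post-processing step would map the surviving
    # placeholders back to the original source words.
    return [tok if tok in vocab else unk_class(tok) for tok in tokens]

vocab = {"the", "price", "rose", "to"}
masked = mask_oov("the price rose to 12,300 yen".split(), vocab)
# masked == ['the', 'price', 'rose', 'to', 'UNK_NUM', 'UNK_ALPHA']
```

The point of the masking is that the LM and reordering model then see a small, frequent set of placeholder tokens with learnable context behavior, instead of a different unseen token each time.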
Re: [Moses-support] forcing a translation with -xml-input flag
OK, so just to be painfully clear: the five (by default) weights for the translation model are not used at all for a phrase from the XML markup, correct? What about the distortion weights?

Thanks.

- John D. Burger
  MITRE

On Jul 6, 2010, at 12:23, Philipp Koehn wrote:

  Hi,

  by default, the translation model probabilities are set to 1, but you can specify a different value with prob, i.e.:

    <xml translation="big shoe" prob="0.5"> Riesenstiefel </xml>

  See moses/src/XmlOption.cpp for the code and http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc4 for documentation.

  -phi

  On Tue, Jul 6, 2010 at 4:55 PM, John Burger j...@mitre.org wrote:

    Philipp Koehn wrote:

      there are different modes for treating the XML markup: either inclusive or exclusive. In both cases, the specified XML translation is added to the set of translation options that can be used by the decoder. In the exclusive case, all other translation options that cover the same input words are thrown out, so the decoder is forced to use the specified translation. The specified translation option is treated just like any other translation option: it is scored with the language model, etc.

    But where do all the other feature values come from, e.g., the ones usually found in the phrase table? The XML markup allows for only a single probability - how is this combined with any LM scores?

    - John D. Burger
      MITRE
Re: [Moses-support] What is the use of the lm parameter in the model training stage?
  the LM is used only to create a formally correct configuration file. You can simply set any NON EMPTY file to complete the training successfully. Of course, you have to modify the config file with your good LM before translating.

Or you could simply do something like this:

  % echo FAKE > factored-corpus/surface.lm
  % train-model.perl \
      --corpus factored-corpus/proj-syndicate \
      --root-dir unfactored \
      --f de --e en \
      --lm 0:3:factored-corpus/surface.lm:0

Then you don't have to change the config file later, and you can build the LM in parallel with the model.

- John D. Burger
  MITRE
Re: [Moses-support] Combine Berkeley Aligner and GIZA++ in training ?
haithem afli wrote:

  I would like to combine multiple word alignment strategies, in order to combine the output of the Berkeley Aligner and GIZA++ in training. Can anyone explain to me what I can do?

I think a common approach is to run them both, then simply append the two versions of the aligned corpora before phrase extraction.

- John D. Burger
  MITRE
Re: [Moses-support] training fails on 1.4million fr-en sentence pairs
  C:\cygwin\home\moses\tools\bin\snt2cooc.out: *** fatal error - cmalloc would have returned NULL

  Am I running short of RAM?

Yes - malloc is failing to get more memory. FWIW, I run phrase extraction on a machine with 66G, but that's probably more than is necessary.

You could try extracting shorter phrases - I think the default is 7, so you could try this:

  train-factored-phrase-model.perl ... --max-phrase-length 4

- John D. Burger
  MITRE
Re: [Moses-support] alignment problem
Catharine Oertel wrote:

  I have a huge problem aligning my source and target language and I would appreciate your advice very much. The sentence length ratio of my source and target language is on average about 9:1, so I have many more words in my source language than in my target language. I found that the intersect alignment method works much better for me than grow-diag-final. However, I do not get satisfactory results, which I assume also has to do with the occurrence of ERROR 2.

That is a fairly large ratio - if you tell us your language pair, we might have suggestions for different ways to cast the problem.

By ERROR 2, do you mean type II errors, that is, false negatives?

- John D. Burger
  MITRE
Re: [Moses-support] FW: lexical weighting
Bertoldi and Federico (2009) tackle a related problem when combining multiple phrase tables: http://www.aclweb.org/anthology/W/W09/W09-0432.pdf

They have to come up with phrase scores for entries that aren't in all of the base tables. They infer smoothed estimates using lexical probabilities. This may or may not be useful to you.

- John D Burger
  MITRE

On May 22, 2009, at 13:00, Sanne Korzec wrote:

  Hi,

  Thanks for the previous replies. I am re-estimating the phrase pair table and enriching it with new phrases. The newly added phrases need values for prob, lw, inverse prob and inverse lw. Sometimes phrases are added in my systems whose lexical weights are unknown. For some, the lw can be calculated, but for some it cannot, for reasons I won't explain. I need to decide what to do with these unknown values.

  I have considered setting them to a fixed number, e.g. 0.1 or 0.01 or even 0.0001. I have, however, no clue what the impact of these values is. I was hoping someone could point me in the right direction. I would like to make an educated guess on what this value should be, but I do not have enough experience with MT to do this.

  I know from the previous replies that the values from the score vector are all multiplied together, after applying an exponential weight. But it would also help if someone could give me or point me towards the exact formula.

  Thanks in advance,
  Sanne

  -----Original Message-----
  From: Philipp Koehn
  Sent: Friday, 8 May 2009 15:29
  Subject: Re: [Moses-support] lexical weighting and inverse probabilities

    Hi,

    there should not be any zeros in this table, because that will, as you write, lead to an overall zero probability.

    -phi

    On Fri, May 8, 2009 at 11:54 AM, Sanne Korzec wrote:

      Ok, thanks. Does this mean that if one of these values is zero in the table, one can leave the entry out? Multiplication gives a result of zero. Or does the exponential weight compensate for this?

      Sanne

      -----Original Message-----
      From: Philipp Koehn
      Sent: Thursday, 7 May 2009 20:07

        Hi,

        they are all multiplied together, after applying an exponential weight.

        -phi

        On Thu, May 7, 2009 at 4:51 PM, Sanne Korzec wrote:

          Hi,

          The final phrase pair table usually has a score vector of length 5. The components are: probability, lexical weights, inverse probability, inverse lexical weights, and a constant. How and why are the lexical weights, the inverse probabilities and the inverse lexical weights exactly used during decoding?

          Sanne
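The formula asked for above is the standard log-linear model: the score is the product of each feature value raised to its tuned weight, Π_i h_i^λ_i, usually computed in log space as Σ_i λ_i · log h_i. A hedged sketch (the weights below are invented, not tuned values):

```python
import math

def phrase_score(features, weights):
    # Log-linear combination: sum of weight * log(feature).  A zero
    # feature value drives the score to -infinity, which is why a zero
    # in the table effectively removes the entry; the exponential
    # weight cannot compensate, since 0**w is still 0 for any w > 0.
    if any(h == 0.0 for h in features):
        return float("-inf")
    return sum(w * math.log(h) for h, w in zip(features, weights))

# Five scores as in the phrase table: p(f|e), lex(f|e), p(e|f),
# lex(e|f), and the constant phrase penalty (e = 2.718...).
features = [0.2, 0.5, 0.1, 0.4, math.e]
weights = [0.2, 0.2, 0.2, 0.2, 0.2]  # invented, for illustration only
score = phrase_score(features, weights)
```

This also answers the zero-entry question directly: a zero anywhere in the vector makes the whole product zero (log score minus infinity), so such an entry can never win.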
Re: [Moses-support] processLexicalTable throws std::bad_alloc error
Mirko Plitt wrote:

  To close the loop on this one, in case anyone else runs into this. Turns out the reordering table contained a handful of offending lines which triggered the abort:

    ^K ||| ^K ||| 0.818182 0.0909091 0.0909091 0.818182 0.0909091 0.0909091
    ^K ||| désactivés ||| 0.6 0.2 0.2 0.6 0.2 0.2
    ^K ||| en ||| 0.2 0.2 0.6 0.2 0.2 0.6
    ^K ||| la ||| 0.714286 0.142857 0.142857 0.714286 0.142857 0.142857

Based on recent experiences with corrupted data in the UN Chinese-English corpus, I now have something in my data prep pipeline that strips out any lines, on either side, with any ASCII control characters. I do this in Python, but something like the following would work with Perl:

  perl -ne 'print m/[\000-\010\013\016-\037\177]/ ? "\n" : $_;'

(Control-K is \013.) This replaces any lines containing such characters with an empty line. I run the Python equivalent of this on both sides of my parallel data, separately. Later, the clean-corpus-n.perl script in the Moses training pipeline strips out the entire pair, since one side has zero tokens. Note that this works for ASCII or UTF-8 data, but something else may be appropriate for other character encodings.

- John D. Burger
  MITRE
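Since the message mentions a Python equivalent without showing it, here is a minimal sketch (the author's actual script may differ). The character class matches the Perl one-liner: ASCII control characters other than tab, newline, form feed, and carriage return:

```python
import re

# Same class as the Perl one-liner: ASCII controls except tab (\011),
# newline (\012), form feed (\014) and carriage return (\015).
CONTROL = re.compile(r"[\000-\010\013\016-\037\177]")

def blank_bad_lines(lines):
    # Replace any line containing a control character with an empty
    # line; clean-corpus-n.perl later drops the whole sentence pair,
    # keeping the two sides of the parallel corpus in step.
    return ["\n" if CONTROL.search(line) else line for line in lines]

lines = ["a normal line\n", "bad \x0b line\n"]  # \x0b is Control-K
cleaned = blank_bad_lines(lines)
# cleaned == ["a normal line\n", "\n"]
```

Blanking rather than deleting is the important design choice: it preserves line numbering so the source and target files stay parallel until the cleanup script removes both halves of each bad pair.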
Re: [Moses-support] another lw question
Sanne Korzec wrote:

  I have a question on lexical weighting from the paper: Philipp Koehn, Och, Marcu, "Statistical Phrase-Based Translation". On page 5, subsection 4.4, Lexical weighting, an example is given of how to compute lexical weights. ... But then, how can source word f2 be mapped to two target words? Viterbi alignments only allow each source word to be mapped to one target word. What's going on here?

I haven't refreshed my memory of that paper, but I suspect these alignments are after symmetrization, where the Viterbi alignments from both directions are (heuristically) merged. This often produces many-to-many alignments.

- John D. Burger
  MITRE
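A minimal sketch of why symmetrization yields many-to-many links. This shows only intersection and union; Moses' actual grow-diag-final heuristic is more involved, and the toy indices are invented for illustration:

```python
def symmetrize(src2tgt, tgt2src, method="union"):
    # Each direction's Viterbi alignment is a set of (src, tgt) links
    # in which every source (resp. target) word has at most one partner.
    a = set(src2tgt)
    b = {(i, j) for (j, i) in tgt2src}  # flip the reverse run's links
    return a & b if method == "intersection" else a | b

# Toy alignments, 0-based: the f->e run links f1-e1 and f2-e2;
# the e->f run links e1-f1 and e3-f2.
f2e = {(0, 0), (1, 1)}
e2f = {(0, 0), (2, 1)}
merged = symmetrize(f2e, e2f)
# merged == {(0, 0), (1, 1), (1, 2)}: f2 is now linked to BOTH e2 and
# e3, even though neither single direction allowed that.
```

So a source word mapped to two target words, as in the paper's example, is exactly what the merged alignment can produce.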
Re: [Moses-support] another lw question
I can't say - depends on what you're doing, I suppose. I think running GIZA in both directions and then merging the alignments in some fashion is now widely accepted as The Right Thing To Do, at least in terms of translation performance. Your mileage may vary for other pursuits.

- John D. Burger
  MITRE

On May 11, 2009, at 10:27, Sanne Korzec wrote:

  Ok, thanks. I only have access to the GIZA-produced Viterbi alignments. Will it distort my experiments much if I use these instead?

  Regards,
  Sanne
Re: [Moses-support] Giza++ input tokens (templates)
James Read wrote:

  Forgive me for my ignorance, but what exactly is the problem with using GIZA++ for n-gram alignment? A single word is just a string of letters. An n-gram is a string of letters with some spaces in between. Why should using GIZA for aligning strings of letters with spaces in between be any different from aligning strings of letters? Is this just a problem of computation time and limited computational resources?

N-grams are not simply words with spaces in them - n-grams =overlap=, while words do not.

- John D. Burger
  MITRE
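The distinction is easy to see: extracting n-grams from a sentence produces overlapping units that share tokens, whereas the words themselves partition the sentence. A quick illustration:

```python
def ngrams(tokens, n):
    # Consecutive, overlapping n-grams: adjacent bigrams share a token,
    # so the "alignment units" are not disjoint the way words are.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams("the cat sat down".split(), 2)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```

Here 'cat' and 'sat' each occur in two different units, which breaks the aligner's assumption that the units of the two sentences can be put into correspondence independently.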
Re: [Moses-support] Future costs calculation in MOSES
Hieu Hoang wrote:

  i think you're asking why the unigram and bigram LM scores of the 1st two words are used to calculate future scores when the LM is a trigram.

Just an aside - you're only talking about the LM used for the future score, correct? The order of the main LM is whatever we build with SRILM or IRSTLM, etc. I presume Moses doesn't even have to know many of the details of this LM; it just hands a partially generated output sequence to the LM library.

Another aside - assuming I'm correct above, where does the future-score trigram LM come from?

Thanks.

- John D. Burger
  MITRE
Re: [Moses-support] Language model
Michael Zuckerman wrote:

  Could you please explain the format of the .lm file generated by the script ngram-count?

http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html

- John D. Burger
  MITRE
Re: [Moses-support] mert stop-continue error
musa ghurab wrote:

  I was running mert-moses.pl and it was working fine. I stopped it after 30 hours of tuning and then continued using the option --continue. After 50 hours I stopped it again, but this time I couldn't continue; I got the following error:

    Failed to find the step number, failed to read finished_step.txt at training/mert-moses.pl line 436.

I don't have anything too concrete to say, but I have had a similar issue where MERT couldn't restart because it couldn't find the weights file. I suspect that if you kill MERT just right, it can't recover, presumably because it is not doing atomic updates. This only happened to me once, however.

- John Burger
  MITRE