Re: [Moses-support] FW: Moses: Prepare Data, Build Language Model and Train Model

Llio Humphreys Fri, 01 Aug 2008 09:48:43 -0700

Dear Hieu,
this is most useful.  Thank you very much for the lead.  Do you know
the giza program I need to amend?  I take it that the file should not
be overwritten.  Is this the same filename always or does it depend on
the input I give the system?
Many thanks,
Llio Humphreys


On Fri, Aug 1, 2008 at 5:07 PM, Hieu Hoang <[EMAIL PROTECTED]> wrote:
> this may be a similar problem to the one encountered by the UPV guys when
> running under cygwin
>
> the default Mac filesystem (HFS+) is case INSENSITIVE.
>        http://docs.info.apple.com/article.html?artnum=107863
> however, giza++ creates 2 files which have the same name but just different
> cases, eg
>    blah.a3.final
>    blah.A3.final
> 1 overwrites the other.
>
> you need to change the giza++ code, or run under a case-sensitive
> filesystem. ideally, it should be changed in the trunk giza++ code
>
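The collision described above can be checked without touching GIZA++ itself. The following is a minimal sketch, not part of Moses or GIZA++: `find_case_collisions` and `is_case_sensitive` are hypothetical helper names, and the probe file name is arbitrary. The first function groups file names that differ only in letter case (such as blah.a3.final and blah.A3.final); the second probes whether a given directory sits on a case-sensitive filesystem.

```python
import os
import tempfile
from collections import defaultdict


def find_case_collisions(names):
    """Group file names that differ only by letter case."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    # Keep only groups that contain more than one distinct spelling.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


def is_case_sensitive(directory):
    """Create a lowercase-named probe file and look it up in upper case."""
    with tempfile.TemporaryDirectory(dir=directory) as probe:
        with open(os.path.join(probe, "probe.a3"), "w"):
            pass
        # On a case-insensitive filesystem the upper-case name also resolves.
        return not os.path.exists(os.path.join(probe, "PROBE.A3"))


if __name__ == "__main__":
    print(find_case_collisions(["blah.a3.final", "blah.A3.final", "other.txt"]))
    print("case sensitive:", is_case_sensitive(tempfile.gettempdir()))
```

If `is_case_sensitive` reports False for the training directory, the two output files named above cannot coexist there.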
>
> -----Original Message-----
> From: Josh Schroeder [mailto:[EMAIL PROTECTED]
> Sent: 01 August 2008 16:56
> To: Hieu Hoang
> Subject: Fwd: [Moses-support] Moses: Prepare Data, Build Language Model and
> Train Model
>
>
>
> Begin forwarded message:
>
>> From: "Llio Humphreys" <[EMAIL PROTECTED]>
>> Date: 25 July 2008 10:00:00 BST
>> To: moses-support <[email protected]>
>> Subject: [Moses-support] Moses: Prepare Data, Build Language Model and
>> Train Model
>>
>> Please see message without attachment.  Thank you,  Llio Humphreys
>>
>> On Fri, Jul 25, 2008 at 9:50 AM, Llio Humphreys
>> <[EMAIL PROTECTED]> wrote:
>>> Dear Moses Group,
>>>
>>> I am having difficulties running the Moses software (not the recently
>>> released version), following the guidelines at
>>> http://www.statmt.org/wmt07/baseline.html and I attach a record of
>>> the final part of the terminal session for your information.
>>>
>>> I started with parallel input files, with each line containing one
>>> sentence, both already tokenised, tab delimited, and in ASCII (is
>>> UTF-8 better?)
>>>
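On the ASCII versus UTF-8 question: ASCII is a strict subset of UTF-8, so a pure ASCII file is already valid UTF-8, and the choice only matters once the text contains non-ASCII characters (Welsh circumflexed vowels, for example). A small sketch for checking what a file actually contains follows; `describe_encoding` is a hypothetical helper, not a Moses tool.

```python
def describe_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Neither ASCII nor valid UTF-8, e.g. a legacy 8-bit encoding.
        return "other"


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, describe_encoding(handle.read()))
```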
>>> I followed the instructions under the Prepare Data heading.  I
>>> briefly inspected the .tok output files, and preferred the original
>>> tokenised version e.g. reference numbers with / were not split up.
>>> So, I renamed the original input files as .tok files, filtered out
>>> long sentences and lowercased the training data.
>>>
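The filtering and lowercasing step described above can be pictured as follows. This is only a sketch of the general idea, not the behaviour of the Moses cleaning and lowercasing scripts; `clean_parallel` is a hypothetical name and the 40-token limit is an assumed value. The important property is that a sentence pair is kept or dropped as a unit, so the two sides stay line-aligned.

```python
def clean_parallel(src_lines, tgt_lines, max_tokens=40):
    """Lowercase sentence pairs; drop pairs where either side is empty or too long."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok, tgt_tok = src.split(), tgt.split()
        if not src_tok or not tgt_tok:
            continue  # an empty side would break the alignment
        if len(src_tok) > max_tokens or len(tgt_tok) > max_tokens:
            continue  # max_tokens=40 is an assumption, not a value from this thread
        kept.append((" ".join(src_tok).lower(), " ".join(tgt_tok).lower()))
    return kept


if __name__ == "__main__":
    pairs = clean_parallel(["Bore da ."], ["Good morning ."])
    print(pairs)
```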
>>> I then proceeded to the Language Model. The instructions seemed
>>> pretty much the same as for the Prepare Data section, so I moved the
>>> lowercased files from the corpus directory to the lm directory. Is
>>> this the right thing to do?
>>>
>>> I then trained the model and the system crashed with the following
>>> message:-
>>>
>>> Executing: bin/moses-scripts/scripts-20080125-1939/training/phrase-extract/extract ./model/aligned.0.en ./model/aligned.0.cy ./model/aligned.grow-diag-final-and ./model/extract.0-0 7 orientation
>>> PhraseExtract v1.3.0, written by Philipp Koehn
>>> phrase extraction from an aligned parallel corpus (also extracting orientation)
>>> Executing: cat ./model/extract.0-0.o.part* > ./model/extract.0-0.o
>>> cat: ./model/extract.0-0.o.part*: No such file or directory
>>> Exit code: 1
>>> Died at bin/moses-scripts/scripts-20080125-1939/training/train-factored-phrase-model.perl line 899.
>>>
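The cat error in the log means that no extract.0-0.o.part* files existed when the script tried to join them, which suggests the extraction step wrote nothing. One plausible cause (an assumption, not something the log proves) is a missing or empty alignment input. A quick diagnostic sketch using the file names from the log; `diagnose_extract` is a hypothetical helper and is not part of the Moses scripts.

```python
import glob
import os


def diagnose_extract(model_dir, src="en", tgt="cy"):
    """Return a list of problems with the phrase-extraction inputs and outputs."""
    problems = []
    inputs = [
        os.path.join(model_dir, f"aligned.0.{src}"),
        os.path.join(model_dir, f"aligned.0.{tgt}"),
        os.path.join(model_dir, "aligned.grow-diag-final-and"),
    ]
    for path in inputs:
        if not os.path.exists(path):
            problems.append(f"missing: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty: {path}")
    # The training script expects these part files before it runs cat.
    if not glob.glob(os.path.join(model_dir, "extract.0-0.o.part*")):
        problems.append("no extract.0-0.o.part* files were written")
    return problems


if __name__ == "__main__":
    for problem in diagnose_extract("./model"):
        print(problem)
```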
>>> So, my question is: am I giving Moses the wrong data to work with?
>>>
>>> In order to find out, I downloaded europarl from
>>> http://www.statmt.org/europarl/.  It contained version 2 rather than
>>> version 3 but I thought nevertheless that I might try using it.  I
>>> ran sentence-align-corpus.perl:
>>>
>>> ./sentence-align-corpus.perl en de
>>>
>>> but it exited with the following message:
>>>
>>> Died at ./sentence-align-corpus.perl line 16.
>>>
>>> sentence-align-corpus.perl line 16 says:
>>> die unless -e "$dir/$l1";
>>>
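In Perl, -e tests whether a path exists, so the quoted line aborts without any explanation when the path built from $dir and $l1 is absent. The values of those two variables are not shown in this thread; presumably $l1 is the first language argument (en here), but that is an assumption. The same check with a more informative message might look like this sketch, where `require_language_dir` is a hypothetical name:

```python
import os


def require_language_dir(base_dir, lang):
    """Return base_dir/lang, or raise with a message that names the missing path."""
    path = os.path.join(base_dir, lang)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"expected {path!r} to exist; check the working directory "
            f"and the layout of the unpacked corpus"
        )
    return path
```

Running the script from a directory where that path does not exist would be enough to trigger the bare "Died at" message.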
>>> Should I continue with europarl 2 or is it possible to download
>>> europarl 3 from somewhere?
>>>
>>> Alternatively would it be possible for you to explain the difference
>>> in purpose and format between wmt07/training/europarl-v3.fr-en.fr and
>>> wmt07/training/europarl-v3.en?  Just to clarify: am I correct in
>>> saying that the Prepare Data section is about training the
>>> translation model i.e. word and phrase alignments, and Language model
>>> section is about creating a language model for the language we're
>>> translating to?
>>> Does the Prepare Data section start with two plain text parallel
>>> corpora with sentences on each line or is it something more elaborate
>>> than that?  Maybe the wmt07/training/europarl-v3.fr-en.fr is a plain
>>> text file with French sentence 1 followed by English sentence 1
>>> followed by French sentence 2 followed by English sentence 2 etc?  I
>>> could then adapt the Welsh-English corpus I'm using accordingly.
>>>
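One way to test a corpus against the simpler of the two layouts asked about above: assume it is stored as two files, one per language, where line N of each file forms a translation pair. This is the layout Moses training scripts generally expect, though the thread itself does not confirm it. Under that assumption both files must have the same number of lines. `first_misalignment` is a hypothetical helper.

```python
from itertools import zip_longest


def first_misalignment(src_lines, tgt_lines):
    """Return the 1-based line number where one side runs out, or None if counts match."""
    for n, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            return n
    return None


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as a, open(sys.argv[2], encoding="utf-8") as b:
        print(first_misalignment(a, b))
```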
>>> Otherwise, is there a problem with the software/implementation on a
>>> Mac system? Would you recommend that I try the recently released
>>> version of Moses?  Is there some way to install the new version of
>>> Moses without uninstalling the other one (I'm wondering about
>>> environment variables)
>>>
>>> Thank you,
>>> Llio Humphreys
>>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
> --
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
>
>
