Re: [Moses-support] Incremental training

Raj Dabre Wed, 19 Nov 2014 07:21:13 -0800

Hey,

I am pretty sure that my script does not generate duplicate token ids.


In fact, I used to get the same error till I modified the script.

If you want to avoid this error without using my script:

1. Open the original Python script: plain2snt-hasvcb.py
2. There is a line that increments the id counter by 1 (the line is
nid = len(fvcb)+1;)
3. Change this line to: nid = len(fvcb)+2; (This is because the id numbering
starts from 1, so if you have 23 tokens the ids run from 2 to 24. The
original update script computes nid = 23 + 1 = 24, which is already taken;
the modification correctly gives 25.) The same change is needed in a second
place, which becomes: nid = len(evcb)+2;

Do this and it will work.
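
A minimal sketch of the arithmetic in step 3, not the actual code of plain2snt-hasvcb.py. It assumes each .vcb line has the form "id token count" (as in the "24 roi 2" line from the error) and that fvcb maps tokens to ids. The helper names read_vcb and next_free_id are made up for illustration.

```python
def read_vcb(lines):
    """Parse 'id token count' lines into a token -> id dict."""
    vcb = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token_id, token, _count = parts
        vcb[token] = int(token_id)
    return vcb


def next_free_id(vcb):
    """Return an id larger than every id already in use."""
    # Assumption: an empty vocabulary starts handing out ids at 2.
    return max(vcb.values(), default=1) + 1


# 23 made-up tokens with ids 2..24, matching the example in step 3.
old_lines = ["%d tok%d 1" % (i, i) for i in range(2, 25)]
fvcb = read_vcb(old_lines)

print(len(fvcb) + 1)       # the original formula
print(len(fvcb) + 2)       # the modified formula
print(next_free_id(fvcb))  # largest existing id plus one
```

Taking the maximum existing id plus one does not depend on where the numbering starts, which is why it may be a safer choice than a fixed offset if you are editing the script anyway.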

In any case, send me a zip file of your working directory (if it's small;
you are testing on small data, right?). I will see what the problem is.
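
Before sending it, a quick check along these lines may show which ids collide in the .vcb file. This is only a sketch: it assumes the three-column "id token count" layout seen in the error message, find_duplicate_ids is a hypothetical helper and not part of MGIZA++, and the sample lines are made up.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Return {id: [tokens]} for every id that appears on more than one line."""
    tokens_by_id = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token_id, token, _count = parts
        tokens_by_id[int(token_id)].append(token)
    return {i: toks for i, toks in tokens_by_id.items() if len(toks) > 1}


# Made-up sample in which id 24 is assigned twice.
sample = ["23 le 5", "24 reine 1", "24 roi 2"]
duplicates = find_duplicate_ids(sample)
print(duplicates)
```

To run it on a real file, pass the open file object instead of the sample list, for example the new_corpus/inc.fr.vcb file named in the error output.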



On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
[email protected]> wrote:

> Dear Raj,
> I also tried to use your scripts for incremental alignment. I copied your
> Python script into the desired directory, but I am still receiving the same
> error as posted by Ihab.
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 24 roi 2
> TOKEN ID 24 has already been assigned to: roi
>
> I took only 500 sentence pairs for full_train.sh and it worked fine, with
> 758 lines in the corpus/tgt_filename.vcb file.
>
> I took only 10 sentences for incremental alignment_new.sh, which generated
> the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
> Is there any problem? Could you please help me with this?
>
> Thanks and regards,
> sandipan
>
>
> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>
>> Dear Ihab,
>> There is a Python script in the Google Drive folder from the first mail I
>> sent you.
>> Please replace the existing file with my copy.
>>
>> It has to work.
>>
>> Regards.
>>
>>
>> -------- Original message --------
>> From: Ihab Ramadan <[email protected]>
>> Date: 05/11/2014 00:54 (GMT+09:00)
>> To: 'Raj Dabre' <[email protected]>
>> Cc: [email protected]
>> Subject: RE: [Moses-support] Incremental training
>>
>>
>> Dear Raj,
>>
>> Your point is clear and I tried to follow the steps you mentioned, but I am
>> now stuck at the align_new.sh script, which gives me this error:
>>
>> reading vocabulary files
>>
>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>
>> ERROR: TOKEN ID must be unique for each token, in line :
>>
>> 29107 q-1 4
>>
>> Do you have any idea what this error means?
>>
>>
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Dear Ihab,
>>
>> Perhaps I should have mentioned much more clearly what my script does.
>> Sorry for that.
>>
>> Let me start with this: There is no direct/easy way to generate the
>> moses.ini file as you need.
>>
>> 1. Suppose you have 2 million lines of parallel corpora and you trained an
>> SMT system on them. This naturally gives the phrase table, reordering table
>> and moses.ini.
>>
>> 2. Suppose you then get 500k more lines of parallel corpora. There are 2
>> ways:
>>
>>     a. Retrain on all 2.5 million lines from scratch (will take lots of
>> time: ~2-3 days on a regular machine)
>>
>>     b. Train on only the 500k new lines using the alignment information
>> of the original training data. (Faster: ~6-7 hours).
>>
>>
>>
>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>> TABLES.*
>>
>> 1. full_train.sh -------------- This trains on the original corpus of 2
>> million lines. (Generate alignment files only for the original corpus)
>>
>> 2. align_new.sh -------------- This trains on the new corpus of 500 k
>> lines. (Generate alignment files only for the new corpus using the
>> alignments for 1)
>>
>>
>>
>> *Why this split?* Because the basic training step of Moses does not
>> preserve the alignment probability information. Only the alignments are
>> saved. To continue training we need the probability information.
>>
>> You can pass a flag to Moses to preserve this information (the flag is
>> --giza-option). If you do this, you will not need full_train.sh, but
>> you will have to change the config files before using align_new.sh.
>>
>> *HOW TO GET UPDATED PHRASE TABLE:*
>>
>> 1. Append the forward alignments (fwd) generated by align_new.sh to the
>> forward (fwd) alignments generated by full_train.sh.
>> 2. Append the inverse alignments (inv) generated by align_new.sh to the
>> inverse (inv) alignments generated by full_train.sh.
>>
>> 3. Run the moses training script with additional flags:
>>
>>    - --first-step: first step in the training process (default 1). This
>>    will be 4.
>>    - --last-step: last step in the training process (default 7). This
>>    will remain 7.
>>    - --giza-f2e: <path to folder>/new_giza.fwd
>>    - --giza-e2f: <path to folder>/new_giza.inv
>>
>> For example:
>>
>> ~/mosesdecoder/scripts/training/train-model.perl \
>>  -root-dir <your training directory> \
>>  -corpus <your new corpus name> \
>>  -f <src> -e <tgt> -alignment grow-diag-final-and \
>>  -reordering msd-bidirectional-fe \
>>  -lm 0:3:<path to LM>:8 \
>>  --first-step 4 --last-step 7 \
>>  --giza-f2e <path to folder>/new_giza.fwd \
>>  --giza-e2f <path to folder>/new_giza.inv \
>>  -external-bin-dir <path to giza++ binaries>
>>
>> For more details on the training step read this:
>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>
>> This assumes that you already have the alignments, and continues with
>> phrase extraction and reordering, then generates the new moses.ini file.
>>
>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>
>>
>>
>> If you are still unclear then please ask and I will try to help you as
>> much as I can.
>>
>> Regards.
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]>
>> wrote:
>>
>> Dear Raj,
>>
>> That's great work, my friend.
>>
>> These files make the script work, but it takes a long time to finish, and
>> it did not generate the model folder which contains the moses.ini file.
>>
>> Is this normal?
>>
>> I am now trying to run it again, as I suspect that the server was shut down
>> before the training was completed, but I notice that it starts from the
>> beginning and does not use the existing files it generated.
>>
>> Thanks Raj, it is still great work.
>>
>>
>>
>>
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>
>>
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Ahh... I totally forgot that part.
>>
>> Sorry.
>>
>> PFA.
>>
>> Just place them in the folder where the shell scripts full_train.sh and
>> align_new.sh are.
>>
>> Hopefully it should run now.
>>
>> Please let me know if you succeed.
>>
>>
>>
>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]>
>> wrote:
>>
>> Dear Raj,
>>
>> It is a great solution.
>>
>> I installed MGIZA++ successfully and I am using your scripts to run
>> training.
>>
>> I followed the steps you mentioned, but I got this error when running
>> the full_train.sh script:
>>
>>
>>
>> bla bla  bla
>>
>> .
>>
>> .
>>
>> .
>>
>> .
>>
>>
>>
>> Starting MGIZA
>>
>> Initializing Global Paras
>>
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>
>> ERROR:  Cannot open configuration file configgiza.fwd!
>>
>> Starting MGIZA
>>
>> Initializing Global Paras
>>
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>
>> ERROR:  Cannot open configuration file configgiza.rev!
>>
>>
>>
>>
>>
>> These two files do not exist.
>>
>> Should they be generated by the installation?
>>
>> How do I get them?
>>
>>
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Sunday, October 26, 2014 6:21 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>>
>>
>> Hello Ihab,
>>
>> I would suggest using mgiza++.
>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>
>> It is very easy to use.
>>
>> I also wrote some scripts to make it easy for training.
>> Visit the link below for my scripts.
>>
>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>
>> Usage:
>>
>> To train basic IBM models:
>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> To align 2 new files using previously trained models (aka continue
>> training).
>>
>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name>
>> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base>
>> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> There is also a Python script which you should use to replace the one in
>> the scripts folder of mgiza++. I have modified it to work with my scripts.
>>
>> Hope this helps.
>>
>>
>>
>>
>>
>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]>
>> wrote:
>>
>> Dear All,
>>
>> I just need clear steps on how to do incremental training in Moses, as
>> the explanation in the manual is not clear enough.
>>
>> Thanks
>>
>>
>>
>> Best Regards
>>
>> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
>> +20233032036
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>>
>> --
>>
>> Raj Dabre.
>> Research Student,
>>
>> Graduate School of Informatics,
>> Kyoto University.
>>
>> CSE MTech, IITB., 2011-2014
>


-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
