Re: [Moses-support] Incremental training

Sandipan Dandapat Wed, 19 Nov 2014 08:17:06 -0800

When I use your script, there is no problem. But when I modify the lines to
nid = len(fvcb)+2; no .vcb files are created in the new_corpus/ dir.
I used these two commands:


sh full_train.sh org.en org.fr
sh align_new.sh inc.en inc.fr org.en org.fr

Is the above right?

I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE, NEW_CORPUS_BASE)
hard-coded in the scripts.


On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:

> Cannot open file???
> Does the file exist??
> Are you passing the path properly?
>
>
> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]>
> wrote:
>
>> Hi,
>> I made the changes based on your suggestions; it is now generating a
>> different error, shown below:
>>
>>
>> reading vocabulary files
>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>
>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>
>> I am attaching the working dir and the .py scripts herewith. The 10
>> parallel sentences for incremental alignment are in inc_data/, whereas
>> the original 500 sentences are in the mtdata/ directory.
>>
>> Thanks a ton for your help.
>>
>> Regards,
>> sandipan
>>
>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I am pretty sure that my script does not generate duplicate token IDs.
>>>
>>> In fact, I used to get the same error till I modified the script.
>>>
>>> In case you do want to avoid this error and not use my script then:
>>>
>>> 1. Open the original Python script: plain2snt-hasvcb.py
>>> 2. There is a line which sets the next id to the vocabulary size plus 1
>>> (the line is nid = len(fvcb)+1;)
>>> 3. Change this line to: nid = len(fvcb)+2; (This is because the token ids
>>> start from 2, so if you have 23 tokens the ids go from 2 to 24. The
>>> original script computes nid = 23 + 1 = 24, which is already taken; the
>>> modification correctly gives 25.) The same change is needed in a second
>>> place: nid = len(evcb)+2;
>>>
>>> Do this and it will work.
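
The ID arithmetic in step 3 can be checked with a minimal sketch. It assumes, as in the 23-token example, that token IDs start at 2. The helper names build_vocab, next_id_len_based and next_id_safe are hypothetical and are not taken from plain2snt-hasvcb.py; only the expressions len(vcb)+1 and len(vcb)+2 come from the message.

```python
def build_vocab(tokens, first_id=2):
    """Assign consecutive IDs to tokens, starting at first_id (assumed to be 2)."""
    return {tok: first_id + i for i, tok in enumerate(tokens)}


def next_id_len_based(vocab, offset):
    """Next ID computed as len(vocab) + offset, as in the lines discussed above."""
    return len(vocab) + offset


def next_id_safe(vocab, first_id=2):
    """Alternative that ignores the starting ID: one past the largest ID in use."""
    return max(vocab.values()) + 1 if vocab else first_id


# 23 tokens receive the IDs 2..24.
vocab = build_vocab(["tok%d" % i for i in range(23)])
used = set(vocab.values())

collides_plus_one = next_id_len_based(vocab, 1) in used   # 24 is already taken
collides_plus_two = next_id_len_based(vocab, 2) in used   # 25 is free

print("len+1 ->", next_id_len_based(vocab, 1), "collides:", collides_plus_one)
print("len+2 ->", next_id_len_based(vocab, 2), "collides:", collides_plus_two)
print("max+1 ->", next_id_safe(vocab))
```

Taking one past the largest existing ID, as in next_id_safe, would avoid depending on where the numbering starts, but that is a design suggestion and not something the original script is known to do.
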
>>>
>>> In any case... send me a zip file of your working directory (if it's
>>> small.... you are testing it on small data, right?). I will see what the
>>> problem is.
>>>
>>>
>>>
>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>>> [email protected]> wrote:
>>>
>>>> Dear Raj,
>>>> I also tried to use your scripts for incremental alignment. I copied
>>>> your Python script into the desired directory, but I am still receiving
>>>> the same error as posted by Ihab:
>>>> reading vocabulary files
>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>> 24 roi 2
>>>> TOKEN ID 24 has already been assigned to: roi
>>>>
>>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>>
>>>> I took only 10 sentences for the incremental align_new.sh run, which
>>>> generated the error, and I found 8054 lines in
>>>> new_corpus/new_tgt_file.vcb.
>>>> Is there any problem? Could you please help me with this?
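
One way to see which IDs collide before handing the file to MGIZA is to scan the .vcb file for repeated IDs. This is a minimal sketch, assuming each line has the form "<id> <token> <count>" as in the "24 roi 2" line from the error above. The function name find_duplicate_ids is hypothetical and is not part of MGIZA or of the scripts discussed in this thread.

```python
import sys


def find_duplicate_ids(lines):
    """Return (line_number, token_id, token, earlier_token) for each reused ID.

    Assumes every non-empty line looks like "<id> <token> <count>".
    """
    seen = {}        # token_id -> token that first used it
    duplicates = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        try:
            token_id = int(fields[0])
        except ValueError:
            continue  # skip lines whose first field is not a number
        token = fields[1]
        if token_id in seen:
            duplicates.append((line_number, token_id, token, seen[token_id]))
        else:
            seen[token_id] = token
    return duplicates


# Only runs when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        for line_number, token_id, token, earlier in find_duplicate_ids(handle):
            print(f"line {line_number}: ID {token_id} ({token}) already used by {earlier}")
```

Saved under a name of your choice, it could be pointed at the file named in the error, for example new_corpus/inc.fr.vcb.
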
>>>>
>>>> Thanks and regards,
>>>> sandipan
>>>>
>>>>
>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>
>>>>> Dear Ihab.
>>>>> There is a Python script in the Google Drive folder linked in
>>>>> the first mail I sent you.
>>>>> Please replace the existing file with my copy.
>>>>>
>>>>> It has to work.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -------- Original message --------
>>>>> From: Ihab Ramadan <[email protected]>
>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>> To: 'Raj Dabre' <[email protected]>
>>>>> Cc: [email protected]
>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> Your point is clear and I tried to follow the steps you mentioned, but I
>>>>> am now stuck at the align_new.sh script, which gives me this error:
>>>>>
>>>>> reading vocabulary files
>>>>>
>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>
>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>
>>>>> 29107 q-1 4
>>>>>
>>>>> Do you have any idea what this error means?
>>>>>
>>>>>
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>>
>>>>>
>>>>> Dear Ihab,
>>>>>
>>>>> Perhaps I should have mentioned much more clearly what my script does.
>>>>> Sorry for that.
>>>>>
>>>>> Let me start with this: There is no direct/easy way to generate the
>>>>> moses.ini file as you need.
>>>>>
>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>> trained an SMT system on them. This naturally gives the phrase table,
>>>>> reordering table and moses.ini.
>>>>>
>>>>> 2. Suppose you then get 500k more lines of parallel corpora. There are
>>>>> 2 ways to proceed:
>>>>>
>>>>>     a. Retrain on all 2.5 million lines from scratch (will take lots of
>>>>> time: ~ 2-3 days on a regular machine)
>>>>>
>>>>>     b. Train on only the 500k new lines using the alignment
>>>>> information of the original training data. (Faster: ~ 6-7 hours).
>>>>>
>>>>>
>>>>>
>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>> TABLES.*
>>>>>
>>>>> 1. full_train.sh -------------- This trains on the original corpus of
>>>>> 2 million lines. (Generates alignment files only for the original corpus.)
>>>>>
>>>>> 2. align_new.sh -------------- This trains on the new corpus of 500k
>>>>> lines. (Generates alignment files only for the new corpus, using the
>>>>> alignments from 1.)
>>>>>
>>>>>
>>>>>
>>>>> *Why this split ????* Because the basic training step of Moses does
>>>>> not preserve the alignment probability information. Only the alignments
>>>>> are saved. To continue training we need the probability information.
>>>>>
>>>>> You can pass a flag to Moses to preserve this information (the flag
>>>>> is --giza-option). If you do this, you will not need
>>>>> full_train.sh, but you will have to change the config files before using
>>>>> align_new.sh.
>>>>>
>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>
>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to
>>>>> the forward (fwd) alignments generated by full_train.sh.
>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to
>>>>> the inverse (inv) alignments generated by full_train.sh.
>>>>>
>>>>> 3. Run the moses training script with additional flags:
>>>>>
>>>>>    - --first-step: first step in the training process (default 1).
>>>>>    This will be 4.
>>>>>    - --last-step: last step in the training process (default 7).
>>>>>    This will remain 7.
>>>>>    - --giza-f2e: <path to folder>/new_giza.fwd
>>>>>    - --giza-e2f: <path to folder>/new_giza.inv
>>>>>
>>>>> For example:
>>>>>
>>>>> ~/mosesdecoder/scripts/training/train-model.perl \
>>>>>  -root-dir <your training directory> \
>>>>>  -corpus <your new corpus name> \
>>>>>  -f <src> -e <tgt> -alignment grow-diag-final-and \
>>>>>  -reordering msd-bidirectional-fe \
>>>>>  -lm 0:3:<path to LM>:8 \
>>>>>  --first-step 4 --last-step 7 \
>>>>>  --giza-f2e <path to folder>/new_giza.fwd \
>>>>>  --giza-e2f <path to folder>/new_giza.inv \
>>>>>  -external-bin-dir <path to giza++ binaries>
>>>>>
>>>>> For more details on the training step read this:
>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>
>>>>> This assumes that you already have the alignments; it continues with
>>>>> phrase extraction and reordering, and generates the new moses.ini file.
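
Steps 1 and 2 of the phrase-table update above (appending the new fwd and inv alignments to the old ones) amount to plain file concatenation. A minimal sketch follows, assuming the alignment files are ordinary text files. The function name append_alignments and the input file names in the comment are hypothetical; only new_giza.fwd and new_giza.inv appear in the message.

```python
def append_alignments(old_path, new_path, out_path, chunk_size=1 << 20):
    """Write the contents of old_path followed by new_path to out_path.

    A newline is added after an input file that does not end with one,
    so the last line of the old file cannot merge with the first line
    of the new file.
    """
    with open(out_path, "wb") as out:
        for path in (old_path, new_path):
            last_byte = b"\n"  # an empty file needs no extra newline
            with open(path, "rb") as src:
                # Copy in chunks so large alignment files are not held in memory.
                for chunk in iter(lambda: src.read(chunk_size), b""):
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")


# Hypothetical usage, with placeholder names for the two inputs:
#   append_alignments("old_giza.fwd", "inc_giza.fwd", "new_giza.fwd")
#   append_alignments("old_giza.inv", "inc_giza.inv", "new_giza.inv")
```

Presumably the corpus given to -corpus then has to list the old sentences followed by the new ones, in the same order as the appended alignments.
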
>>>>>
>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>
>>>>>
>>>>>
>>>>> If you are still unclear then please ask and I will try to help you as
>>>>> much as I can.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> That’s great work, my friend.
>>>>>
>>>>> These files make the script work, but it takes a long time to finish, and
>>>>> it did not generate the model folder which contains the moses.ini file.
>>>>>
>>>>> Is this normal?
>>>>>
>>>>> I am now trying to run it again, as I suspect that the server was shut
>>>>> down before the training completed, but I notice that it starts from
>>>>> the beginning and does not use the existing generated files.
>>>>>
>>>>> Thanks, Raj, it is still great work.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>
>>>>>
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>>
>>>>>
>>>>> Ahh.... i totally forgot that part.
>>>>>
>>>>> Sorry.
>>>>>
>>>>> PFA.
>>>>>
>>>>> Just place them in the folder where the shell scripts full_train.sh
>>>>> and align_new.sh are.
>>>>>
>>>>> Hopefully it should run now.
>>>>>
>>>>> Please let me know if you succeed.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> It is a great solution.
>>>>>
>>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>>> training.
>>>>>
>>>>> I followed the steps you mentioned, but I faced this error when
>>>>> running the full_train.sh script:
>>>>>
>>>>>
>>>>>
>>>>> bla bla bla
>>>>> ....
>>>>>
>>>>>
>>>>>
>>>>> Starting MGIZA
>>>>>
>>>>> Initializing Global Paras
>>>>>
>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>
>>>>> ERROR:  Cannot open configuration file configgiza.fwd!
>>>>>
>>>>> Starting MGIZA
>>>>>
>>>>> Initializing Global Paras
>>>>>
>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>
>>>>> ERROR:  Cannot open configuration file configgiza.rev!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> These two files do not exist.
>>>>>
>>>>> Should they be generated by the installation?
>>>>>
>>>>> How do I get them?
>>>>>
>>>>>
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>>
>>>>>
>>>>> Hello Ihab,
>>>>>
>>>>> I would suggest using mgiza++.
>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>
>>>>> It is very easy to use.
>>>>>
>>>>> I also wrote some scripts to make it easy for training.
>>>>> Visit the link below for my scripts.
>>>>>
>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>
>>>>> Usage:
>>>>>
>>>>> To train basic IBM models:
>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>>>>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>
>>>>> To align 2 new files using previously trained models (aka continue
>>>>> training).
>>>>>
>>>>> bash align_new.sh <new_src_corpus_file_name>
>>>>> <new_tgt_corpus_file_name> <old_src_corpus_file_name>
>>>>> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base>
>>>>> <path_to_mgizapp_installation>
>>>>>
>>>>> There is also a python script which you had better replace in the
>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I just need clear steps on how to do incremental training in Moses,
>>>>> as the illustration in the manual is not clear enough.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> Best Regards
>>>>>
>>>>> *Ihab Ramadan* | Senior Developer | Saudisoft
>>>>> <http://www.saudisoft.com/> - Egypt | *Tel* +2 02 330 320 37 Ext- 0
>>>>> | Mob +201007570826 | Fax +20233032036
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Raj Dabre.
>>>>> Research Student,
>>>>>
>>>>> Graduate School of Informatics,
>>>>> Kyoto University.
>>>>>
>>>>> CSE MTech, IITB., 2011-2014
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>

