Re: [Moses-support] Incremental training

Raj Dabre Thu, 20 Nov 2014 01:28:38 -0800

Hey,

I just remembered that I have a pathetic memory.
I forgot to add the lines for sorting the .vcb file in increasing order of
id.


Just add the following lines to align_new.sh, right after this line:

$MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1 $2_$1.snt $1_$2.snt $2.vcb $1.vcb

sort -n $1.vcb > tmp
mv tmp $1.vcb
sort -n $2.vcb > tmp
mv tmp $2.vcb

And it will run perfectly. I am sure of it. I used your folder just to be
sure. It works.
Sorry for my silliness. Lemme know if it works now.
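
As a cross-check, here is a minimal Python sketch of what those sort lines are meant to achieve. It assumes each .vcb line has the form "id token count" (as in the "24 roi 2" line from the error message). The function name sort_vcb_lines is only an illustration and is not part of MGIZA. It also raises if the same id shows up twice, which is the condition the "TOKEN ID must be unique" error complains about.

```python
def sort_vcb_lines(lines):
    """Return .vcb lines sorted by integer token id.

    Assumes each non-empty line looks like "id token count".
    Raises ValueError if the same id is assigned to two lines.
    """
    seen = {}      # token id -> token already holding that id
    entries = []   # (token id, original line) pairs
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        token_id = int(fields[0])
        if token_id in seen:
            raise ValueError(
                f"token id {token_id} has already been assigned to: {seen[token_id]}"
            )
        seen[token_id] = fields[1]
        entries.append((token_id, line))
    # Numeric sort on the id column, like sort -n on these files
    entries.sort(key=lambda entry: entry[0])
    return [line for _, line in entries]
```

You could run it over the lines of the new .vcb file before starting the alignment, just to confirm the ids are unique and in increasing order.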

Regards.

On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre <[email protected]> wrote:

> Well then, your paths must be wrong.
> I can't see why the files are not being generated.
> I'll look into it tomorrow and let you know.
>
>
> On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]>
> wrote:
>
>> When I am using your script, it has no problem. But when I modified the
>> lines to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir.
>> I used these two commands:
>>
>> sh full_train.sh org.en org.fr
>> sh align_new.sh inc.en inc.fr org.en org.fr
>>
>> Is the above right?
>>
>> I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE,
>> NEW_CORPUS_BASE) hard-coded in the scripts.
>>
>>
>> On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:
>>
>>> Cannot open file???
>>> Does the file exist??
>>> Are you passing the path properly?
>>>
>>>
>>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>> I made the changes based on your suggestions; it's now generating a
>>>> different error, as below:
>>>>
>>>>
>>>> reading vocabulary files
>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>
>>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>>
>>>> I am attaching the working dir and the .py scripts herewith. The 10
>>>> parallel sentences for incremental alignment are in inc_data/, whereas
>>>> the original 500 sentences are in the mtdata/ directory.
>>>>
>>>> Thanks a ton for your help.
>>>>
>>>> Regards,
>>>> sandipan
>>>>
>>>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> I am pretty sure that my script does not generate duplicate token ids.
>>>>>
>>>>> In fact, I used to get the same error till I modified the script.
>>>>>
>>>>> In case you do want to avoid this error and not use my script then:
>>>>>
>>>>> 1. Open the original python script: plain2snt-hasvcb.py
>>>>> 2. There is a line which sets the next id to the vocabulary size plus 1
>>>>> (the line is nid = len(fvcb)+1;)
>>>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id numbering
>>>>> starts from 1, and thus if you have 23 tokens then the ids will go from
>>>>> 2 to 24. The original update script will do: nid = 23 + 1 = 24, which
>>>>> collides with an existing id, and the modification will give 25
>>>>> correctly.) The same change is needed in 2 places; the other is
>>>>> nid = len(evcb)+2;
>>>>> Do this and it will work.
>>>>>
>>>>> In any case, send me a zip file of your working directory (if it's
>>>>> small; you are testing it on small data, right?). I will see what the
>>>>> problem is.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Dear Raj,
>>>>>> I also tried to use your scripts for incremental alignment. I copied
>>>>>> your Python script into the desired directory, but I am still receiving
>>>>>> the same error as posted by Ihab:
>>>>>> reading vocabulary files
>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>> 24 roi 2
>>>>>> TOKEN ID 24 has already been assigned to: roi
>>>>>>
>>>>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>>>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>>>>
>>>>>> I took only 10 sentences for incremental alignment with align_new.sh,
>>>>>> which generated the error, and I found 8054 lines in
>>>>>> new_corpus/new_tgt_file.vcb.
>>>>>> Is there any problem? Can you please help me with this?
>>>>>>
>>>>>> Thanks and regards,
>>>>>> sandipan
>>>>>>
>>>>>>
>>>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Ihab.
>>>>>>> There is a python script that was there in the google drive folder
>>>>>>> in the first mail I sent you.
>>>>>>> Please replace the existing file with my copy.
>>>>>>>
>>>>>>> It has to work.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: Ihab Ramadan <[email protected]>
>>>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>>>> To: 'Raj Dabre' <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>>>
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> Your point is clear and I tried to follow the steps you mentioned, but
>>>>>>> I am now stuck at the align_new.sh script, which gives me this error:
>>>>>>>
>>>>>>> reading vocabulary files
>>>>>>>
>>>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>>>
>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>>
>>>>>>> 29107 q-1 4
>>>>>>>
>>>>>>> Do you have any idea what this error means?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dear Ihab,
>>>>>>>
>>>>>>> Perhaps I should have mentioned much more clearly what my script
>>>>>>> does. Sorry for that.
>>>>>>>
>>>>>>> Let me start with this: There is no direct/easy way to generate the
>>>>>>> moses.ini file as you need.
>>>>>>>
>>>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>>>> trained a SMT system for it. This naturally gives the phrase table,
>>>>>>> reordering table and moses.ini.
>>>>>>>
>>>>>>> 2. Suppose you got 500 k more lines of parallel corpora.... there
>>>>>>> are 2 ways:
>>>>>>>
>>>>>>>     a. Retrain 2.5 million lines from scratch (will take lots of
>>>>>>> time: ~ 2-3 days on a regular machine)
>>>>>>>
>>>>>>>     b. Train on only the 500k new lines using the alignment
>>>>>>> information of the original training data. (Faster: ~ 6-7 hours).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>>>> TABLES.*
>>>>>>>
>>>>>>> 1. full_train.sh -------------- This trains on the original corpus
>>>>>>> of 2 million lines. (Generates alignment files only for the original
>>>>>>> corpus.)
>>>>>>>
>>>>>>> 2. align_new.sh -------------- This trains on the new corpus of 500
>>>>>>> k lines. (Generates alignment files only for the new corpus, using the
>>>>>>> alignments from 1.)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Why this split?* Because the basic training step of Moses does
>>>>>>> not preserve the alignment probability information. Only the
>>>>>>> alignments are saved. To continue training we need the probability
>>>>>>> information.
>>>>>>>
>>>>>>> You can pass flags to Moses to preserve this information (the
>>>>>>> flag is --giza-option). If you do this then you will not need
>>>>>>> full_train.sh, but you will have to change the config files before
>>>>>>> using align_new.sh.
>>>>>>>
>>>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>>>
>>>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to
>>>>>>> the forward (fwd) alignments generated by full_train.sh.
>>>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to
>>>>>>> the inverse (inv) alignments generated by full_train.sh.
>>>>>>>
>>>>>>> 3. Run the moses training script with additional flags:
>>>>>>>
>>>>>>>    - --first-step -- first step in the training process (default
>>>>>>>    1)--------------- This will be 4
>>>>>>>    - --last-step -- last step in the training process (default
>>>>>>>    7)------------ This will remain 7
>>>>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>>>
>>>>>>> For example:
>>>>>>>
>>>>>>> ~/mosesdecoder/scripts/training/train-model.perl \
>>>>>>>  -root-dir <your training directory> \
>>>>>>>  -corpus <your new corpus name> \
>>>>>>>  -f <src> -e <tgt> \
>>>>>>>  -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>>>>>  -lm 0:3:<path to LM>:8 \
>>>>>>>  --first-step 4 --last-step 7 \
>>>>>>>  --giza-f2e <path to folder>/new_giza.fwd \
>>>>>>>  --giza-e2f <path to folder>/new_giza.inv \
>>>>>>>  -external-bin-dir <path to giza++ binaries>
>>>>>>>
>>>>>>> For more details on the training step read this:
>>>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>>>
>>>>>>> This assumes that you already have the alignments, and continues with
>>>>>>> phrase extraction and reordering to generate the new moses.ini file.
>>>>>>>
>>>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> If you are still unclear then please ask and I will try to help you
>>>>>>> as much as I can.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> That’s great work, my friend.
>>>>>>>
>>>>>>> These files make the script work, but it takes a long time to finish,
>>>>>>> and it did not generate the model folder which contains the moses.ini
>>>>>>> file.
>>>>>>>
>>>>>>> Is this normal?
>>>>>>>
>>>>>>> I am now trying to run it again, as I suspect that the server was shut
>>>>>>> down before the training was completed, but I notice that it starts
>>>>>>> from the beginning and does not use the existing generated files.
>>>>>>>
>>>>>>> Thanks Raj, it is still great work.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>>>
>>>>>>>
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ahh.... i totally forgot that part.
>>>>>>>
>>>>>>> Sorry.
>>>>>>>
>>>>>>> PFA.
>>>>>>>
>>>>>>> Just place them in the folder where the shell scripts full_train.sh
>>>>>>> and align_new.sh are.
>>>>>>>
>>>>>>> Hopefully it should run now.
>>>>>>>
>>>>>>> Please let me know if you succeed.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> It is a great solution
>>>>>>>
>>>>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>>>>> training
>>>>>>>
>>>>>>> And I followed the steps you mentioned, but I faced this error when I
>>>>>>> was running the full_train.sh script:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> bla bla  bla
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Starting MGIZA
>>>>>>>
>>>>>>> Initializing Global Paras
>>>>>>>
>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>
>>>>>>> ERROR:  Cannot open configuration file configgiza.fwd!
>>>>>>>
>>>>>>> Starting MGIZA
>>>>>>>
>>>>>>> Initializing Global Paras
>>>>>>>
>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>
>>>>>>> ERROR:  Cannot open configuration file configgiza.rev!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> These two files do not exist.
>>>>>>>
>>>>>>> Should they be generated by the installation?
>>>>>>>
>>>>>>> How do I get them?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello Ihab,
>>>>>>>
>>>>>>> I would suggest using mgiza++:
>>>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>>>
>>>>>>> It is very easy to use.
>>>>>>>
>>>>>>> I also wrote some scripts to make it easy for training.
>>>>>>> Visit the link below for my scripts.
>>>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>>>
>>>>>>> Usage:
>>>>>>>
>>>>>>> To train basic IBM models:
>>>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>>>>>>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>>
>>>>>>> To align 2 new files using previously trained models (aka continue
>>>>>>> training).
>>>>>>>
>>>>>>> bash align_new.sh <new_src_corpus_file_name>
>>>>>>> <new_tgt_corpus_file_name> <old_src_corpus_file_name>
>>>>>>> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base>
>>>>>>> <path_to_mgizapp_installation>
>>>>>>>
>>>>>>> There is also a python script which you had better replace in the
>>>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I just need clear steps on how to do incremental training in
>>>>>>> Moses, as the illustration in the manual is not clear enough.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> *Ihab Ramadan*| Senior Developer| Saudisoft
>>>>>>> <http://www.saudisoft.com/> - Egypt
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Raj Dabre.
>>>>>>> Research Student,
>>>>>>>
>>>>>>> Graduate School of Informatics,
>>>>>>> Kyoto University.
>>>>>>>
>>>>>>> CSE MTech, IITB., 2011-2014
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Raj Dabre.
>>>>> Research Student,
>>>>> Graduate School of Informatics,
>>>>> Kyoto University.
>>>>> CSE MTech, IITB., 2011-2014
>>>>>
>>>>>
>>>>
>>


-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014

