Hey, I just remembered that I have a pathetic memory. I forgot to add the lines that sort the .vcb files in increasing order of id.
Just add the following lines to align_new.sh, right after this line:

    $MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1 $2_$1.snt $1_$2.snt $2.vcb $1.vcb

Lines to add:

    sort -n $1.vcb > tmp
    mv tmp $1.vcb
    sort -n $2.vcb > tmp
    mv tmp $2.vcb

And it will run perfectly. I am sure of it: I used your folder just to be sure. It works.
Sorry for my silliness.
Let me know if it works now.

Regards.

On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre <[email protected]> wrote:

> Well then your paths must be wrong.
> I can't see why the files are not being generated.
> I'll look into it tomorrow and let you know.
>
> On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:
>
>> When I am using your script then it has no problem. But when I modified the
>> lines to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir.
>> I used these two commands:
>>
>> sh full_train.sh org.en org.fr
>> sh align_new.sh inc.en inc.fr org.en org.fr
>>
>> Is the above right?
>>
>> I have kept the paths (MGIZA, MODEL_BASE, CORPUS_BASE and
>> NEW_CORPUS_BASE) hard-coded in the scripts.
>>
>> On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:
>>
>>> Cannot open file???
>>> Does the file exist??
>>> Are you passing the path properly?
>>>
>>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:
>>>
>>>> Hi,
>>>> I made the changes based on your suggestions; it is now generating a
>>>> different error, as below:
>>>>
>>>> reading vocabulary files
>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>
>>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>>
>>>> I am attaching the working dir and the .py scripts herewith. The
>>>> 10 parallel sentences for incremental alignment are in inc_data/, whereas
>>>> the original 500 sentences are in the mtdata/ directory.
>>>>
>>>> Thanks a ton for your help.
>>>>
>>>> Regards,
>>>> sandipan
>>>>
>>>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> I am pretty sure that my script does not generate duplicate token ids.
>>>>>
>>>>> In fact, I used to get the same error till I modified the script.
>>>>>
>>>>> In case you do want to avoid this error and not use my script then:
>>>>>
>>>>> 1. Open the original python script: plain2snt-hasvcb.py
>>>>> 2. There is a line which increments the id counter by 1 (the line is
>>>>>    nid = len(fvcb)+1;)
>>>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id numbering
>>>>>    starts from 1, and thus if you have 23 tokens then the ids will go
>>>>>    from 2 to 24. The original update script will do: nid = 23 + 1 = 24,
>>>>>    which is already taken, and the modification will give 25 correctly.)
>>>>>    Make the same change in the second place: nid = len(evcb)+2;
>>>>>
>>>>> Do this and it will work.
>>>>>
>>>>> In any case... send me a zip file of your working directory (if it's
>>>>> small... you are testing it on small data, right?). I will see what the
>>>>> problem is.
>>>>>
>>>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <[email protected]> wrote:
>>>>>
>>>>>> Dear Raj,
>>>>>> I also tried to use your scripts for incremental alignment. I copied
>>>>>> your python script into the desired directory, but I still receive the
>>>>>> same error as posted by Ihab.
>>>>>> reading vocabulary files
>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>> 24 roi 2
>>>>>> TOKEN ID 24 has already been assigned to: roi
>>>>>>
>>>>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>>>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>>>>
>>>>>> I took only 10 sentences for the incremental align_new.sh run, which
>>>>>> generated the error, and I found 8054 lines in
>>>>>> new_corpus/new_tgt_file.vcb.
>>>>>> Is there any problem? Can you please help me with this?
>>>>>>
>>>>>> Thanks and regards,
>>>>>> sandipan
>>>>>>
>>>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Ihab.
>>>>>>> There is a python script that was there in the google drive folder
>>>>>>> in the first mail I sent you.
>>>>>>> Please replace the existing file with my copy.
>>>>>>>
>>>>>>> It has to work.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> -------- Original message --------
>>>>>>> From: Ihab Ramadan <[email protected]>
>>>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>>>> To: 'Raj Dabre' <[email protected]>
>>>>>>> Cc: [email protected]
>>>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> Your point is clear and I tried to follow the steps you mentioned, but
>>>>>>> I am stuck now at the align_new.sh script, which gives me this error:
>>>>>>>
>>>>>>> reading vocabulary files
>>>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>> 29107 q-1 4
>>>>>>>
>>>>>>> Do you have any idea what this error means?
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>> Dear Ihab,
>>>>>>>
>>>>>>> Perhaps I should have mentioned much more clearly what my script
>>>>>>> does. Sorry for that.
>>>>>>>
>>>>>>> Let me start with this: There is no direct/easy way to generate the
>>>>>>> moses.ini file as you need.
>>>>>>>
>>>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>>>>    trained an SMT system on it. This naturally gives the phrase table,
>>>>>>>    reordering table and moses.ini.
>>>>>>>
>>>>>>> 2. Suppose you got 500k more lines of parallel corpora. There
>>>>>>>    are 2 ways:
>>>>>>>
>>>>>>>    a. Retrain on 2.5 million lines from scratch (will take lots of
>>>>>>>       time: ~2-3 days on a regular machine).
>>>>>>>
>>>>>>>    b. Train on only the 500k new lines using the alignment
>>>>>>>       information of the original training data (faster: ~6-7 hours).
>>>>>>>
>>>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>>>> TABLES.*
>>>>>>>
>>>>>>> 1. full_train.sh: This trains on the original corpus of 2 million
>>>>>>>    lines. (Generates alignment files only for the original corpus.)
>>>>>>>
>>>>>>> 2. align_new.sh: This trains on the new corpus of 500k lines.
>>>>>>>    (Generates alignment files only for the new corpus, using the
>>>>>>>    alignments from 1.)
>>>>>>>
>>>>>>> *Why this split ????* Because the basic training step of Moses does
>>>>>>> not preserve the alignment probability information. Only the alignments
>>>>>>> are saved. To continue training we need the probability information.
>>>>>>>
>>>>>>> You can pass flags to moses to preserve this information (this
>>>>>>> flag is --giza-option). If you do this then you will not need
>>>>>>> full_train.sh, but you will have to change the config files before
>>>>>>> using align_new.sh.
>>>>>>>
>>>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>>>
>>>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to
>>>>>>>    the forward (fwd) alignments generated by full_train.sh.
>>>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to
>>>>>>>    the inverse (inv) alignments generated by full_train.sh.
>>>>>>> 3. Run the moses training script with additional flags:
>>>>>>>
>>>>>>>    - --first-step -- first step in the training process (default
>>>>>>>      1). This will be 4.
>>>>>>>    - --last-step -- last step in the training process (default
>>>>>>>      7). This will remain 7.
>>>>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>>>
>>>>>>> For example:
>>>>>>>
>>>>>>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>>>>>>   -corpus <your new corpus name> \
>>>>>>>   -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>>>>>   -lm 0:3:<path to LM>:8 \
>>>>>>>   --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
>>>>>>>   -external-bin-dir <path to giza++ binaries>
>>>>>>>
>>>>>>> For more details on the training step read this:
>>>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>>>
>>>>>>> What this does is assume that you already have the alignments, continue
>>>>>>> with phrase extraction and reordering, and generate the new moses.ini
>>>>>>> file.
>>>>>>>
>>>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>>>
>>>>>>> If you are still unclear then please ask and I will try to help you
>>>>>>> as much as I can.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> That’s great work, my friend.
>>>>>>>
>>>>>>> These files make the script work, but it takes a long time to finish;
>>>>>>> also, it did not generate the model folder which contains the moses.ini
>>>>>>> file.
>>>>>>>
>>>>>>> Is this normal?
>>>>>>>
>>>>>>> I am now trying to run it again, as I suspect that the server was shut
>>>>>>> down before the training was completed, but I notice that it starts from
>>>>>>> the beginning and does not use the existing generated files.
>>>>>>>
>>>>>>> Thanks Raj, it is still great work.
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>> Ahh.... I totally forgot that part.
>>>>>>>
>>>>>>> Sorry.
>>>>>>>
>>>>>>> PFA.
>>>>>>>
>>>>>>> Just place them in the folder where the shell scripts full_train.sh
>>>>>>> and align_new.sh are.
>>>>>>>
>>>>>>> Hopefully it should run now.
>>>>>>>
>>>>>>> Please let me know if you succeed.
>>>>>>>
>>>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>
>>>>>>> Dear Raj,
>>>>>>>
>>>>>>> It is a great solution.
>>>>>>>
>>>>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>>>>> training.
>>>>>>>
>>>>>>> I followed the steps you mentioned, but I faced this error when I
>>>>>>> was running the full_train.sh script:
>>>>>>>
>>>>>>> bla bla bla
>>>>>>> .
>>>>>>> .
>>>>>>> .
>>>>>>> .
>>>>>>>
>>>>>>> Starting MGIZA
>>>>>>> Initializing Global Paras
>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>> ERROR: Cannot open configuration file configgiza.fwd!
>>>>>>> Starting MGIZA
>>>>>>> Initializing Global Paras
>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>> ERROR: Cannot open configuration file configgiza.rev!
>>>>>>>
>>>>>>> These two files do not exist.
>>>>>>> Should they be generated by the installation?
>>>>>>> How do I get them?
>>>>>>>
>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>>>> *To:* [email protected]
>>>>>>> *Cc:* [email protected]
>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>
>>>>>>> Hello Ihab,
>>>>>>>
>>>>>>> I would suggest using mgiza++:
>>>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>>>
>>>>>>> It is very easy to use.
>>>>>>>
>>>>>>> I also wrote some scripts to make training easy.
>>>>>>> Visit the link below for my scripts.
>>>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>>>
>>>>>>> Usage:
>>>>>>>
>>>>>>> To train basic IBM models:
>>>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>>
>>>>>>> To align 2 new files using previously trained models (aka continue
>>>>>>> training):
>>>>>>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>>
>>>>>>> There is also a python script which you had better replace in the
>>>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I just need clear steps on how to do incremental training in
>>>>>>> Moses, as the illustration in the manual is not clear enough.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> *Ihab Ramadan* | Senior Developer | Saudisoft
>>>>>>> <http://www.saudisoft.com/> - Egypt | *Tel* +2 02 330 320 37
>>>>>>> Ext- 0 | Mob +201007570826 | Fax +20233032036
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>
>>>>>>> --
>>>>>>> Raj Dabre.
>>>>>>> Research Student,
>>>>>>> Graduate School of Informatics,
>>>>>>> Kyoto University.
>>>>>>> CSE MTech, IITB., 2011-2014
>>>>>>>
>>>>>>> --
>>>>>>> Raj Dabre.
>>>>>>> Research Student,
>>>>>>> Graduate School of Informatics,
>>>>>>> Kyoto University.
>>>>>>> CSE MTech, IITB., 2011-2014
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Moses-support mailing list
>>>>>>> [email protected]
>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>> --
>>>>> Raj Dabre.
>>>>> Research Student,
>>>>> Graduate School of Informatics,
>>>>> Kyoto University.
>>>>> CSE MTech, IITB., 2011-2014

--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
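
For anyone who wants to double-check a .vcb file before rerunning align_new.sh, here is a small Python sketch of the sort-by-id step from the top of this message, together with a duplicate-id check. It assumes each vocabulary line has the form "id token count", as in the error output quoted in this thread (24 roi 2). The function names, and every sample line other than "24 roi 2", are made up for illustration; this is not part of MGIZA or of the scripts discussed here.

```python
from collections import Counter


def vcb_id(line):
    """Return the numeric id at the start of one vocabulary line.

    Assumes the 'id token count' layout seen in the thread's error output.
    """
    return int(line.split()[0])


def sort_vcb_lines(lines):
    """Order non-empty lines by numeric id, roughly like `sort -n`.

    Python's sort is stable, so lines sharing an id keep their input order.
    """
    entries = [line.strip() for line in lines if line.strip()]
    return sorted(entries, key=vcb_id)


def duplicate_ids(lines):
    """Return the ids that appear on more than one line, in increasing order."""
    counts = Counter(vcb_id(line) for line in lines if line.strip())
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical sample: only "24 roi 2" comes from the thread.
sample = ["24 roi 2", "2 le 10", "24 reine 1", "3 la 7"]
print(sort_vcb_lines(sample))
print(duplicate_ids(sample))
```

If duplicate_ids reports anything for a freshly generated .vcb file, sorting alone will probably not be enough, and the nid offset discussed in the quoted messages below is the next thing to look at.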
