Well then, your paths must be wrong. I can't see why the files are not being generated. I'll look into it tomorrow and let you know.
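In the meantime, a quick check along these lines might help tell a path problem from a duplicate-ID problem. It is only a sketch: it assumes each .vcb line looks like `24 roi 2` (ID, token, count), as in the error output posted in this thread, and the function names are made up for illustration; they are not part of MGIZA.

```python
from collections import defaultdict
from pathlib import Path


def find_duplicate_ids(vcb_lines):
    """Return {token_id: [line numbers]} for every ID that appears on more than one line."""
    lines_by_id = defaultdict(list)
    for line_no, line in enumerate(vcb_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        lines_by_id[fields[0]].append(line_no)
    return {tid: nos for tid, nos in lines_by_id.items() if len(nos) > 1}


def check_vcb(path):
    """Report whether a .vcb file exists and, if it does, whether any ID repeats."""
    vcb = Path(path)
    if not vcb.is_file():
        return f"missing: {vcb}"
    with vcb.open(encoding="utf-8") as handle:
        duplicates = find_duplicate_ids(handle)
    if duplicates:
        return f"duplicate ids in {vcb}: {duplicates}"
    return f"ok: {vcb}"


if __name__ == "__main__":
    # File name taken from the error message in this thread; adjust to your setup.
    print(check_vcb("new_corpus/inc.fr.vcb"))
```

If this reports "missing", the script never wrote the file and the hard-coded paths are the first thing to look at; if it reports duplicates, the ID counter is still off.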
On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:

> When I am using your script then it has no problem. But when I modified
> the lines to nid = len(fvcb)+2; there are no .vcb files in the
> new_corpus/ dir.
> I used these two commands:
>
> sh full_train.sh org.en org.fr
> sh align_new.sh inc.en inc.fr org.en org.fr
>
> Is the above right?
>
> I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE, NEW_CORPUS_BASE)
> hard-coded in the scripts.
>
>
> On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:
>
>> Cannot open file???
>> Does the file exist??
>> Are you passing the path properly?
>>
>>
>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]>
>> wrote:
>>
>>> Hi,
>>> I made the changes based on your suggestions; it is now generating a
>>> different error, as below:
>>>
>>> reading vocabulary files
>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>
>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>
>>> I am attaching the working dir and the .py scripts herewith. The 10
>>> parallel sentences for incremental alignment are in inc_data/, whereas
>>> the original 500 sentences are in the mtdata/ directory.
>>>
>>> Thanks a ton for your help.
>>>
>>> Regards,
>>> sandipan
>>>
>>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>>
>>>> Hey,
>>>>
>>>> I am pretty sure that my script does not generate duplicate token ids.
>>>>
>>>> In fact, I used to get the same error till I modified the script.
>>>>
>>>> In case you do want to avoid this error and not use my script then:
>>>>
>>>> 1. Open the original python script: plain2snt-hasvcb.py
>>>> 2. There is a line which increments the id counter by 1 (the line is
>>>> nid = len(fvcb)+1;)
>>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id numbering
>>>> starts from 1, and thus if you have 23 tokens then the ids will go from
>>>> 2 to 24. The original update script will do: nid = 23 + 1 = 24 and the
>>>> modification will give 25 correctly). This is in 2 places; the other
>>>> line becomes nid = len(evcb)+2;
>>>>
>>>> Do this and it will work.
>>>>
>>>> In any case... send me a zip file of your working directory (if it's
>>>> small.... you are testing it on small data, right?). I will see what
>>>> the problem is.
>>>>
>>>>
>>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>>>> [email protected]> wrote:
>>>>
>>>>> Dear Raj,
>>>>> I also tried to use your scripts for incremental alignment. I copied
>>>>> your python script into the desired directory; still I am receiving
>>>>> the same error as posted by Ihab.
>>>>>
>>>>> reading vocabulary files
>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>> 24 roi 2
>>>>> TOKEN ID 24 has already been assigned to: roi
>>>>>
>>>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>>>
>>>>> I took only 10 sentences for incremental alignment_new.sh, which
>>>>> generated the error, and I found 8054 lines in the
>>>>> new_corpus/new_tgt_file.vcb
>>>>> Is there any problem? Can you please help me with this?
>>>>>
>>>>> Thanks and regards,
>>>>> sandipan
>>>>>
>>>>>
>>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>>
>>>>>> Dear Ihab.
>>>>>> There is a python script that was there in the google drive folder in
>>>>>> the first mail I sent you.
>>>>>> Please replace the existing file with my copy.
>>>>>>
>>>>>> It has to work.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>> -------- Original message --------
>>>>>> From: Ihab Ramadan <[email protected]>
>>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>>> To: 'Raj Dabre' <[email protected]>
>>>>>> Cc: [email protected]
>>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>>
>>>>>> Dear Raj,
>>>>>>
>>>>>> Your point is clear and I tried to follow the steps you mentioned, but
>>>>>> I am stuck now in the align_new.sh script, which gives me this error:
>>>>>>
>>>>>> reading vocabulary files
>>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>> 29107 q-1 4
>>>>>>
>>>>>> Do you have any idea what this error means?
>>>>>>
>>>>>>
>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>>> *To:* [email protected]
>>>>>> *Cc:* [email protected]
>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>
>>>>>> Dear Ihab,
>>>>>>
>>>>>> Perhaps I should have mentioned much more clearly what my script
>>>>>> does. Sorry for that.
>>>>>>
>>>>>> Let me start with this: There is no direct/easy way to generate the
>>>>>> moses.ini file as you need.
>>>>>>
>>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>>> trained an SMT system on it. This naturally gives the phrase table,
>>>>>> reordering table and moses.ini.
>>>>>>
>>>>>> 2. Suppose you got 500k more lines of parallel corpora.... there are
>>>>>> 2 ways:
>>>>>>
>>>>>>    a. Retrain on 2.5 million lines from scratch (will take lots of
>>>>>>    time: ~ 2-3 days on a regular machine)
>>>>>>
>>>>>>    b. Train on only the 500k new lines using the alignment
>>>>>>    information of the original training data. (Faster: ~ 6-7 hours).
>>>>>>
>>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>>> TABLES.*
>>>>>>
>>>>>> 1. full_train.sh -------------- This trains on the original corpus of
>>>>>> 2 million lines. (Generates alignment files only for the original
>>>>>> corpus)
>>>>>>
>>>>>> 2. align_new.sh -------------- This trains on the new corpus of 500k
>>>>>> lines. (Generates alignment files only for the new corpus using the
>>>>>> alignments for 1)
>>>>>>
>>>>>> *Why this split ????* Because the basic training step of Moses does
>>>>>> not preserve the alignment probability information. Only the
>>>>>> alignments are saved. To continue training we need the probability
>>>>>> information.
>>>>>>
>>>>>> You can pass flags to moses to preserve this information (this flag
>>>>>> is --giza-option . If you do this then you will not need
>>>>>> full_train.sh. But you will have to change the config files before
>>>>>> using align_new.sh)
>>>>>>
>>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>>
>>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to
>>>>>> the forward (fwd) alignments generated by full_train.sh.
>>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to
>>>>>> the inverse (inv) alignments generated by full_train.sh.
>>>>>> 3. Run the moses training script with additional flags:
>>>>>>
>>>>>>    - --first-step -- first step in the training process (default
>>>>>>    1)--------------- This will be 4
>>>>>>    - --last-step -- last step in the training process (default
>>>>>>    7)------------ This will remain 7
>>>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>>
>>>>>> For example:
>>>>>>
>>>>>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>>>>> -corpus <your new corpus name> \
>>>>>> -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>>>> -lm 0:3:<path to LM>:8 \
>>>>>> --first-step 4 --last-step 7 --giza-f2e -- <path to folder>/new_giza.fwd --giza-e2f -- <path to folder>/new_giza.inv \
>>>>>> -external-bin-dir <path to giza++ binaries>
>>>>>>
>>>>>> For more details on the training step read this:
>>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>>
>>>>>> What this does is assume that you have alignments, continue with
>>>>>> phrase extraction and reordering, and generate the new moses.ini file.
>>>>>>
>>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>>
>>>>>> If you are still unclear then please ask and I will try to help you
>>>>>> as much as I can.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>>
>>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Dear Raj,
>>>>>>
>>>>>> That’s great work, my friend.
>>>>>>
>>>>>> These files make the script work, but it takes a long time to finish;
>>>>>> also, it did not generate the model folder which contains the
>>>>>> moses.ini file.
>>>>>>
>>>>>> Is this normal?
>>>>>>
>>>>>> I am now trying to run it again, as I suspect that the server was
>>>>>> shut down before the training was completed, but I notice that it
>>>>>> starts from the beginning and does not use the existing files
>>>>>> generated.
>>>>>>
>>>>>> Thanks Raj, it is still great work.
>>>>>>
>>>>>>
>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>> *To:* [email protected]
>>>>>> *Cc:* [email protected]
>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>
>>>>>> Ahh.... I totally forgot that part.
>>>>>>
>>>>>> Sorry.
>>>>>>
>>>>>> PFA.
>>>>>>
>>>>>> Just place them in the folder where the shell scripts full_train.sh
>>>>>> and align_new.sh are.
>>>>>>
>>>>>> Hopefully it should run now.
>>>>>>
>>>>>> Please let me know if you succeed.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> Dear Raj,
>>>>>>
>>>>>> It is a great solution.
>>>>>>
>>>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>>>> training.
>>>>>>
>>>>>> I followed the steps you mentioned, but I faced this error when I
>>>>>> was running the full_train.sh script:
>>>>>>
>>>>>> bla bla bla
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>>
>>>>>> Starting MGIZA
>>>>>> Initializing Global Paras
>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>> ERROR: Cannot open configuration file configgiza.fwd!
>>>>>> Starting MGIZA
>>>>>> Initializing Global Paras
>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>> ERROR: Cannot open configuration file configgiza.rev!
>>>>>>
>>>>>> These two files do not exist.
>>>>>>
>>>>>> Should they be generated by the installation?
>>>>>>
>>>>>> How do I get them?
>>>>>>
>>>>>>
>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>>> *To:* [email protected]
>>>>>> *Cc:* [email protected]
>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>
>>>>>> Hello Ihab,
>>>>>>
>>>>>> I would suggest using mgiza++.
>>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>>
>>>>>> It is very easy to use.
>>>>>>
>>>>>> I also wrote some scripts to make it easy for training.
>>>>>> Visit the link below for my scripts.
>>>>>>
>>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>>
>>>>>> Usage:
>>>>>>
>>>>>> To train basic IBM models:
>>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>>>>>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>
>>>>>> To align 2 new files using previously trained models (aka continue
>>>>>> training):
>>>>>>
>>>>>> bash align_new.sh <new_src_corpus_file_name>
>>>>>> <new_tgt_corpus_file_name> <old_src_corpus_file_name>
>>>>>> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base>
>>>>>> <path_to_mgizapp_installation>
>>>>>>
>>>>>> There is also a python script which you had better replace in the
>>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>>
>>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I just need clear steps on how to do incremental training in Moses,
>>>>>> as the illustration in the manual is not clear enough.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> *Ihab Ramadan* | Senior Developer | Saudisoft
>>>>>> <http://www.saudisoft.com/> - Egypt | *Tel* +2 02 330 320 37 Ext- 0
>>>>>> | Mob +201007570826 | Fax +20233032036
>>>>>>
>>>>>> _______________________________________________
>>>>>> Moses-support mailing list
>>>>>> [email protected]
>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>
>>>>>> --
>>>>>> Raj Dabre.
>>>>>> Research Student,
>>>>>> Graduate School of Informatics,
>>>>>> Kyoto University.
>>>>>> CSE MTech, IITB., 2011-2014
>>>>>>
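
For reference, the numbering argument in the quoted thread (23 tokens occupying IDs 2 to 24, so the next free ID is 25) can be sketched roughly as follows. This is an illustration only, not the code from plain2snt-hasvcb.py; the function names and the `first_id` parameter are hypothetical, and the starting ID of 2 is taken from the example in the thread.

```python
def next_token_id(vocab, first_id=2):
    """Next unused ID when existing IDs run first_id .. first_id + len(vocab) - 1."""
    return len(vocab) + first_id


def extend_vocab(vocab, new_tokens, first_id=2):
    """Return a copy of vocab with unseen tokens appended under fresh IDs."""
    extended = dict(vocab)
    next_id = next_token_id(extended, first_id)
    for token in new_tokens:
        if token not in extended:  # known tokens keep their old ID
            extended[token] = next_id
            next_id += 1
    return extended


if __name__ == "__main__":
    old_vocab = {f"tok{i}": i + 2 for i in range(23)}  # 23 tokens, IDs 2..24
    print(next_token_id(old_vocab))               # len + 2, the corrected offset
    print(next_token_id(old_vocab, first_id=1))   # len + 1, collides with the last ID
```

With an offset of +1 the first new token would be given ID 24, which already belongs to the last old token; that collision is the "TOKEN ID must be unique" error reported above.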
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
