Hey, I am pretty sure that my script does not generate duplicate token IDs.
In fact, I used to get the same error until I modified the script.

In case you want to avoid this error without using my script:

1. Open the original Python script: plain2snt-hasvcb.py
2. Find the line that computes the next token ID: nid = len(fvcb)+1;
3. Change this line to: nid = len(fvcb)+2;
   (This is because the ID numbering starts from 1, so if you have 23 tokens
   their IDs run from 2 to 24. The original line gives nid = 23 + 1 = 24,
   which is already in use; the modified line correctly gives 25.)
4. The same change is needed in a second place: nid = len(evcb)+2;

Do this and it will work.

In any case, send me a zip file of your working directory if it is small
(you are testing on small data, right?). I will see what the problem is.

On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <[email protected]> wrote:

> Dear Raj,
> I also tried to use your scripts for incremental alignment. I copied your
> Python script into the desired directory, but I am still receiving the same
> error as posted by Ihab.
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 24 roi 2
> TOKEN ID 24 has already been assigned to: roi
>
> I took only 500 sentence pairs for full_train.sh and it worked fine, with
> 758 lines in the corpus/tgt_filename.vcb file.
>
> I took only 10 sentences for incremental alignment_new.sh, which generated
> the error, and I found 8054 lines in the new_corpus/new_tgt_file.vcb.
> Is there any problem? Can you please help me with this?
>
> Thanks and regards,
> sandipan
>
>
> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>
>> Dear Ihab.
>> There is a Python script in the Google Drive folder from the
>> first mail I sent you.
>> Please replace the existing file with my copy.
>>
>> It has to work.
>>
>> Regards.
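For anyone patching plain2snt-hasvcb.py by hand instead of replacing the file, the numbering problem described at the top of this message can be sketched as follows. This is only an illustration under stated assumptions, not the script's actual code: the dictionary contents and the helper `next_id` are made up, and only the two expressions `len(fvcb)+1` and `len(fvcb)+2` come from the thread.

```python
# Hypothetical stand-in for a vocabulary loaded from an existing .vcb file:
# 23 tokens whose IDs run from 2 to 24, as in the example in this thread.
fvcb = {"tok%d" % i: i for i in range(2, 25)}


def next_id(vocab, offset):
    """ID that would be handed to the next unseen token (hypothetical helper)."""
    return len(vocab) + offset


old_nid = next_id(fvcb, 1)  # original line: nid = len(fvcb)+1
new_nid = next_id(fvcb, 2)  # modified line: nid = len(fvcb)+2

print(old_nid, old_nid in fvcb.values())  # 24 True  (ID already assigned)
print(new_nid, new_nid in fvcb.values())  # 25 False (ID is free)
```

Deriving the next ID from the largest existing ID (for example `max(fvcb.values()) + 1`) would not depend on the offset at all, but that is a different approach from the one-line change suggested here.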
>>
>> Sent from Samsung Mobile
>>
>> -------- Original message --------
>> From: Ihab Ramadan <[email protected]>
>> Date: 05/11/2014 00:54 (GMT+09:00)
>> To: 'Raj Dabre' <[email protected]>
>> Cc: [email protected]
>> Subject: RE: [Moses-support] Incremental training
>>
>> Dear Raj,
>>
>> Your point is clear and I tried to follow the steps you mentioned, but I
>> am stuck now in the align_new.sh script, which gives me this error:
>>
>> reading vocabulary files
>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>> ERROR: TOKEN ID must be unique for each token, in line :
>> 29107 q-1 4
>>
>> Do you have any idea what this error means?
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>> Dear Ihab,
>>
>> Perhaps I should have mentioned much more clearly what my script does.
>> Sorry for that.
>>
>> Let me start with this: there is no direct/easy way to generate the
>> moses.ini file as you need.
>>
>> 1. Suppose you have 2 million lines of parallel corpora and you trained an
>> SMT system on them. This naturally gives the phrase table, reordering table
>> and moses.ini.
>>
>> 2. Suppose you got 500k more lines of parallel corpora. There are 2 ways:
>>
>>    a. Retrain on all 2.5 million lines from scratch (will take lots of
>>       time: ~2-3 days on a regular machine).
>>
>>    b. Train on only the 500k new lines using the alignment information
>>       of the original training data (faster: ~6-7 hours).
>>
>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*
>>
>> 1. full_train.sh: This trains on the original corpus of 2 million lines.
>> (Generates alignment files only for the original corpus.)
>>
>> 2. align_new.sh: This trains on the new corpus of 500k lines.
>> (Generates alignment files only for the new corpus, using the alignments
>> from 1.)
>>
>> *Why this split?* Because the basic training step of Moses does not
>> preserve the alignment probability information. Only the alignments are
>> saved. To continue training we need the probability information.
>>
>> You can pass flags to Moses to preserve this information (the flag is
>> --giza-option). If you do this, you will not need full_train.sh, but
>> you will have to change the config files before using align_new.sh.
>>
>> *HOW TO GET AN UPDATED PHRASE TABLE:*
>>
>> 1. Append the forward alignments (fwd) generated by align_new.sh to the
>> forward (fwd) alignments generated by full_train.sh.
>> 2. Append the inverse alignments (inv) generated by align_new.sh to the
>> inverse (inv) alignments generated by full_train.sh.
>> 3. Run the Moses training script with additional flags:
>>
>>    - --first-step -- first step in the training process (default 1). This will be 4.
>>    - --last-step -- last step in the training process (default 7). This will remain 7.
>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>
>> For example:
>>
>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>   -corpus <your new corpus name> \
>>   -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>   -lm 0:3:<path to LM>:8 \
>>   --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd \
>>   --giza-e2f <path to folder>/new_giza.inv \
>>   -external-bin-dir <path to giza++ binaries>
>>
>> For more details on the training step read this:
>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>
>> What this does is assume that you already have alignments, then continue
>> with phrase extraction and reordering, and generate the new moses.ini file.
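The three steps above (append forward alignments, append inverse alignments, rerun training from step 4) could be scripted roughly as below. This is a sketch under assumptions, not part of the scripts discussed in this thread: the function names and parameters are hypothetical, the append is a plain byte-wise concatenation that assumes both alignment files use the same format and that the old one ends with a newline, and the paths are passed directly after --giza-f2e and --giza-e2f. The command is only built, not executed.

```python
import shutil


def append_alignments(old_path, new_path, out_path):
    """Write the old alignments followed by the new ones to out_path.

    Byte-wise copy: assumes both inputs share one format and that the
    old file ends with a newline.
    """
    with open(out_path, "wb") as out:
        for path in (old_path, new_path):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def train_command(train_script, root_dir, corpus, src, tgt, lm,
                  giza_f2e, giza_e2f, bin_dir):
    """Build (but do not run) the train-model.perl call for steps 4 to 7."""
    return [
        train_script,
        "-root-dir", root_dir,
        "-corpus", corpus,
        "-f", src, "-e", tgt,
        "-alignment", "grow-diag-final-and",
        "-reordering", "msd-bidirectional-fe",
        "-lm", lm,
        "--first-step", "4", "--last-step", "7",
        "--giza-f2e", giza_f2e,
        "--giza-e2f", giza_e2f,
        "-external-bin-dir", bin_dir,
    ]
```

Call `append_alignments` once for the fwd files and once for the inv files, then hand the list returned by `train_command` to `subprocess.run(cmd, check=True)` after checking every path.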
>>
>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>
>> If you are still unclear then please ask and I will try to help you as
>> much as I can.
>>
>> Regards.
>>
>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]> wrote:
>>
>> Dear Raj,
>>
>> That's great work, my friend.
>>
>> These files make the script work, but it takes a long time to finish, and
>> it did not generate the model folder which contains the moses.ini file.
>>
>> Is this normal?
>>
>> I am now trying to run it again, as I suspect that the server was shut down
>> before the training was completed, but I notice that it starts from the
>> beginning and does not use the existing generated files.
>>
>> Thanks Raj, it is still great work.
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Thursday, October 30, 2014 4:54 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>> Ahh.... I totally forgot that part.
>>
>> Sorry.
>>
>> PFA.
>>
>> Just place them in the folder where the shell scripts full_train.sh and
>> align_new.sh are.
>>
>> Hopefully it should run now.
>>
>> Please let me know if you succeed.
>>
>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]> wrote:
>>
>> Dear Raj,
>>
>> It is a great solution.
>>
>> I installed MGIZA++ successfully and I am using your scripts to run
>> training.
>>
>> I followed the steps you mentioned, but I faced this error when I was
>> running the full_train.sh script:
>>
>> bla bla bla
>> .
>> .
>> .
>> .
>>
>> Starting MGIZA
>> Initializing Global Paras
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>> ERROR: Cannot open configuration file configgiza.fwd!
>> Starting MGIZA
>> Initializing Global Paras
>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>> ERROR: Cannot open configuration file configgiza.rev!
>>
>> These two files do not exist.
>>
>> Should they be generated by the installation?
>>
>> How do I get them?
>>
>> *From:* Raj Dabre [mailto:[email protected]]
>> *Sent:* Sunday, October 26, 2014 6:21 PM
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: [Moses-support] Incremental training
>>
>> Hello Ihab,
>>
>> I would suggest using mgiza++.
>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>
>> It is very easy to use.
>>
>> I also wrote some scripts to make training easy.
>> Visit the link below for my scripts.
>>
>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>
>> Usage:
>>
>> To train basic IBM models:
>>
>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> To align 2 new files using previously trained models (aka continue
>> training):
>>
>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>
>> There is also a Python script which you had better replace in the scripts
>> folder of mgiza++. I have modified it to work with my scripts.
>>
>> Hope this helps.
>>
>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]> wrote:
>>
>> Dear All,
>>
>> I just need clear steps on how to do incremental training in Moses, as
>> the illustration in the manual is not clear enough.
>>
>> Thanks
>>
>> Best Regards
>>
>> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>> - Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
>> +20233032036
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>> --
>> Raj Dabre.
>> Research Student,
>> Graduate School of Informatics,
>> Kyoto University.
>> CSE MTech, IITB., 2011-2014
>

--
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014
