It says it cannot open the file. Does the file exist? Are you passing the path properly?
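A quick way to check both the path and the earlier duplicate-id problem is a small Python sketch like the one below. It assumes every line of a .vcb file has three whitespace-separated fields (id, token, count), as in the "24 roi 2" line from the earlier error. The name check_vcb is only made up for this example and is not part of MGIZA++.

```python
import os
from collections import defaultdict


def check_vcb(path):
    """Return a list of problems found in a vocabulary (.vcb) file.

    Assumed format (not taken from the MGIZA++ docs): one
    "id token count" triple per line, separated by whitespace.
    """
    if not os.path.isfile(path):
        # Print the working directory too, since relative paths such as
        # new_corpus/inc.fr.vcb depend on where the script is started.
        return ["file not found: %s (cwd is %s)" % (path, os.getcwd())]

    tokens_by_id = defaultdict(list)
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 3:
                problems.append(
                    "line %d: expected 3 fields, got %d" % (line_no, len(fields))
                )
                continue
            token_id, token, _count = fields
            tokens_by_id[token_id].append(token)

    # Any id that maps to more than one token is a duplicate.
    for token_id, tokens in tokens_by_id.items():
        if len(tokens) > 1:
            problems.append(
                "token id %s assigned to: %s" % (token_id, ", ".join(tokens))
            )
    return problems
```

Calling check_vcb("new_corpus/inc.fr.vcb") from the directory where you run align_new.sh returns an empty list if the file is found and every id is used once, otherwise a list of messages describing what is wrong.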

On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:

> Hi,
> I made the changes based on your suggestions, its now generating a
> different error as below:
>
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
>
> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>
> I am attaching the working dir and the .py scripts here with. I have the
> 10 parallel sentences for incremental alignment is in inc_data/ where as
> the original 500 sentences are there in mtdata/ directory
>
> Thanks a ton for your help.
>
> Regards,
> sandipan
>
> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>
>> Hey,
>>
>> I am pretty sure that my script does not generate duplicate token id.
>>
>> In fact, I used to get the same error till I modified the script.
>>
>> In case you do want to avoid this error and not use my script then:
>>
>> 1. Open the original python script: plain2snt-hasvcb.py
>> 2. There is a line which increments the id counter by 1 ( the line is
>>    nid = len(fvcb)+1;)
>> 3. Make this line: nid = len(fvcb)+2; (This is cause the id numbering
>>    starts from 1, and thus if you have 23 tokens then the id will go
>>    from 2 to 24. The original update script will do: nid = 23 + 1 = 24
>>    and the modification will give 25 correctly). This is in 2 places:
>>    nid = len(evcb)+2;
>>
>> Do this and it will work.
>>
>> In any case... send me a zip file of your working directory (if its
>> small.... you are testing it on small data right ? ). I will see what the
>> problem is.
>>
>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>> [email protected]> wrote:
>>
>>> Dear Raj,
>>> I also tried to use your scripts for incremental alignment. I copied
>>> your python script in the desired directory still I am receiving the same
>>> error as posted by Ihab.
>>> reading vocabulary files
>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>> ERROR: TOKEN ID must be unique for each token, in line :
>>> 24 roi 2
>>> TOKEN ID 24 has already been assigned to: roi
>>>
>>> I took only 500 sentences pairs for full_train.sh and it worked fine
>>> with 758 lines in the corpus/tgt_filename.vcb file
>>>
>>> I took only 10 sentences for incremental alignment_new.sh which
>>> generated the error and I found 8054 lines in the
>>> new_corpus/new_tgt_file.vcb
>>> Is there any problem? Can you please help me on the same.
>>>
>>> Thanks and regards,
>>> sandipan
>>>
>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>
>>>> Dear Ihab.
>>>> There is a python script that was there in the google drive folder in
>>>> the first mail I sent you.
>>>> Please replace the existing file with my copy.
>>>>
>>>> It has to work.
>>>>
>>>> Regards.
>>>>
>>>> Sent from Samsung Mobile
>>>>
>>>> -------- Original message --------
>>>> From: Ihab Ramadan <[email protected]>
>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>> To: 'Raj Dabre' <[email protected]>
>>>> Cc: [email protected]
>>>> Subject: RE: [Moses-support] Incremental training
>>>>
>>>> Dear Raj,
>>>>
>>>> Your point is clear and I try to follow the steps you mentioned but I
>>>> stuck now in the align_new.sh script which gives me this error
>>>>
>>>> reading vocabulary files
>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>> 29107 q-1 4
>>>>
>>>> Do you have any idea what this error means?
>>>>
>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>> *To:* [email protected]
>>>> *Cc:* [email protected]
>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>
>>>> Dear Ihab,
>>>>
>>>> Perhaps I should have mentioned much more clearly what my script does.
>>>> Sorry for that.
>>>>
>>>> Let me start with this: There is no direct/easy way to generate the
>>>> moses.ini file as you need.
>>>>
>>>> 1. Suppose you have 2 million lines of parallel corpora and you trained
>>>>    a SMT system for it. This naturally gives the phrase table, reordering
>>>>    table and moses.ini.
>>>>
>>>> 2. Suppose you got 500 k more lines of parallel corpora.... there are 2
>>>>    ways:
>>>>
>>>>    a. Retrain 2.5 million lines from scratch (will take lots of time:
>>>>       ~ 2-3 days on a regular machines)
>>>>
>>>>    b. Train on only the 500k new lines using the alignment information
>>>>       of the original training data. (Faster: ~ 6-7 hours).
>>>>
>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>> TABLES.*
>>>>
>>>> 1. full_train.sh -------------- This trains on the original corpus of 2
>>>>    million lines. (Generate alignment files only for the original corpus)
>>>>
>>>> 2. align_new.sh -------------- This trains on the new corpus of 500 k
>>>>    lines. (Generate alignment files only for the new corpus using the
>>>>    alignments for 1)
>>>>
>>>> *Why this split ????* Because the basic training step of Moses does
>>>> not preserve the alignment probability information. Only the alignments are
>>>> saved. To continue training we need the probability information.
>>>>
>>>> You can pass flags to moses to preserve this information ( this flag
>>>> is --giza-option . If you do this then you will not need
>>>> full_train.sh. But you will have to change the config files before using
>>>> align_new.sh)
>>>>
>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>
>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to the
>>>>    forward (fwd) alignments generated by full_train.sh.
>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to the
>>>>    inverse (inv) alignments generated by full_train.sh.
>>>> 3. Run the moses training script with additional flags:
>>>>
>>>>    - --first-step -- first step in the training process (default
>>>>      1)--------------- This will be 4
>>>>    - --last-step -- last step in the training process (default
>>>>      7)------------ This will remain 7
>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>
>>>> For example:
>>>>
>>>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>>> -corpus <your new corpus name> \
>>>> -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>> -lm 0:3:<path to LM>:8 \
>>>> --first-step 4 --last-step 7 --giza-f2e -- <path to folder>/new_giza.fwd --giza-e2f -- <path to folder>/new_giza.inv \
>>>> -external-bin-dir <path to giza++ binaries>
>>>>
>>>> For more details on the training step read this:
>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>
>>>> What this does is assumes that you have alignments and continue the
>>>> phrase extraction, reordering and generate the new moses.ini file.
>>>>
>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>
>>>> If you are still unclear then please ask and I will try to help you as
>>>> much as I can.
>>>>
>>>> Regards.
>>>>
>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]>
>>>> wrote:
>>>>
>>>> Dear Raj,
>>>>
>>>> That’s a great work my friend,
>>>>
>>>> This files make the script work but it takes long time to finish also
>>>> it did not generate the model folder which contain the moses.ini file
>>>>
>>>> Is this normal?
>>>>
>>>> And I now try to run it again as I suspect that the server was shut
>>>> down before the training was completed but i notice that it starts form the
>>>> beginning and did not use the existing files generated
>>>>
>>>> Thanks Raj it still a great work
>>>>
>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>> *To:* [email protected]
>>>> *Cc:* [email protected]
>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>
>>>> Ahh.... i totally forgot that part.
>>>>
>>>> Sorry.
>>>>
>>>> PFA.
>>>>
>>>> Just place them in the folder where the shell scripts full_train.sh and
>>>> align_new.sh are.
>>>>
>>>> Hopefully it should run now.
>>>>
>>>> Please let me know if you succeed.
>>>>
>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]>
>>>> wrote:
>>>>
>>>> Dear Raj,
>>>>
>>>> It is a great solution
>>>>
>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>> training
>>>>
>>>> And I followed the steps you mentioned but I faces this error when I
>>>> was running the full_train.sh script
>>>>
>>>> bla bla bla
>>>> .
>>>> .
>>>> .
>>>> .
>>>>
>>>> Starting MGIZA
>>>> Initializing Global Paras
>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>> ERROR: Cannot open configuration file configgiza.fwd!
>>>>
>>>> Starting MGIZA
>>>> Initializing Global Paras
>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>> ERROR: Cannot open configuration file configgiza.rev!
>>>>
>>>> This two files does not exists
>>>> should they be generated from the installation?
>>>> How to get them?
>>>>
>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>> *To:* [email protected]
>>>> *Cc:* [email protected]
>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>
>>>> Hello Ihab,
>>>>
>>>> I would suggest using mgiza++.
>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>
>>>> It is very easy to use.
>>>>
>>>> I also wrote some scripts to make it easy for training.
>>>> Visit the link below for my scripts.
>>>>
>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>
>>>> Usage:
>>>>
>>>> To train basic IBM models:
>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>>>> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>
>>>> To align 2 new files using previously trained models (aka continue
>>>> training).
>>>>
>>>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name>
>>>> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base>
>>>> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>
>>>> There is also a python script which you had better replace in the
>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>
>>>> Hope this helps.
>>>>
>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]>
>>>> wrote:
>>>>
>>>> Dear All,
>>>>
>>>> I just need a clear steps on how to do incremental training in moses,
>>>> as the illustration in the manual is not cleared enough
>>>>
>>>> Thanks
>>>>
>>>> Best Regards
>>>>
>>>> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/>
>>>> - Egypt | *Tel* +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax
>>>> +20233032036
>>>>
>>>> _______________________________________________
>>>> Moses-support mailing list
>>>> [email protected]
>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>
>>>> --
>>>> Raj Dabre.
>>>> Research Student,
>>>> Graduate School of Informatics,
>>>> Kyoto University.
>>>> CSE MTech, IITB., 2011-2014
>>>>
>>>
>>
>> --
>> Raj Dabre.
>> Research Student,
>> Graduate School of Informatics,
>> Kyoto University.
>> CSE MTech, IITB., 2011-2014
>>
>
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
