When I use your script as-is, there is no problem. But after I changed the lines to nid = len(fvcb)+2; no .vcb files are generated in the new_corpus/ directory. I used these two commands:

sh full_train.sh org.en org.fr
sh align_new.sh inc.en inc.fr org.en org.fr

Is the above right? I have kept the paths (MGIZA, MODEL_BASE, CORPUS_BASE and NEW_CORPUS_BASE) hard-coded in the scripts.

On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:

> Cannot open file???
> Does the file exist??
> Are you passing the path properly?
>
> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:
>
>> Hi,
>> I made the changes based on your suggestions; it now generates a
>> different error, shown below:
>>
>> reading vocabulary files
>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>
>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>
>> I am attaching the working dir and the .py scripts herewith. The 10
>> parallel sentences for incremental alignment are in inc_data/, whereas
>> the original 500 sentences are in the mtdata/ directory.
>>
>> Thanks a ton for your help.
>>
>> Regards,
>> sandipan
>>
>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I am pretty sure that my script does not generate duplicate token ids.
>>>
>>> In fact, I used to get the same error till I modified the script.
>>>
>>> In case you do want to avoid this error and not use my script, then:
>>>
>>> 1. Open the original python script: plain2snt-hasvcb.py
>>> 2. There is a line which increments the id counter by 1 (the line is
>>> nid = len(fvcb)+1;)
>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id numbering
>>> starts from 1, and thus if you have 23 tokens then the ids will go from
>>> 2 to 24. The original update script will do: nid = 23 + 1 = 24, and the
>>> modification will give 25 correctly.) This is in 2 places; the other
>>> one becomes: nid = len(evcb)+2;
>>>
>>> Do this and it will work.
>>>
>>> In any case... send me a zip file of your working directory (if it's
>>> small... you are testing it on small data, right?). I will see what the
>>> problem is.
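A note on the nid = len(fvcb)+2 fix quoted above: that arithmetic only holds while the existing ids are exactly 2..len+1. A minimal sketch of a more defensive alternative is to take the next id from the largest id already in use. This assumes .vcb lines have the form "<id> <token> <count>" (as the "24 roi 2" line in the error output suggests). The names parse_vcb and extend_vocab are hypothetical, and this is not the code in plain2snt-hasvcb.py.

```python
def parse_vcb(lines):
    """Parse lines of the assumed form '<id> <token> <count>' into {token: id}."""
    vocab = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        vocab[parts[1]] = int(parts[0])
    return vocab


def extend_vocab(vocab, new_tokens):
    """Return a copy of vocab where unseen tokens get fresh, unused ids."""
    extended = dict(vocab)
    # Continue after the largest id in use; an empty vocabulary starts at 2.
    next_id = max(extended.values(), default=1) + 1
    for token in new_tokens:
        if token not in extended:
            extended[token] = next_id
            next_id += 1
    return extended


# 23 tokens with ids 2..24, mirroring the example in the quoted message.
old_vocab = {"tok%d" % i: i for i in range(2, 25)}
new_vocab = extend_vocab(old_vocab, ["roi", "tok2", "reine"])
parsed = parse_vcb(["2 le 10", "", "3 roi 2"])
```

With 23 existing tokens the first new token gets id 25, the same value as len+2, but the result stays collision-free even if the existing ids have gaps or start elsewhere.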
>>>
>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <[email protected]> wrote:
>>>
>>>> Dear Raj,
>>>> I also tried to use your scripts for incremental alignment. I copied
>>>> your python script into the desired directory, but I am still receiving
>>>> the same error as posted by Ihab.
>>>> reading vocabulary files
>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>> 24 roi 2
>>>> TOKEN ID 24 has already been assigned to: roi
>>>>
>>>> I took only 500 sentence pairs for full_train.sh and it worked fine,
>>>> with 758 lines in the corpus/tgt_filename.vcb file.
>>>>
>>>> I took only 10 sentences for incremental alignment with align_new.sh,
>>>> which generated the error, and I found 8054 lines in
>>>> new_corpus/new_tgt_file.vcb.
>>>> Is there any problem? Can you please help me with this?
>>>>
>>>> Thanks and regards,
>>>> sandipan
>>>>
>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>
>>>>> Dear Ihab,
>>>>> There is a python script in the google drive folder from
>>>>> the first mail I sent you.
>>>>> Please replace the existing file with my copy.
>>>>>
>>>>> It has to work.
>>>>>
>>>>> Regards.
>>>>>
>>>>> Sent from Samsung Mobile
>>>>>
>>>>> -------- Original message --------
>>>>> From: Ihab Ramadan <[email protected]>
>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>> To: 'Raj Dabre' <[email protected]>
>>>>> Cc: [email protected]
>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> Your point is clear and I tried to follow the steps you mentioned, but
>>>>> I am now stuck at the align_new.sh script, which gives me this error:
>>>>>
>>>>> reading vocabulary files
>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>> 29107 q-1 4
>>>>>
>>>>> Do you have any idea what this error means?
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>> Dear Ihab,
>>>>>
>>>>> Perhaps I should have explained much more clearly what my script does.
>>>>> Sorry for that.
>>>>>
>>>>> Let me start with this: There is no direct/easy way to generate the
>>>>> moses.ini file as you need.
>>>>>
>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>> trained an SMT system on it. This naturally gives the phrase table,
>>>>> reordering table and moses.ini.
>>>>>
>>>>> 2. Suppose you got 500k more lines of parallel corpora... there are
>>>>> 2 ways:
>>>>>
>>>>>    a. Retrain on all 2.5 million lines from scratch (will take lots of
>>>>>    time: ~2-3 days on a regular machine).
>>>>>
>>>>>    b. Train on only the 500k new lines using the alignment
>>>>>    information of the original training data. (Faster: ~6-7 hours.)
>>>>>
>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>> TABLES.*
>>>>>
>>>>> 1. full_train.sh -------------- This trains on the original corpus of
>>>>> 2 million lines. (Generates alignment files only for the original corpus.)
>>>>>
>>>>> 2. align_new.sh -------------- This trains on the new corpus of 500k
>>>>> lines. (Generates alignment files only for the new corpus, using the
>>>>> alignments from 1.)
>>>>>
>>>>> *Why this split?* Because the basic training step of Moses does
>>>>> not preserve the alignment probability information. Only the alignments
>>>>> are saved. To continue training we need the probability information.
>>>>>
>>>>> You can pass flags to moses to preserve this information (this flag
>>>>> is --giza-option). If you do this then you will not need
>>>>> full_train.sh. But you will have to change the config files before using
>>>>> align_new.sh.
>>>>>
>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>
>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to
>>>>> the forward (fwd) alignments generated by full_train.sh.
>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to
>>>>> the inverse (inv) alignments generated by full_train.sh.
>>>>>
>>>>> 3. Run the moses training script with additional flags:
>>>>>
>>>>> - --first-step -- first step in the training process (default 1)
>>>>>   --------------- This will be 4
>>>>> - --last-step -- last step in the training process (default 7)
>>>>>   ------------ This will remain 7
>>>>> - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>> - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>
>>>>> For example:
>>>>>
>>>>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>>>> -corpus <your new corpus name> \
>>>>> -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>>> -lm 0:3:<path to LM>:8 \
>>>>> --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
>>>>> -external-bin-dir <path to giza++ binaries>
>>>>>
>>>>> For more details on the training step read this:
>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>
>>>>> What this does is assume that you already have alignments, continue
>>>>> with phrase extraction and reordering, and generate the new moses.ini
>>>>> file.
>>>>>
>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>
>>>>> If you are still unclear then please ask and I will try to help you as
>>>>> much as I can.
>>>>>
>>>>> Regards.
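Steps 1 and 2 of the phrase-table recipe quoted above amount to plain file concatenation: the alignments from full_train.sh first, the ones from align_new.sh after. A minimal sketch, assuming the alignment outputs are line-oriented text files; the function name append_alignments and the file names old_giza.fwd and inc_giza.fwd are hypothetical (only new_giza.fwd appears in the thread), and the demo content is placeholder text, not real MGIZA output.

```python
import os
import shutil
import tempfile


def append_alignments(old_path, new_path, out_path):
    """Write the old alignments followed by the new ones to out_path."""
    with open(out_path, "wb") as out:
        for path in (old_path, new_path):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
                # If a non-empty file lacks a trailing newline, add one so
                # the next file starts on a fresh line.
                if src.tell() > 0:
                    src.seek(-1, os.SEEK_END)
                    if src.read(1) != b"\n":
                        out.write(b"\n")


# Tiny demonstration with throwaway files and placeholder content.
workdir = tempfile.mkdtemp()
old_fwd = os.path.join(workdir, "old_giza.fwd")  # hypothetical name
inc_fwd = os.path.join(workdir, "inc_giza.fwd")  # hypothetical name
merged_fwd = os.path.join(workdir, "new_giza.fwd")
with open(old_fwd, "w") as f:
    f.write("a\nb")  # deliberately no trailing newline
with open(inc_fwd, "w") as f:
    f.write("c\n")
append_alignments(old_fwd, inc_fwd, merged_fwd)
with open(merged_fwd) as f:
    merged_text = f.read()
```

The same call would be repeated for the inverse (inv) alignments. The order matters: whatever corpus is passed to the training script has to list its sentence pairs in the same order as the merged alignments.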
>>>>>
>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> That’s great work, my friend.
>>>>>
>>>>> These files make the script work, but it takes a long time to finish,
>>>>> and it did not generate the model folder which contains the moses.ini
>>>>> file.
>>>>>
>>>>> Is this normal?
>>>>>
>>>>> I am now trying to run it again, as I suspect that the server was shut
>>>>> down before the training was completed, but I notice that it starts
>>>>> from the beginning and does not use the existing generated files.
>>>>>
>>>>> Thanks Raj, it is still great work.
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>> Ahh.... I totally forgot that part.
>>>>>
>>>>> Sorry.
>>>>>
>>>>> PFA.
>>>>>
>>>>> Just place them in the folder where the shell scripts full_train.sh
>>>>> and align_new.sh are.
>>>>>
>>>>> Hopefully it should run now.
>>>>>
>>>>> Please let me know if you succeed.
>>>>>
>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>
>>>>> Dear Raj,
>>>>>
>>>>> It is a great solution.
>>>>>
>>>>> I installed MGIZA++ successfully and I am using your scripts to run
>>>>> training.
>>>>>
>>>>> I followed the steps you mentioned, but I got this error when
>>>>> running the full_train.sh script:
>>>>>
>>>>> bla bla bla
>>>>> .
>>>>> .
>>>>> .
>>>>> .
>>>>>
>>>>> Starting MGIZA
>>>>> Initializing Global Paras
>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>> ERROR: Cannot open configuration file configgiza.fwd!
>>>>>
>>>>> Starting MGIZA
>>>>> Initializing Global Paras
>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>> ERROR: Cannot open configuration file configgiza.rev!
>>>>>
>>>>> These two files do not exist.
>>>>> Should they be generated by the installation?
>>>>> How do I get them?
>>>>>
>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>> *To:* [email protected]
>>>>> *Cc:* [email protected]
>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>
>>>>> Hello Ihab,
>>>>>
>>>>> I would suggest using mgiza++.
>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>
>>>>> It is very easy to use.
>>>>>
>>>>> I also wrote some scripts to make training easy.
>>>>> Visit the link below for my scripts.
>>>>>
>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>
>>>>> Usage:
>>>>>
>>>>> To train basic IBM models:
>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>
>>>>> To align 2 new files using previously trained models (aka continue
>>>>> training):
>>>>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>
>>>>> There is also a python script which you had better replace in the
>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I just need clear steps on how to do incremental training in Moses,
>>>>> as the explanation in the manual is not clear enough.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Best Regards
>>>>>
>>>>> *Ihab Ramadan* | Senior Developer | Saudisoft
>>>>> <http://www.saudisoft.com/> - Egypt | *Tel* +2 02 330 320 37 Ext- 0
>>>>> | Mob +201007570826 | Fax +20233032036
>>>>>
>>>>> _______________________________________________
>>>>> Moses-support mailing list
>>>>> [email protected]
>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>
>>>>> --
>>>>> Raj Dabre.
>>>>> Research Student,
>>>>> Graduate School of Informatics,
>>>>> Kyoto University.
>>>>> CSE MTech, IITB., 2011-2014
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
