Thanks a ton Raj. This worked for me too.

regards,
sandipan

On 20 November 2014 11:02, Raj Dabre <[email protected]> wrote:
> Try this one.
>
> On Thu, Nov 20, 2014 at 7:58 PM, Sandipan Dandapat <[email protected]> wrote:
>
>> Hi Raj,
>> I am still getting the same error as follows:
>>
>> reading vocabulary files
>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>> ERROR: TOKEN ID must be unique for each token, in line :
>> 2 traité 34
>> TOKEN ID 2 has already been assigned to: traité
>>
>> Your script is generating duplicate items. Maybe you can forward me the .py script again. I hope we are not using different versions of the same!
>>
>> However, I made some changes in the .py script based on your suggestion and it is working without any error. Please see the attached scripts.
>>
>> Regards,
>> sandipan
>>
>> On 20 November 2014 09:24, Raj Dabre <[email protected]> wrote:
>>
>>> Hey,
>>>
>>> I just remembered that I have a pathetic memory.
>>> I forgot to add the lines for sorting the .vcb file in increasing order of id.
>>>
>>> Just add the following lines to align_new.sh after the line:
>>>
>>> $MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1 $2_$1.snt $1_$2.snt $2.vcb $1.vcb
>>>
>>> The lines to add:
>>>
>>> sort -n $1.vcb > tmp
>>> mv tmp $1.vcb
>>> sort -n $2.vcb > tmp
>>> mv tmp $2.vcb
>>>
>>> And it will run perfectly. I am sure of it. I used your folder just to be sure. It works.
>>> Sorry for my silliness. Lemme know if it works now.
>>>
>>> Regards.
>>>
>>> On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre <[email protected]> wrote:
>>>
>>>> Well then your paths must be wrong.
>>>> I can't see why the files are not being generated.
>>>> I'll look into it tomorrow and let you know.
>>>>
>>>> On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:
>>>>
>>>>> When I am using your script then it has no problem. But when I modified the lines to nid = len(fvcb)+2; there are no .vcb files in the new_corpus/ dir.
>>>>> I used these two commands.
>>>>>
>>>>> sh full_train.sh org.en org.fr
>>>>> sh align_new.sh inc.en inc.fr org.en org.fr
>>>>>
>>>>> Is the above right?
>>>>>
>>>>> I have kept the paths (MGIZA, MODEL_BASE, CORPUS_BASE and NEW_CORPUS_BASE) hard-coded in the scripts.
>>>>>
>>>>> On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:
>>>>>
>>>>>> Cannot open file???
>>>>>> Does the file exist??
>>>>>> Are you passing the path properly?
>>>>>>
>>>>>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> I made the changes based on your suggestions; it's now generating a different error, as below:
>>>>>>>
>>>>>>> reading vocabulary files
>>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>>>
>>>>>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>>>>>
>>>>>>> I am attaching the working dir and the .py scripts herewith. The 10 parallel sentences for incremental alignment are in inc_data/, whereas the original 500 sentences are in the mtdata/ directory.
>>>>>>>
>>>>>>> Thanks a ton for your help.
>>>>>>>
>>>>>>> Regards,
>>>>>>> sandipan
>>>>>>>
>>>>>>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey,
>>>>>>>>
>>>>>>>> I am pretty sure that my script does not generate duplicate token ids.
>>>>>>>>
>>>>>>>> In fact, I used to get the same error till I modified the script.
>>>>>>>>
>>>>>>>> In case you do want to avoid this error and not use my script then:
>>>>>>>>
>>>>>>>> 1. Open the original python script: plain2snt-hasvcb.py
>>>>>>>> 2. There is a line which increments the id counter by 1 (the line is nid = len(fvcb)+1;)
>>>>>>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id numbering starts from 1, and thus if you have 23 tokens then the ids will go from 2 to 24. The original update script will do: nid = 23 + 1 = 24, which is already assigned, and the modification will give 25 correctly.) This is in 2 places; the other line becomes nid = len(evcb)+2;
>>>>>>>>
>>>>>>>> Do this and it will work.
>>>>>>>>
>>>>>>>> In any case... send me a zip file of your working directory (if it's small.... you are testing it on small data, right?). I will see what the problem is.
>>>>>>>>
>>>>>>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Raj,
>>>>>>>>> I also tried to use your scripts for incremental alignment. I copied your python script into the desired directory, but I am still receiving the same error as posted by Ihab.
>>>>>>>>>
>>>>>>>>> reading vocabulary files
>>>>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>>>> 24 roi 2
>>>>>>>>> TOKEN ID 24 has already been assigned to: roi
>>>>>>>>>
>>>>>>>>> I took only 500 sentence pairs for full_train.sh and it worked fine, with 758 lines in the corpus/tgt_filename.vcb file.
>>>>>>>>>
>>>>>>>>> I took only 10 sentences for the incremental align_new.sh run, which generated the error, and I found 8054 lines in new_corpus/new_tgt_file.vcb.
>>>>>>>>> Is there any problem? Can you please help me with this?
>>>>>>>>>
>>>>>>>>> Thanks and regards,
>>>>>>>>> sandipan
>>>>>>>>>
>>>>>>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Ihab.
>>>>>>>>>> There is a python script that was there in the google drive folder in the first mail I sent you.
>>>>>>>>>> Please replace the existing file with my copy.
>>>>>>>>>>
>>>>>>>>>> It has to work.
>>>>>>>>>>
>>>>>>>>>> Regards.
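
The duplicate-ID failures reported in this thread can be checked for before running MGIZA. Below is a minimal sketch in Python, assuming each .vcb line has the form "<id> <token> <count>" as in the error output quoted here (for example "24 roi 2"). The function names are hypothetical and are not part of plain2snt-hasvcb.py or MGIZA. It also illustrates the off-by-one discussed above: with 23 tokens numbered 2 to 24, len+1 gives an ID that is already taken, while len+2 and max+1 both give 25.

```python
from collections import defaultdict


def parse_vcb(lines):
    """Parse '<id> <token> <count>' lines into (id, token, count) tuples."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0]), parts[1], int(parts[2])))
    return entries


def duplicate_ids(entries):
    """Return {id: [tokens]} for every ID that appears on more than one line."""
    seen = defaultdict(list)
    for token_id, token, _count in entries:
        seen[token_id].append(token)
    return {i: toks for i, toks in seen.items() if len(toks) > 1}


def next_free_id(entries):
    """Next unused ID: one past the largest ID already present.

    Assumption: an empty vocabulary starts its tokens at ID 2.
    """
    return max((e[0] for e in entries), default=1) + 1


def sorted_by_id(entries):
    """Order entries by ascending numeric ID, like `sort -n` on the file."""
    return sorted(entries, key=lambda e: e[0])


if __name__ == "__main__":
    # Old vocabulary: 23 tokens with IDs 2..24, as in the example above.
    old = [(i, "w%d" % i, 1) for i in range(2, 25)]
    print(len(old) + 1)       # 24, collides with an existing ID
    print(len(old) + 2)       # 25
    print(next_free_id(old))  # 25
```

Taking the next ID from the largest existing ID does not depend on the IDs being contiguous, which is the design choice this sketch is meant to show; whether the real script can rely on contiguous IDs is not something this thread settles.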
>>>>>>>>>>
>>>>>>>>>> -------- Original message --------
>>>>>>>>>> From: Ihab Ramadan <[email protected]>
>>>>>>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>>>>>>> To: 'Raj Dabre' <[email protected]>
>>>>>>>>>> Cc: [email protected]
>>>>>>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>>>>>>
>>>>>>>>>> Dear Raj,
>>>>>>>>>>
>>>>>>>>>> Your point is clear and I tried to follow the steps you mentioned, but I am now stuck at the align_new.sh script, which gives me this error:
>>>>>>>>>>
>>>>>>>>>> reading vocabulary files
>>>>>>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>>>>> 29107 q-1 4
>>>>>>>>>>
>>>>>>>>>> Do you have any idea what this error means?
>>>>>>>>>>
>>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>>>>>>> *To:* [email protected]
>>>>>>>>>> *Cc:* [email protected]
>>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>>
>>>>>>>>>> Dear Ihab,
>>>>>>>>>>
>>>>>>>>>> Perhaps I should have mentioned much more clearly what my script does. Sorry for that.
>>>>>>>>>>
>>>>>>>>>> Let me start with this: There is no direct/easy way to generate the moses.ini file as you need.
>>>>>>>>>>
>>>>>>>>>> 1. Suppose you have 2 million lines of parallel corpora and you trained an SMT system on it. This naturally gives the phrase table, reordering table and moses.ini.
>>>>>>>>>>
>>>>>>>>>> 2. Suppose you got 500k more lines of parallel corpora.... there are 2 ways:
>>>>>>>>>>
>>>>>>>>>>    a. Retrain on all 2.5 million lines from scratch (will take lots of time: ~ 2-3 days on a regular machine)
>>>>>>>>>>
>>>>>>>>>>    b. Train on only the 500k new lines using the alignment information of the original training data. (Faster: ~ 6-7 hours).
>>>>>>>>>>
>>>>>>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE TABLES.*
>>>>>>>>>>
>>>>>>>>>> 1. full_train.sh: This trains on the original corpus of 2 million lines. (Generates alignment files only for the original corpus)
>>>>>>>>>>
>>>>>>>>>> 2. align_new.sh: This trains on the new corpus of 500k lines. (Generates alignment files only for the new corpus, using the alignments from 1)
>>>>>>>>>>
>>>>>>>>>> *Why this split ????* Because the basic training step of Moses does not preserve the alignment probability information. Only the alignments are saved. To continue training we need the probability information.
>>>>>>>>>>
>>>>>>>>>> You can pass flags to moses to preserve this information (this flag is --giza-option). If you do this then you will not need full_train.sh. But you will have to change the config files before using align_new.sh.
>>>>>>>>>>
>>>>>>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>>>>>>
>>>>>>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh to the forward (fwd) alignments generated by full_train.sh.
>>>>>>>>>>
>>>>>>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh to the inverse (inv) alignments generated by full_train.sh.
>>>>>>>>>>
>>>>>>>>>> 3. Run the moses training script with additional flags:
>>>>>>>>>>
>>>>>>>>>>    - --first-step -- first step in the training process (default 1). This will be 4.
>>>>>>>>>>    - --last-step -- last step in the training process (default 7). This will remain 7.
>>>>>>>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>>>>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>>>>>>
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> ~/mosesdecoder/scripts/training/train-model.perl -root-dir <your training directory> \
>>>>>>>>>>     -corpus <your new corpus name> \
>>>>>>>>>>     -f <src> -e <tgt> -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
>>>>>>>>>>     -lm 0:3:<path to LM>:8 \
>>>>>>>>>>     --first-step 4 --last-step 7 --giza-f2e <path to folder>/new_giza.fwd --giza-e2f <path to folder>/new_giza.inv \
>>>>>>>>>>     -external-bin-dir <path to giza++ binaries>
>>>>>>>>>>
>>>>>>>>>> For more details on the training step read this: http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>>>>>>
>>>>>>>>>> What this does is assume that you already have alignments, continue with phrase extraction and reordering, and generate the new moses.ini file.
>>>>>>>>>>
>>>>>>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL FAIL.*
>>>>>>>>>>
>>>>>>>>>> If you are still unclear then please ask and I will try to help you as much as I can.
>>>>>>>>>>
>>>>>>>>>> Regards.
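
Steps 1 and 2 of the phrase-table update described above amount to plain file concatenation. Below is a minimal sketch in Python, assuming the forward and inverse alignment outputs are ordinary files. All file names in the example are hypothetical placeholders; they are not the names that full_train.sh and align_new.sh actually write, apart from new_giza.fwd and new_giza.inv, which are borrowed from the example command in the quoted message.

```python
import shutil


def append_alignments(old_path, new_path, out_path):
    """Write the contents of old_path followed by new_path into out_path."""
    with open(out_path, "wb") as out:
        for path in (old_path, new_path):
            with open(path, "rb") as src:
                # Stream in chunks so large alignment files need not fit in memory.
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    import sys

    # Hypothetical usage:
    #   python append_alignments.py old.fwd inc.fwd new_giza.fwd
    #   python append_alignments.py old.inv inc.inv new_giza.inv
    if len(sys.argv) == 4:
        append_alignments(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        print("usage: append_alignments.py OLD NEW OUT")
```

The old-then-new order here is an assumption. Whatever order is used has to match the order of the sentence pairs in the corpus handed to the training script, since alignments and sentences are paired up by position.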
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear Raj,
>>>>>>>>>>
>>>>>>>>>> That’s great work, my friend.
>>>>>>>>>>
>>>>>>>>>> These files make the script work, but it takes a long time to finish, and it did not generate the model folder which contains the moses.ini file.
>>>>>>>>>>
>>>>>>>>>> Is this normal?
>>>>>>>>>>
>>>>>>>>>> I am now trying to run it again, as I suspect that the server was shut down before the training was completed, but I notice that it starts from the beginning and does not use the existing generated files.
>>>>>>>>>>
>>>>>>>>>> Thanks Raj, it is still great work.
>>>>>>>>>>
>>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>>>>>> *To:* [email protected]
>>>>>>>>>> *Cc:* [email protected]
>>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>>
>>>>>>>>>> Ahh.... I totally forgot that part.
>>>>>>>>>>
>>>>>>>>>> Sorry.
>>>>>>>>>>
>>>>>>>>>> PFA.
>>>>>>>>>>
>>>>>>>>>> Just place them in the folder where the shell scripts full_train.sh and align_new.sh are.
>>>>>>>>>>
>>>>>>>>>> Hopefully it should run now.
>>>>>>>>>>
>>>>>>>>>> Please let me know if you succeed.
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear Raj,
>>>>>>>>>>
>>>>>>>>>> It is a great solution.
>>>>>>>>>>
>>>>>>>>>> I installed MGIZA++ successfully and I am using your scripts to run training.
>>>>>>>>>>
>>>>>>>>>> I followed the steps you mentioned, but I faced this error when I was running the full_train.sh script:
>>>>>>>>>>
>>>>>>>>>> bla bla bla
>>>>>>>>>> .
>>>>>>>>>> .
>>>>>>>>>> .
>>>>>>>>>> .
>>>>>>>>>>
>>>>>>>>>> Starting MGIZA
>>>>>>>>>> Initializing Global Paras
>>>>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>>>> ERROR: Cannot open configuration file configgiza.fwd!
>>>>>>>>>> Starting MGIZA
>>>>>>>>>> Initializing Global Paras
>>>>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>>>> ERROR: Cannot open configuration file configgiza.rev!
>>>>>>>>>>
>>>>>>>>>> These two files do not exist.
>>>>>>>>>> Should they be generated by the installation?
>>>>>>>>>> How do I get them?
>>>>>>>>>>
>>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>>>>>>> *To:* [email protected]
>>>>>>>>>> *Cc:* [email protected]
>>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>>
>>>>>>>>>> Hello Ihab,
>>>>>>>>>>
>>>>>>>>>> I would suggest using mgiza++: http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>>>>>>
>>>>>>>>>> It is very easy to use.
>>>>>>>>>>
>>>>>>>>>> I also wrote some scripts to make training easier.
>>>>>>>>>> Visit the link below for my scripts.
>>>>>>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>>>>>>
>>>>>>>>>> Usage:
>>>>>>>>>>
>>>>>>>>>> To train basic IBM models:
>>>>>>>>>>
>>>>>>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>>>>>
>>>>>>>>>> To align 2 new files using previously trained models (aka continue training):
>>>>>>>>>>
>>>>>>>>>> bash align_new.sh <new_src_corpus_file_name> <new_tgt_corpus_file_name> <old_src_corpus_file_name> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base> <path_to_mgizapp_installation>
>>>>>>>>>>
>>>>>>>>>> There is also a python script in the folder, which you had better use to replace the one in the scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear All,
>>>>>>>>>>
>>>>>>>>>> I just need clear steps on how to do incremental training in moses, as the illustration in the manual is not clear enough.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Best Regards
>>>>>>>>>>
>>>>>>>>>> *Ihab Ramadan* | Senior Developer | Saudisoft <http://www.saudisoft.com/> - Egypt
>>>>>>>>>> Tel +2 02 330 320 37 Ext- 0 | Mob +201007570826 | Fax +20233032036
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Moses-support mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Raj Dabre.
>>>>>>>>>> Research Student,
>>>>>>>>>> Graduate School of Informatics,
>>>>>>>>>> Kyoto University.
>>>>>>>>>> CSE MTech, IITB., 2011-2014
