Re: [Moses-support] Incremental training

Raj Dabre Thu, 20 Nov 2014 03:06:13 -0800

Try this one.

On Thu, Nov 20, 2014 at 7:58 PM, Sandipan Dandapat <
[email protected]> wrote:


> Hi Raj,
> I am still getting the same error as follows:
> reading vocabulary files
> Reading vocabulary file from:new_corpus/inc.fr.vcb
> ERROR: TOKEN ID must be unique for each token, in line :
> 2 traité 34
> TOKEN ID 2 has already been assigned to: traité
>
> Your script is generating duplicate items. Maybe you can forward me the
> .py script again. I hope we are not using different versions of the same script!
>
> However, I made some changes in the .py script based on your suggestion,
> and it is working without any error. Please see the attached scripts.
> Regards,
> sandipan
>
>
> On 20 November 2014 09:24, Raj Dabre <[email protected]> wrote:
>
>> Hey,
>>
>> I just remembered that I have a pathetic memory.
>> I forgot to add the lines for sorting the .vcb file in increasing order
>> of id.
>>
>> Just add the following lines to align_new.sh after the line ---------
>> $MGIZA/scripts/plain2snt-hasvcb.py ../corpus/$4.vcb ../corpus/$3.vcb $2 $1
>> $2_$1.snt $1_$2.snt  $2.vcb $1.vcb :
>>
>> sort -n $1.vcb > tmp
>> mv tmp $1.vcb
>> sort -n $2.vcb > tmp
>> mv tmp $2.vcb
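
For anyone who would rather do this sort inside Python than with sort -n, here is a minimal sketch. It assumes every .vcb line has the form "id word count", as in the error messages in this thread; the function name sort_vcb_lines and the sample words are made up for illustration.

```python
def sort_vcb_lines(lines):
    """Return .vcb lines ("id word count") ordered by increasing numeric id."""
    entries = [ln for ln in lines if ln.strip()]  # skip blank lines
    return sorted(entries, key=lambda ln: int(ln.split()[0]))


# Made-up sample entries, deliberately out of order.
sample = ["24 roi 2\n", "2 the 34\n", "10 le 120\n"]
print("".join(sort_vcb_lines(sample)), end="")
```

To apply it to a real file, read the lines with readlines(), pass them through the function, and write the result back.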
>>
>> And it will run perfectly. I am sure of it. I used your folder just to be
>> sure. It works.
>> Sorry for my silliness. Lemme know if it works now.
>>
>> Regards.
>>
>> On Thu, Nov 20, 2014 at 1:13 AM, Raj Dabre <[email protected]> wrote:
>>
>>> Well then your paths must be wrong.
>>> I can't see why the files are not being generated.
>>> I'll look into it tomorrow and let you know.
>>>
>>>
>>> On 01:10, Thu, 20 Nov 2014 Sandipan Dandapat <[email protected]>
>>> wrote:
>>>
>>>> When I use your script, there is no problem. But when I modified
>>>> the lines to nid = len(fvcb)+2; no .vcb files were generated in the
>>>> new_corpus/ dir. I used these two commands:
>>>>
>>>> sh full_train.sh org.en org.fr
>>>>  sh align_new.sh inc.en inc.fr org.en org.fr
>>>>
>>>> Is the above right?
>>>>
>>>> I have kept the paths (MGIZA, MODEL_BASE and CORPUS_BASE,
>>>> NEW_CORPUS_BASE) hard-coded in the scripts.
>>>>
>>>>
>>>> On 19 November 2014 15:49, Raj Dabre <[email protected]> wrote:
>>>>
>>>>> Cannot open file?
>>>>> Does the file exist?
>>>>> Are you passing the path properly?
>>>>>
>>>>>
>>>>> On 00:44, Thu, 20 Nov 2014 Sandipan Dandapat <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I made the changes based on your suggestions; it is now generating a
>>>>>> different error, as below:
>>>>>>
>>>>>>
>>>>>> reading vocabulary files
>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>>
>>>>>> Cannot open vocabulary file new_corpus/inc.fr.vcbfil
>>>>>>
>>>>>> I am attaching the working dir and the .py scripts herewith. The
>>>>>> 10 parallel sentences for incremental alignment are in inc_data/, whereas
>>>>>> the original 500 sentences are in the mtdata/ directory.
>>>>>>
>>>>>> Thanks a ton for your help.
>>>>>>
>>>>>> Regards,
>>>>>> sandipan
>>>>>>
>>>>>> On 19 November 2014 15:18, Raj Dabre <[email protected]> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>>
>>>>>>> I am pretty sure that my script does not generate duplicate token id.
>>>>>>>
>>>>>>> In fact, I used to get the same error till I modified the script.
>>>>>>>
>>>>>>> In case you do want to avoid this error and not use my script then:
>>>>>>>
>>>>>>> 1. Open the original python script: plain2snt-hasvcb.py
>>>>>>> 2. There is a line which sets the id for a new token (the line
>>>>>>> is nid = len(fvcb)+1;)
>>>>>>> 3. Make this line: nid = len(fvcb)+2; (This is because the id
>>>>>>> numbering of words starts from 2, and thus if you have 23 tokens then
>>>>>>> the ids will go from 2 to 24. The original script will do:
>>>>>>> nid = 23 + 1 = 24, which is already taken, and the modification will
>>>>>>> give 25 correctly). The same change is needed in a second place:
>>>>>>> nid = len(evcb)+2;
>>>>>>>
>>>>>>> Do this and it will work.
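
The arithmetic in step 3 can be checked with a small sketch. It assumes, as in the example above, a vocabulary of 23 tokens whose ids run from 2 to 24; the helper next_id and the placeholder words are hypothetical and only mimic the nid line of the script.

```python
def next_id(vocab, offset):
    """Id given to an unseen word when ids are assigned as len(vocab) + offset."""
    return len(vocab) + offset


# Hypothetical vocabulary: 23 placeholder tokens with ids 2..24.
vocab = {"w%d" % i: [i, 1] for i in range(2, 25)}
used_ids = {entry[0] for entry in vocab.values()}

# With +1 the new id is 24, which is already in use; with +2 it is 25, which is free.
print(next_id(vocab, 1), next_id(vocab, 1) in used_ids)
print(next_id(vocab, 2), next_id(vocab, 2) in used_ids)
```

Note that this only holds while the existing ids are contiguous from 2; a gap in the numbering would need a different rule, such as taking the largest existing id plus one.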
>>>>>>>
>>>>>>> In any case, send me a zip file of your working directory (if it's
>>>>>>> small; you are testing it on small data, right?). I will see what
>>>>>>> the problem is.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 19, 2014 at 11:44 PM, Sandipan Dandapat <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Raj,
>>>>>>>> I also tried to use your scripts for incremental alignment. I
>>>>>>>> copied your python script into the desired directory, but I am still
>>>>>>>> receiving the same error as posted by Ihab.
>>>>>>>> reading vocabulary files
>>>>>>>> Reading vocabulary file from:new_corpus/inc.fr.vcb
>>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>>> 24 roi 2
>>>>>>>> TOKEN ID 24 has already been assigned to: roi
>>>>>>>>
>>>>>>>> I took only 500 sentence pairs for full_train.sh and it worked
>>>>>>>> fine, with 758 lines in the corpus/tgt_filename.vcb file.
>>>>>>>>
>>>>>>>> I took only 10 sentences for the incremental alignment (align_new.sh),
>>>>>>>> which generated the error, and I found 8054 lines in
>>>>>>>> new_corpus/new_tgt_file.vcb.
>>>>>>>> Is there any problem? Can you please help me with this?
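
A quick way to see which ids are repeated in a .vcb file before handing it to MGIZA is sketched below. It assumes the "id word count" line format shown in the error above; find_duplicate_ids is a made-up helper and not part of MGIZA, whose own check may differ in detail.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Map every token id that occurs on more than one line to its words."""
    words_by_id = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        words_by_id[int(fields[0])].append(fields[1])
    return {tid: words for tid, words in words_by_id.items() if len(words) > 1}


# Made-up sample in which id 24 is assigned twice.
sample = ["2 the 40", "24 roi 2", "24 reine 1"]
print(find_duplicate_ids(sample))
```

For a real file, pass the open file object, for example find_duplicate_ids(open("new_corpus/inc.fr.vcb")). An empty result means no id is repeated.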
>>>>>>>>
>>>>>>>> Thanks and regards,
>>>>>>>> sandipan
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4 November 2014 16:13, prajdabre <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Ihab.
>>>>>>>>> There is a python script that was there in the google drive folder
>>>>>>>>> in the first mail I sent you.
>>>>>>>>> Please replace the existing file with my copy.
>>>>>>>>>
>>>>>>>>> It has to work.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -------- Original message --------
>>>>>>>>> From: Ihab Ramadan <[email protected]>
>>>>>>>>> Date: 05/11/2014 00:54 (GMT+09:00)
>>>>>>>>> To: 'Raj Dabre' <[email protected]>
>>>>>>>>> Cc: [email protected]
>>>>>>>>> Subject: RE: [Moses-support] Incremental training
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear Raj,
>>>>>>>>>
>>>>>>>>> Your point is clear and I tried to follow the steps you mentioned,
>>>>>>>>> but I am now stuck at the align_new.sh script, which gives me this error:
>>>>>>>>>
>>>>>>>>> reading vocabulary files
>>>>>>>>>
>>>>>>>>> Reading vocabulary file from:new_corpus/TraningTarget.txt.vcb
>>>>>>>>>
>>>>>>>>> ERROR: TOKEN ID must be unique for each token, in line :
>>>>>>>>>
>>>>>>>>> 29107 q-1 4
>>>>>>>>>
>>>>>>>>> Do you have any idea what this error means?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>> *Sent:* Tuesday, November 4, 2014 12:06 PM
>>>>>>>>> *To:* [email protected]
>>>>>>>>> *Cc:* [email protected]
>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear Ihab,
>>>>>>>>>
>>>>>>>>> Perhaps I should have mentioned much more clearly what my script
>>>>>>>>> does. Sorry for that.
>>>>>>>>>
>>>>>>>>> Let me start with this: There is no direct/easy way to generate
>>>>>>>>> the moses.ini file as you need.
>>>>>>>>>
>>>>>>>>> 1. Suppose you have 2 million lines of parallel corpora and you
>>>>>>>>> trained a SMT system for it. This naturally gives the phrase table,
>>>>>>>>> reordering table and moses.ini.
>>>>>>>>>
>>>>>>>>> 2. Suppose you got 500 k more lines of parallel corpora.... there
>>>>>>>>> are 2 ways:
>>>>>>>>>
>>>>>>>>>     a. Retrain 2.5 million lines from scratch (will take lots of
>>>>>>>>> time: ~ 2-3 days on a regular machine)
>>>>>>>>>
>>>>>>>>>     b. Train on only the 500k new lines using the alignment
>>>>>>>>> information of the original training data. (Faster: ~ 6-7 hours).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What my scripts do: *THEY ONLY GENERATE ALIGNMENTS and NOT PHRASE
>>>>>>>>> TABLES.*
>>>>>>>>>
>>>>>>>>> 1. full_train.sh -------------- This trains on the original corpus
>>>>>>>>> of 2 million lines. (Generates alignment files only for the original
>>>>>>>>> corpus)
>>>>>>>>>
>>>>>>>>> 2. align_new.sh -------------- This trains on the new corpus of
>>>>>>>>> 500 k lines. (Generates alignment files only for the new corpus using
>>>>>>>>> the alignments for 1)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Why this split ????* Because the basic training step of Moses
>>>>>>>>> does not preserve the alignment probability information. Only the
>>>>>>>>> alignments are saved. To continue training we need the probability
>>>>>>>>> information.
>>>>>>>>>
>>>>>>>>> You can pass flags to moses to preserve this information (this
>>>>>>>>> flag is --giza-option). If you do this then you will not need
>>>>>>>>> full_train.sh, but you will have to change the config files before
>>>>>>>>> using align_new.sh.
>>>>>>>>>
>>>>>>>>> *HOW TO GET UPDATED PHRASE TABLE:*
>>>>>>>>>
>>>>>>>>> 1. Append the forward alignments (fwd) generated by align_new.sh
>>>>>>>>> to the forward (fwd) alignments generated by full_train.sh.
>>>>>>>>> 2. Append the inverse alignments (inv) generated by align_new.sh
>>>>>>>>> to the inverse (inv) alignments generated by full_train.sh.
>>>>>>>>>
>>>>>>>>> 3. Run the moses training script with additional flags:
>>>>>>>>>
>>>>>>>>>    - --first-step -- first step in the training process (default
>>>>>>>>>    1)--------------- This will be 4
>>>>>>>>>    - --last-step -- last step in the training process (default
>>>>>>>>>    7)------------ This will remain 7
>>>>>>>>>    - --giza-f2e -- <path to folder>/new_giza.fwd
>>>>>>>>>    - --giza-e2f -- <path to folder>/new_giza.inv
>>>>>>>>>
>>>>>>>>> For example:
>>>>>>>>>
>>>>>>>>> ~/mosesdecoder/scripts/training/train-model.perl \
>>>>>>>>>  -root-dir <your training directory> \
>>>>>>>>>  -corpus <your new corpus name> \
>>>>>>>>>  -f <src> -e <tgt> -alignment grow-diag-final-and \
>>>>>>>>>  -reordering msd-bidirectional-fe \
>>>>>>>>>  -lm 0:3:<path to LM>:8 \
>>>>>>>>>  --first-step 4 --last-step 7 \
>>>>>>>>>  --giza-f2e <path to folder>/new_giza.fwd \
>>>>>>>>>  --giza-e2f <path to folder>/new_giza.inv \
>>>>>>>>>  -external-bin-dir <path to giza++ binaries>
>>>>>>>>>
>>>>>>>>> For more details on the training step read this:
>>>>>>>>> http://www.statmt.org/moses/?n=FactoredTraining.TrainingParameters
>>>>>>>>>
>>>>>>>>> What this does is assume that you already have alignments, continue
>>>>>>>>> with phrase extraction and reordering, and generate the new moses.ini file.
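
Steps 1 and 2 above (appending the new alignments to the old ones) amount to plain file concatenation. A minimal sketch follows, under the assumption that the alignment files are ordinary text files that can simply be joined end to end. The helper concatenate and the input file names are hypothetical; only new_giza.fwd is a name used in this thread.

```python
import os
import shutil
import tempfile


def concatenate(paths, out_path):
    """Write the contents of the files in `paths`, in order, to `out_path`."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Demo with throwaway files; a real run would pass the alignment files
# produced by full_train.sh and align_new.sh (the names here are hypothetical).
workdir = tempfile.mkdtemp()
old_fwd = os.path.join(workdir, "old.fwd")
new_fwd = os.path.join(workdir, "new.fwd")
merged = os.path.join(workdir, "new_giza.fwd")
for path, text in ((old_fwd, "old alignments\n"), (new_fwd, "new alignments\n")):
    with open(path, "w") as handle:
        handle.write(text)

concatenate([old_fwd, new_fwd], merged)
with open(merged) as handle:
    print(handle.read(), end="")
```

The same call, with the inverse alignment files, would give new_giza.inv. The old file goes first so that the original corpus keeps its position.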
>>>>>>>>>
>>>>>>>>> WARNING: Specify the filenames and paths properly *OR IT WILL
>>>>>>>>> FAIL.*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you are still unclear then please ask and I will try to help
>>>>>>>>> you as much as I can.
>>>>>>>>>
>>>>>>>>> Regards.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Nov 4, 2014 at 6:09 PM, Ihab Ramadan <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Dear Raj,
>>>>>>>>>
>>>>>>>>> That’s great work, my friend.
>>>>>>>>>
>>>>>>>>> These files make the script work, but it takes a long time to finish;
>>>>>>>>> also, it did not generate the model folder which contains the
>>>>>>>>> moses.ini file.
>>>>>>>>>
>>>>>>>>> Is this normal?
>>>>>>>>>
>>>>>>>>> I am now trying to run it again, as I suspect that the server was
>>>>>>>>> shut down before the training was completed, but I notice that it
>>>>>>>>> starts from the beginning and does not use the existing generated files.
>>>>>>>>>
>>>>>>>>> Thanks Raj, it is still great work.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>> *Sent:* Thursday, October 30, 2014 4:54 PM
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *To:* [email protected]
>>>>>>>>> *Cc:* [email protected]
>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ahh.... i totally forgot that part.
>>>>>>>>>
>>>>>>>>> Sorry.
>>>>>>>>>
>>>>>>>>> PFA.
>>>>>>>>>
>>>>>>>>> Just place them in the folder where the shell scripts
>>>>>>>>> full_train.sh and align_new.sh are.
>>>>>>>>>
>>>>>>>>> Hopefully it should run now.
>>>>>>>>>
>>>>>>>>> Please let me know if you succeed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Oct 30, 2014 at 11:44 PM, Ihab Ramadan <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Dear Raj,
>>>>>>>>>
>>>>>>>>> It is a great solution
>>>>>>>>>
>>>>>>>>> I installed MGIZA++ successfully and I am using your scripts to
>>>>>>>>> run training
>>>>>>>>>
>>>>>>>>> And I followed the steps you mentioned, but I faced this error when
>>>>>>>>> I was running the full_train.sh script:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> bla bla  bla
>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Starting MGIZA
>>>>>>>>>
>>>>>>>>> Initializing Global Paras
>>>>>>>>>
>>>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>>>
>>>>>>>>> ERROR:  Cannot open configuration file configgiza.fwd!
>>>>>>>>>
>>>>>>>>> Starting MGIZA
>>>>>>>>>
>>>>>>>>> Initializing Global Paras
>>>>>>>>>
>>>>>>>>> DEBUG: EnterDEBUG: PrefixDEBUG: LogParsing Arguments
>>>>>>>>>
>>>>>>>>> ERROR:  Cannot open configuration file configgiza.rev!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> These two files do not exist.
>>>>>>>>>
>>>>>>>>> Should they be generated by the installation?
>>>>>>>>>
>>>>>>>>> How do I get them?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From:* Raj Dabre [mailto:[email protected]]
>>>>>>>>> *Sent:* Sunday, October 26, 2014 6:21 PM
>>>>>>>>> *To:* [email protected]
>>>>>>>>> *Cc:* [email protected]
>>>>>>>>> *Subject:* Re: [Moses-support] Incremental training
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hello Ihab,
>>>>>>>>>
>>>>>>>>> I would suggest using mgiza++:
>>>>>>>>> http://www.kyloo.net/software/doku.php/mgiza:overview
>>>>>>>>>
>>>>>>>>> It is very easy to use.
>>>>>>>>>
>>>>>>>>> I also wrote some scripts to make it easy for training.
>>>>>>>>> Visit the link below for my scripts.
>>>>>>>>> https://drive.google.com/folderview?id=0B2gN8qfxTTUoSU43OFBhZXpPZ3M&usp=sharing
>>>>>>>>>
>>>>>>>>> Usage:
>>>>>>>>>
>>>>>>>>> To train basic IBM models:
>>>>>>>>> bash full_train.sh <src_corpus_file_name> <tgt_corpus_file_name>
>>>>>>>>> <model_folder_base> <corpus_folder_base> 
>>>>>>>>> <path_to_mgizapp_installation>
>>>>>>>>>
>>>>>>>>> To align 2 new files using previously trained models (aka continue
>>>>>>>>> training).
>>>>>>>>>
>>>>>>>>> bash align_new.sh <new_src_corpus_file_name>
>>>>>>>>> <new_tgt_corpus_file_name> <old_src_corpus_file_name>
>>>>>>>>> <old_tgt_corpus_file_name> <model_folder_base> <corpus_folder_base>
>>>>>>>>> <path_to_mgizapp_installation>
>>>>>>>>>
>>>>>>>>> There is also a python script which you had better replace in the
>>>>>>>>> scripts folder of mgiza++. I have modified it to work with my scripts.
>>>>>>>>>
>>>>>>>>> Hope this helps.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Oct 26, 2014 at 11:05 PM, Ihab Ramadan <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Dear All,
>>>>>>>>>
>>>>>>>>> I just need clear steps on how to do incremental training in
>>>>>>>>> moses, as the illustration in the manual is not clear enough.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> *Ihab Ramadan*| Senior Developer| Saudisoft
>>>>>>>>> <http://www.saudisoft.com/> - Egypt | *Tel * +2 02 330 320 37
>>>>>>>>> Ext- 0 | Mob+201007570826 | Fax+20233032036
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Moses-support mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Raj Dabre.
>>>>>>>>> Research Student,
>>>>>>>>>
>>>>>>>>> Graduate School of Informatics,
>>>>>>>>> Kyoto University.
>>>>>>>>>
>>>>>>>>> CSE MTech, IITB., 2011-2014
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
>>
>>
>>
>


-- 
Raj Dabre.
Research Student,
Graduate School of Informatics,
Kyoto University.
CSE MTech, IITB., 2011-2014

#!/usr/bin/env python
"""Convert a new plain-text parallel corpus to .snt files, reusing existing .vcb files."""

import sys


def loadvcb(fname):
    """Load a .vcb file (lines of "id word count") into {word: [id, count]}."""
    vocab = {}
    with open(fname, "r") as df:
        for line in df:
            ws = line.strip().split()
            vocab[ws[1]] = [int(ws[0]), int(ws[2])]
    sys.stdout.write("%d\n" % len(vocab))
    return vocab


def encode(words, vocab):
    """Map words to ids, updating counts; unseen words get a fresh id."""
    ids = []
    for w in words:
        # Try the exact form first, then fall back to the lowercased form.
        key = w if w in vocab else w.lower()
        if key in vocab:
            vocab[key][1] += 1
        else:
            # Ids start at 2, so with N known words the next free id is N + 2.
            vocab[key] = [len(vocab) + 2, 1]
        ids.append(vocab[key][0])
    return ids


if len(sys.argv) < 9:
    sys.stderr.write("Error, the input should be \n")
    sys.stderr.write("%s evcb fvcb etxt ftxt esnt(out) fsnt(out) evcbx(out) fvcbx(out)\n" % sys.argv[0])
    sys.stderr.write("You should concatenate the evcbx and fvcbx to existing vcb files\n")
    sys.exit(1)

evcb = loadvcb(sys.argv[1])
fvcb = loadvcb(sys.argv[2])

with open(sys.argv[3], "r") as ein, open(sys.argv[4], "r") as fin, \
        open(sys.argv[5], "w") as eout, open(sys.argv[6], "w") as fout:
    while True:
        eline = ein.readline()
        fline = fin.readline()
        # Stop as soon as either side of the parallel corpus runs out.
        if len(eline) == 0 or len(fline) == 0:
            break
        el = encode(eline.strip().split(), evcb)
        fl = encode(fline.strip().split(), fvcb)
        etext = "".join("%d " % i for i in el)
        ftext = "".join("%d " % i for i in fl)
        # Each record is three lines: the count "1", then the two id sequences.
        eout.write("1\n%s\n%s\n" % (etext, ftext))
        fout.write("1\n%s\n%s\n" % (ftext, etext))

# Write the complete updated vocabularies (old and new words). The entries are
# not sorted by id here; align_new.sh sorts them afterwards with sort -n.
for path, vocab in ((sys.argv[7], evcb), (sys.argv[8], fvcb)):
    with open(path, "w") as out:
        for word, (wid, count) in vocab.items():
            out.write("%d %s %d\n" % (wid, word, count))
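
As an illustration of what the script above writes to the two .snt files, here is a small sketch of a single record. The helper snt_record and the ids are made up; the layout (a line with 1, then the two id sequences, each id followed by a space) is read off the write calls in the script.

```python
def snt_record(first_ids, second_ids):
    """Format one sentence pair as a three-line .snt record."""
    first = "".join("%d " % i for i in first_ids)
    second = "".join("%d " % i for i in second_ids)
    return "1\n%s\n%s\n" % (first, second)


# Made-up ids for one sentence pair.
el = [2, 3]
fl = [4, 5, 6]

# The e-side file lists the e ids first; the f-side file lists the f ids first.
print(snt_record(el, fl), end="")
print(snt_record(fl, el), end="")
```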

