I removed all of the double spaces from the corpus, but there are still some double spaces in the tokenised file. My source language is Persian, and my corpus contains half-spaces. I noticed that after the tokenisation step, these half-spaces are converted to double spaces. This conversion distorts the sentence length and the alignment. How can I prevent this conversion?
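To make the problem concrete, here is a minimal Python sketch of how the mismatch could be checked on the tokenised file. It assumes that the half-space is U+200C (ZERO WIDTH NON-JOINER) and that the aligner counts words by splitting on single spaces; neither assumption has been confirmed against the tokeniser or GIZA++, and all function names are illustrative rather than part of Moses.

```python
import re

# Assumption: the Persian half-space is U+200C (ZERO WIDTH NON-JOINER).
ZWNJ = "\u200c"


def count_tokens_naive(line):
    """Split on every single space, so a double space yields an empty 'word'."""
    return len(line.rstrip("\n").split(" "))


def count_tokens_whitespace(line):
    """Split on runs of whitespace, so double spaces are ignored."""
    return len(line.split())


def normalize_spaces(line):
    """Collapse runs of spaces or tabs into one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", line).strip()


def find_suspect_lines(lines):
    """Yield (line_number, reason) for lines likely to inflate the token count."""
    for number, line in enumerate(lines, start=1):
        if "  " in line:
            yield number, "double space"
        # A half-space surrounded by spaces is invisible but is its own token.
        if ZWNJ in line.rstrip("\n").split(" "):
            yield number, "half-space as separate token"
```

Running `find_suspect_lines` over the tokenised file and comparing the two counts on each reported line should show whether the extra "words" come from real double spaces or from a half-space that has been split off as an invisible token of its own.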
Thank you again
Amir

On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang <[email protected]> wrote:

> yes, remove the double space. Sometimes, the double space is ignored,
> sometimes it's counted as a 'word' with no characters, depending on exactly
> how the program tokenizes the line.
>
> On 22 January 2014 10:09, amir haghighi <[email protected]> wrote:
>
>> Thank you Hieu,
>>
>> The corpus is utf8, but there is a double space in this line. are double
>> spaces regarded as a word?
>> should I remove double spaces from the lines manually to get the correct
>> sentence's length?
>>
>> On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <[email protected]> wrote:
>>
>>> On 20/01/2014 13:45, amir haghighi wrote:
>>>
>>> Hello
>>>
>>> I've some questions about the giza word alignment.
>>>
>>> 1-where is the final alignment file?Is it the aligned.1.grow.... in
>>> the model folder?
>>>
>>> yes.
>>>
>>> 2-do indexes of the words of both target and source sentences start
>>> from 0?
>>>
>>> yes
>>>
>>> 3- how does giza calculate the length of a sentence?
>>>
>>> the number of words
>>>
>>> I have a sentence with 11 tokens that are separated with space, but in
>>> the alignment file it length is 13.
>>>
>>> strange. Are you sure your corpus file is encoded as UTF8? Are there
>>> double spaces in the line?
>>>
>>> Regards
>>> Amir
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
