Yes, remove the double space. Depending on exactly how the program tokenizes the line, a double space is sometimes ignored and sometimes counted as a 'word' with no characters, which inflates the reported sentence length.
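
To illustrate the point, here is a minimal Python sketch of two tokenization strategies and a cleanup step. It is not taken from GIZA++ or the Moses scripts; the function names and the sample sentence are made up for illustration, and I am only assuming that a tool which splits on a single space character behaves like the "naive" version here.

```python
import re


def count_tokens_naive(line: str) -> int:
    # Splitting on exactly one space turns every extra space
    # into an empty-string "token".
    return len(line.split(" "))


def count_tokens_whitespace(line: str) -> int:
    # Splitting on runs of whitespace ignores repeated spaces.
    return len(line.split())


def normalize_spaces(line: str) -> str:
    # Collapse runs of spaces/tabs into one space and trim both ends.
    return re.sub(r"[ \t]+", " ", line).strip()


# Hypothetical sentence: 6 words, two double spaces.
sample = "the  cat sat  on the mat"

print(count_tokens_naive(sample))       # 8
print(count_tokens_whitespace(sample))  # 6
print(normalize_spaces(sample))         # the cat sat on the mat
```

Running something like `normalize_spaces` over every line of both sides of the corpus before alignment should make the two counts agree, so you do not have to fix the lines by hand.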

On 22 January 2014 10:09, amir haghighi <[email protected]> wrote:

> Thank you Hieu,
>
> The corpus is utf8, but there is a double space in this line. are double
> spaces regarded as a word?
> should I remove double spaces from the lines manually to get the correct
> sentence's length?
>
> On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <[email protected]> wrote:
>
>> On 20/01/2014 13:45, amir haghighi wrote:
>>
>> Hello
>>
>> I've some questions about the giza word alignment.
>>
>> 1-where is the final alignment file?Is it the aligned.1.grow.... in the
>> model folder?
>>
>> yes.
>>
>> 2-do indexes of the words of both target and source sentences start
>> from 0?
>>
>> yes
>>
>> 3- how does giza calculate the length of a sentence?
>>
>> the number of words
>>
>> I have a sentence with 11 tokens that are separated with space, but in
>> the alignment file it length is 13.
>>
>> strange. Are you sure your corpus file is encoded as UTF8? Are there
>> double spaces in the line?
>>
>> Regards
>> Amir

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
