Yes, remove the double space. Sometimes the double space is ignored and
sometimes it is counted as a 'word' with no characters, depending on exactly
how the program tokenizes the line.
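
To illustrate the two behaviours, here is a minimal Python sketch. It is not
GIZA++'s actual tokenizer, and the function names are made up for this
example. It only shows how the same line can yield different word counts
depending on the splitting rule, and one way to collapse repeated spaces
before training.

```python
def count_split_on_single_space(line):
    # Splitting on exactly one space character keeps an empty string
    # between two consecutive spaces, so it is counted as a "word".
    return len(line.rstrip("\n").split(" "))


def count_split_on_whitespace(line):
    # str.split() with no argument treats any run of whitespace as one
    # separator and drops leading/trailing whitespace.
    return len(line.split())


def normalize_spaces(line):
    # Collapse runs of whitespace to a single space and trim both ends.
    return " ".join(line.split())


if __name__ == "__main__":
    sample = "this line  has a double space"  # hypothetical corpus line
    print(count_split_on_single_space(sample))
    print(count_split_on_whitespace(sample))
    print(normalize_spaces(sample))
```

Running something like normalize_spaces over every line of both sides of the
corpus (keeping the line order unchanged) should make the token count the
same under either splitting rule.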




On 22 January 2014 10:09, amir haghighi <[email protected]> wrote:

> Thank you Hieu,
>
> The corpus is UTF-8, but there is a double space in this line. Are double
> spaces regarded as a word?
> Should I remove double spaces from the lines manually to get the correct
> sentence length?
>
>
>
> On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <[email protected]> wrote:
>
>>
>> On 20/01/2014 13:45, amir haghighi wrote:
>>
>>   Hello
>>
>>  I've some questions about the giza word alignment.
>>
>>  1- Where is the final alignment file? Is it the aligned.1.grow.... in the
>> model folder?
>>
>> yes.
>>
>>
>>  2-do indexes of the words of both target and source sentences start
>> from 0?
>>
>> yes
>>
>>
>>  3- how does giza calculate the length of a sentence?
>>
>> the number of words
>>
>>  I have a sentence with 11 tokens that are separated by spaces, but in
>> the alignment file its length is 13.
>>
>> strange. Are you sure your corpus file is encoded as UTF8? Are there
>> double spaces in the line?
>>
>>
>>  Regards
>>  Amir
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>>
>
>
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
