I removed all of the double spaces from the corpus, but there are still
some double spaces in the tokenised file.
My source language is Persian and I have half-spaces in my corpus. I
noticed that after the tokenisation step, these half-spaces are converted
to double spaces. This conversion disturbs the sentence length and the
alignment.
How can I prevent this conversion?
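
In case it helps to reproduce the problem, here is a minimal Python sketch of
the check I have in mind. It is not part of Moses or GIZA++, and all function
names are hypothetical. It assumes that the "half-space" is the zero-width
non-joiner (U+200C) and that a tool splitting on single spaces would count a
double space as an extra empty word, as described in the reply quoted below.

```python
import re

# Assumption: the Persian "half-space" is the zero-width non-joiner.
ZWNJ = "\u200c"


def naive_token_count(line):
    """Count tokens by splitting on single spaces.

    A double space produces an empty string between the two spaces,
    which is counted here as an extra (empty) word.
    """
    return len(line.rstrip("\n").split(" "))


def collapse_spaces(line):
    """Collapse runs of spaces or tabs into one space and trim the ends."""
    return re.sub(r"[ \t]+", " ", line).strip()


def count_half_spaces(line):
    """Count zero-width non-joiners, to compare a line before and after tokenisation."""
    return line.count(ZWNJ)


def find_suspect_lines(lines):
    """Return (line_number, naive_count, clean_count) where the two counts differ."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        naive = naive_token_count(line)
        clean = len(collapse_spaces(line).split())
        if naive != clean:
            suspects.append((number, naive, clean))
    return suspects
```

Running find_suspect_lines over the tokenised file should list the lines whose
length would be inflated, and comparing count_half_spaces on the same line
before and after tokenisation should show whether the half-spaces are being
replaced.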

Thank you again
Amir


On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang <[email protected]> wrote:

> yes, remove the double space. Sometimes, the double space is ignored,
> sometimes it's counted as a 'word' with no characters, depending on exactly
> how the program tokenizes the line.
>
>
>
>
> On 22 January 2014 10:09, amir haghighi <[email protected]> wrote:
>
>> Thank you Hieu,
>>
>> The corpus is UTF-8, but there is a double space in this line. Are double
>> spaces regarded as a word?
>> Should I remove double spaces from the lines manually to get the correct
>> sentence length?
>>
>>
>>
>> On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <[email protected]> wrote:
>>
>>>
>>> On 20/01/2014 13:45, amir haghighi wrote:
>>>
>>>   Hello
>>>
>>>  I've some questions about the giza word alignment.
>>>
>>>  1- Where is the final alignment file? Is it the aligned.1.grow.... in
>>> the model folder?
>>>
>>> yes.
>>>
>>>
>>>  2- Do the word indexes of both target and source sentences start
>>> from 0?
>>>
>>> yes
>>>
>>>
>>>  3- How does GIZA calculate the length of a sentence?
>>>
>>> the number of words
>>>
>>>  I have a sentence with 11 tokens that are separated by spaces, but in
>>> the alignment file its length is 13.
>>>
>>> strange. Are you sure your corpus file is encoded as UTF8? Are there
>>> double spaces in the line?
>>>
>>>
>>>  Regards
>>>  Amir
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>>
>>
>>
>>
>
>
> --
> Hieu Hoang
> Research Associate
> University of Edinburgh
> http://www.hoang.co.uk/hieu
>
>
