What tokenizer are you using? You can either edit or configure the
tokenizer to treat half-spaces as non-whitespace characters, or escape
them before passing the text to the tokenizer.
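The escaping option could look roughly like the sketch below. It assumes the half-space is U+200C (ZERO WIDTH NON-JOINER), which is the usual encoding in Persian text but is not confirmed in this thread. The placeholder string and the function names are hypothetical, and toy_tokenizer only imitates the reported behaviour; it is not the real tokenizer.

```python
ZWNJ = "\u200c"          # assumption: the Persian half-space is U+200C
PLACEHOLDER = "__HS__"   # hypothetical marker; must not occur in the corpus


def protect(line: str) -> str:
    """Replace every half-space with the placeholder before tokenizing."""
    return line.replace(ZWNJ, PLACEHOLDER)


def restore(line: str) -> str:
    """Put the half-spaces back after tokenizing."""
    return line.replace(PLACEHOLDER, ZWNJ)


def toy_tokenizer(line: str) -> str:
    """Stand-in that imitates the reported problem: ZWNJ becomes two spaces."""
    return line.replace(ZWNJ, "  ")


line = "می" + ZWNJ + "روم خانه"

broken = toy_tokenizer(line)                    # half-space turned into "  "
fixed = restore(toy_tokenizer(protect(line)))   # half-space survives

print(len(broken.split(" ")), len(fixed.split(" ")))
```

Whatever placeholder you pick has to be something your actual tokenizer leaves intact; if it splits on underscores or punctuation, choose a different marker.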
On 01/24/2014 12:36 PM, amir haghighi wrote:
I removed all of the double spaces from the corpus, but there are still
some double spaces in the tokenised file.
My source language is Persian and I have half-spaces in my corpus. I
noticed that after the tokenisation step, these half-spaces are
converted to double spaces. This conversion distorts the sentence
length and the alignment.
How can I prevent this conversion?
Thank you again
Amir
On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang <[email protected]> wrote:
Yes, remove the double spaces. Sometimes a double space is
ignored and sometimes it is counted as a 'word' with no characters,
depending on exactly how the program tokenizes the line.
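To illustrate the difference, here is a small Python sketch. It only shows two common ways of splitting a line and is not a claim about how GIZA++ itself counts words; the function names are made up for the example.

```python
import re


def count_split_on_single_space(line: str) -> int:
    """Split on each single space: a double space yields an empty 'word'."""
    return len(line.split(" "))


def count_split_on_whitespace_runs(line: str) -> int:
    """Split on runs of whitespace: double spaces are ignored."""
    return len(line.split())


def squeeze_spaces(line: str) -> str:
    """Collapse runs of spaces into one and trim the ends."""
    return re.sub(r" {2,}", " ", line).strip()


line = "a  b c"   # note the double space between "a" and "b"

print(count_split_on_single_space(line))                  # empty token is counted
print(count_split_on_whitespace_runs(line))               # empty token is ignored
print(count_split_on_single_space(squeeze_spaces(line)))  # consistent after cleanup
```

Running squeeze_spaces over the tokenised file, rather than only the raw corpus, removes double spaces that the tokenizer itself introduces.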
On 22 January 2014 10:09, amir haghighi <[email protected]> wrote:
Thank you Hieu,
The corpus is UTF-8, but there is a double space in this line.
Are double spaces regarded as a word?
Should I remove double spaces from the lines manually to get
the correct sentence length?
On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang <[email protected]> wrote:
On 20/01/2014 13:45, amir haghighi wrote:
Hello
I have some questions about the GIZA word alignment.
1- Where is the final alignment file? Is it the
aligned.1.grow.... in the model folder?
Yes.
2- Do the indexes of the words in both the target and source
sentences start from 0?
Yes.
3- How does GIZA calculate the length of a sentence?
The number of words.
I have a sentence with 11 tokens separated by spaces,
but in the alignment file its length is 13.
Strange. Are you sure your corpus file is encoded as UTF-8?
Are there double spaces in the line?
Regards
Amir
_______________________________________________
Moses-support mailing list
[email protected] <mailto:[email protected]>
http://mailman.mit.edu/mailman/listinfo/moses-support
--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu