What tokenizer are you using? You can either edit/configure the tokenizer to treat them as non-whitespace, or escape them before passing them to the tokenizer.
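
As a rough sketch of the second option (escaping before tokenization), something like the following could work. It assumes the "half-space" is U+200C (ZERO WIDTH NON-JOINER), which is the usual case in Persian text. The placeholder string and both function names are hypothetical, and whether a given placeholder passes through your tokenizer untouched is something to check on a few lines first.

```python
# Assumption: the Persian half-space is U+200C (ZERO WIDTH NON-JOINER).
ZWNJ = "\u200c"

# Hypothetical placeholder; choose any string that never occurs in the corpus
# and that your tokenizer does not split.
PLACEHOLDER = "xxHSxx"


def protect_half_spaces(line: str) -> str:
    """Replace each half-space with the placeholder before tokenization."""
    if PLACEHOLDER in line:
        raise ValueError("placeholder already occurs in the input line")
    return line.replace(ZWNJ, PLACEHOLDER)


def restore_half_spaces(line: str) -> str:
    """Put the half-spaces back after tokenization."""
    return line.replace(PLACEHOLDER, ZWNJ)
```

Run protect_half_spaces on each line before the tokenizer and restore_half_spaces on each line of its output.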


On 01/24/2014 12:36 PM, amir haghighi wrote:

I removed all of the double spaces from the corpus, but there are still some double spaces in the tokenised file. My source language is Persian and I have half-spaces in my corpus. I noticed that after the tokenisation step, these half-spaces are converted to double spaces. This conversion distorts the sentence length and the alignment.
How can I prevent this conversion?

Thank you again
Amir


On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang wrote:

    Yes, remove the double space. Sometimes the double space is
    ignored, sometimes it's counted as a 'word' with no characters,
    depending on exactly how the program tokenizes the line.
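
A small illustration of the point above, as a sketch only: it is not how GIZA++ itself splits lines, just two common ways of splitting on spaces that disagree when a double space is present. The function names are made up for the example.

```python
import re


def count_tokens_single_space(line: str) -> int:
    """Split on every single space; a double space yields an empty 'word'."""
    return len(line.split(" "))


def count_tokens_any_whitespace(line: str) -> int:
    """Split on runs of whitespace; double spaces are ignored."""
    return len(line.split())


def collapse_spaces(line: str) -> str:
    """Replace each run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", line)


line = "a  b c"  # note the double space between "a" and "b"
print(count_tokens_single_space(line))    # 4: 'a', '', 'b', 'c'
print(count_tokens_any_whitespace(line))  # 3: 'a', 'b', 'c'
print(collapse_spaces(line))              # a b c
```

After collapsing, both ways of counting give the same sentence length.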




    On 22 January 2014 10:09, amir haghighi wrote:

        Thank you Hieu,

        The corpus is UTF-8, but there is a double space in this line.
        Are double spaces regarded as a word?
        Should I remove double spaces from the lines manually to get
        the correct sentence length?



        On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang wrote:


            On 20/01/2014 13:45, amir haghighi wrote:
            Hello

            I've some questions about the giza word alignment.

            1- Where is the final alignment file? Is it the
            aligned.1.grow.... in the model folder?
            yes.


            2- Do the indexes of the words in both the target and source
            sentences start from 0?
            yes


            3- How does giza calculate the length of a sentence?
            the number of words

            I have a sentence with 11 tokens separated by spaces,
            but in the alignment file its length is 13.
            strange. Are you sure your corpus file is encoded as UTF8?
            Are there double spaces in the line?

            Regards
            Amir



            _______________________________________________
            Moses-support mailing list
            [email protected]  <mailto:[email protected]>
            http://mailman.mit.edu/mailman/listinfo/moses-support







-- Hieu Hoang
    Research Associate
    University of Edinburgh
    http://www.hoang.co.uk/hieu



