Re: [Moses-support] word alignment-words' indexes and sentences' length

Tom Hoar Thu, 23 Jan 2014 22:08:57 -0800

What tokenizer are you using? You can either edit/configure the tokenizer to treat them as non-whitespace, or escape them before passing them to the tokenizer.
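
A minimal sketch of the escaping option, assuming the half-space in question is the zero-width non-joiner (U+200C) and that the placeholder string never occurs in the corpus. The placeholder, the function names, and `fake_tokenize` are all hypothetical; `fake_tokenize` only imitates the reported behaviour and is not the Moses tokenizer. Whether a real tokenizer leaves a letters-only placeholder intact would still need checking.

```python
ZWNJ = "\u200c"           # assumption: the half-space is U+200C
PLACEHOLDER = "xxZWNJxx"  # hypothetical marker; must not occur in the corpus


def protect(line: str) -> str:
    """Replace each half-space with the placeholder before tokenizing."""
    return line.replace(ZWNJ, PLACEHOLDER)


def restore(line: str) -> str:
    """Put the half-spaces back after tokenizing."""
    return line.replace(PLACEHOLDER, ZWNJ)


def fake_tokenize(line: str) -> str:
    """Stand-in that mimics the reported bug: half-space -> double space."""
    return line.replace(ZWNJ, "  ")


def tokenize_preserving_half_spaces(line: str, tokenizer) -> str:
    """Hide half-spaces from the tokenizer, then restore them."""
    return restore(tokenizer(protect(line)))
```

With this wrapper the stand-in tokenizer never sees the half-space, so no double space is introduced and the token count stays the same.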



On 01/24/2014 12:36 PM, amir haghighi wrote:

I removed all of the double spaces from the corpus, but there are still some double spaces in the tokenised file. My source language is Persian and I have half-spaces in my corpus. I noticed that after the tokenisation step, these half-spaces are converted to double spaces. This conversion disturbs the sentence length and the alignment.

How can I prevent this conversion?

Thank you again
Amir

On Wed, Jan 22, 2014 at 2:10 PM, Hieu Hoang <[email protected]<mailto:[email protected]>> wrote:


    yes, remove the double space. Sometimes, the double space is
    ignored, sometimes it's counted as a 'word' with no characters,
    depending on exactly how the program tokenizes the line.
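
The difference can be reproduced in Python. This is purely an illustration of the two splitting behaviours, not a description of what GIZA++ or Moses does internally: splitting on every single space yields an empty token for a double space, while splitting on runs of whitespace does not.

```python
line = "a  b c"  # two spaces between "a" and "b"

strict_tokens = line.split(" ")  # split on every single space
loose_tokens = line.split()      # split on runs of whitespace

print(strict_tokens)  # ['a', '', 'b', 'c']
print(loose_tokens)   # ['a', 'b', 'c']


def collapse_spaces(text: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    return " ".join(text.split())
```

Running something like `collapse_spaces` over each line of the tokenised file, rather than only over the raw corpus, removes double spaces that the tokenizer itself introduced.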




    On 22 January 2014 10:09, amir haghighi
    <[email protected] <mailto:[email protected]>>
    wrote:

        Thank you Hieu,

        The corpus is UTF-8, but there is a double space in this line.
        Are double spaces regarded as a word?
        Should I remove double spaces from the lines manually to get
        the correct sentence length?



        On Tue, Jan 21, 2014 at 4:12 AM, Hieu Hoang
        <[email protected] <mailto:[email protected]>> wrote:


            On 20/01/2014 13:45, amir haghighi wrote:

            Hello

            I've some questions about the giza word alignment.

            1- Where is the final alignment file? Is it the
            aligned.1.grow.... in the model folder?

            yes.


            2- Do the indexes of the words in both the target and source
            sentences start from 0?

            yes


            3- How does GIZA calculate the length of a sentence?

            the number of words

            I have a sentence with 11 tokens separated by spaces,
            but in the alignment file its length is 13.

            strange. Are you sure your corpus file is encoded as UTF8?
            Are there double spaces in the line?


            Regards
            Amir
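
One way to track down a mismatch like this is to compare the indexes in each alignment line against the token counts of the corresponding sentence pair. The sketch below assumes the alignment file holds space-separated `i-j` pairs (0-based source index, then target index) and that tokens are whitespace-separated; the function names are illustrative and not part of GIZA++ or Moses.

```python
def parse_alignment(line: str):
    """Parse 'i-j' pairs into (source_index, target_index) tuples."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def out_of_range(pairs, src_len: int, tgt_len: int):
    """Return pairs whose 0-based indexes fall outside either sentence."""
    return [(s, t) for s, t in pairs if s >= src_len or t >= tgt_len]


def check_sentence_pair(source: str, target: str, alignment: str):
    """Report alignment points that do not fit the token counts."""
    pairs = parse_alignment(alignment)
    return out_of_range(pairs, len(source.split()), len(target.split()))
```

An empty result from `check_sentence_pair` means every alignment point refers to an existing token; a non-empty one points at a line where the aligner counted more words than a whitespace split does.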



            _______________________________________________
            Moses-support mailing list
            [email protected]  <mailto:[email protected]>
            http://mailman.mit.edu/mailman/listinfo/moses-support





--
Hieu Hoang

    Research Associate
    University of Edinburgh
    http://www.hoang.co.uk/hieu




