Hello,
I am using the English portion of the Multilingual UN Parallel Text corpus:
http://www.euromatrixplus.net/multi-un/
As Moses works with only one file, I concatenated all the files for one year
and removed the XML tags.
Now, when I try to tokenize the result, I get this error after it has been
running for some time:
Unicode non-character U+FDD3 is illegal for open interchange at
/home/tjr/mosesdecoder/scripts/tokenizer/tokenizer.perl line 157, <STDIN>
line 3414186.

I opened tokenizer.perl and checked line 157. It is the line containing
"print &tokenize($_);" in:

while(<STDIN>)
{
    if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
    {
        #don't try to tokenize XML/HTML tag lines
        print $_;
    }
    else
    {
        print &tokenize($_);
    }
}

As for <STDIN> line 3414186, I don't know how to find that line in my input
or what the problem might be. Any help?
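
For reference, here is a minimal sketch in Python of one way to find the
input lines that contain Unicode noncharacters such as U+FDD3 (the ranges
U+FDD0..U+FDEF, plus the last two code points of every plane). It assumes
the concatenated file is UTF-8 and that "<STDIN> line N" counts lines
separated by "\n". The helper names below are made up for illustration and
are not part of Moses.

```python
def is_noncharacter(ch):
    """Return True if ch is one of the 66 Unicode noncharacters."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, or a code point ending in FFFE/FFFF in any plane
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(lines):
    """Yield (line_number, [code points]) for lines with noncharacters."""
    for lineno, line in enumerate(lines, start=1):
        bad = [f"U+{ord(ch):04X}" for ch in line if is_noncharacter(ch)]
        if bad:
            yield lineno, bad


def strip_noncharacters(line):
    """Return line with every noncharacter removed."""
    return "".join(ch for ch in line if not is_noncharacter(ch))


def scan_file(path):
    """Return a list of (line_number, [code points]) for a UTF-8 file."""
    # newline="\n" splits only on "\n", which is assumed to match the
    # way the <STDIN> line number in the error message was counted.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return list(find_noncharacters(handle))
```

Calling scan_file() on the concatenated file (the path is whatever name you
gave it) should list the affected line numbers, and line 3414186 would be
expected among them if the assumptions above hold. strip_noncharacters()
could then be applied to each line before passing the text to
tokenizer.perl.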
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
