Hello,
I am using the English portion of the Multilingual UN Parallel Text corpus:
http://www.euromatrixplus.net/multi-un/
Since Moses works with only one file, I concatenated all the files for one
year and removed the XML tags.
Now, when I try to tokenize the result, I get this error after it has been
running for some time:
Unicode non-character U+FDD3 is illegal for open interchange at
/home/tjr/mosesdecoder/scripts/tokenizer/tokenizer.perl line 157, <STDIN>
line 3414186.
I opened tokenizer.perl and checked line 157. It is the line containing
"print &tokenize($_);" in this loop:
    while(<STDIN>)
    {
        if (($SKIP_XML && /^<.+>$/) || /^\s*$/)
        {
            #don't try to tokenize XML/HTML tag lines
            print $_;
        }
        else
        {
            print &tokenize($_);
        }
    }
}
As for "<STDIN> line 3414186", I don't know how to get at that line or what
the problem might be. Any help?
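
For reference, here is a minimal sketch of how the offending input could be
located. It assumes the message refers to line 3414186 of the text piped into
the tokenizer, and that U+FDD3 is one of the Unicode noncharacters (U+FDD0 to
U+FDEF, plus the code points ending in FFFE or FFFF). The function names
find_nonchars and strip_nonchars are made up for illustration and are not part
of Moses.

```perl
#!/usr/bin/env perl
# Sketch: report input lines that contain Unicode noncharacters.
# find_nonchars / strip_nonchars are hypothetical helpers, not Moses code.
use strict;
use warnings;

# Return the noncharacters found in $text as a list of "U+XXXX" strings.
sub find_nonchars {
    my ($text) = @_;
    my @found;
    while ($text =~ /(\p{Noncharacter_Code_Point})/g) {
        push @found, sprintf("U+%04X", ord($1));
    }
    return @found;
}

# Return a copy of $text with every noncharacter removed.
sub strip_nonchars {
    my ($text) = @_;
    $text =~ s/\p{Noncharacter_Code_Point}//g;
    return $text;
}

# When run as a script: read UTF-8 text from STDIN and list the
# offending line numbers and code points on STDERR.
unless (caller) {
    binmode(STDIN, ":utf8");
    while (my $line = <STDIN>) {
        my @bad = find_nonchars($line);
        print STDERR "line $.: @bad\n" if @bad;
    }
}
```

Saved under a hypothetical name such as find_nonchars.pl, this could be run as
"perl find_nonchars.pl < corpus.en", where corpus.en stands in for the
concatenated file. A single line can be printed with
"perl -ne 'print if $. == 3414186' corpus.en". Whether removing such characters
is acceptable for the corpus is a separate decision; strip_nonchars only shows
one way it might be done before tokenizing.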