Hello,
I am fairly sure this has no academic importance, but:
For the tokenizer, we have the -x option to skip XML/HTML tags.
The detokenizer, on the other hand, always skips them, whatever options are given.
See:
    while (<STDIN>) {
        if (/^<.+>$/ || /^\s*$/) {
            # don't try to detokenize XML/HTML tag lines
            print $_;
        } elsif ($PENN) {
            print &detokenize_penn($_);
        } else {
            print &detokenize($_);
        }
    }
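I think that, to be consistent, there should be a -x option in the detokenizer too.
Otherwise it always leaves untouched any entire line that starts with < and ends with >, even when that line contains text that should be detokenized.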
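
Something like the following could work. This is only a rough sketch, not a patch against the real script: $skip_xml stands for a hypothetical flag that a new -x option would set, process_line is a made-up helper name, and detokenize_stub is a stand-in for the real detokenize() so that the example runs on its own. The $PENN branch is left out for brevity.

```perl
use strict;
use warnings;

# Stand-in for the script's real detokenize(): it only reattaches a few
# punctuation marks, which is enough to show the effect of the flag.
sub detokenize_stub {
    my ($line) = @_;
    $line =~ s/ ([.,!?;:])/$1/g;
    return $line;
}

# $skip_xml is hypothetical: it would be set by a new -x option.
sub process_line {
    my ($line, $skip_xml) = @_;
    # Blank lines are always passed through unchanged.
    return $line if $line =~ /^\s*$/;
    # Tag lines are passed through only when skipping was requested.
    return $line if $skip_xml && $line =~ /^<.+>$/;
    return detokenize_stub($line);
}

print process_line("<b>Hello , world !</b>\n", 1);  # left unchanged
print process_line("<b>Hello , world !</b>\n", 0);  # detokenized
```

With the flag off, a line that happens to be wrapped in tags is detokenized like any other line. With the flag on, the behaviour is the same as today.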
Cheers,
Vincent
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support