Hello,

I'm pretty sure this has no academic importance, but:

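The tokenizer has a -x option to skip XML/HTML tags.

The detokenizer has no such option: it always passes through, untouched, any line that starts with < and ends with >, whether or not the line is really markup.
cf.: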

while(<STDIN>) {
    if (/^<.+>$/ || /^\s*$/) {
        #don't try to detokenize XML/HTML tag lines
        print $_;
    } elsif ($PENN) {
        print &detokenize_penn($_);
    } else {
        print &detokenize($_);
    }
}


I think that, to be consistent, the detokenizer should have a -x option too.

Otherwise it will always leave an entire line undetokenized just because it happens to start with < and end with >.
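
To make the proposal concrete, here is a minimal sketch of how the check could be gated on a flag. It is an assumption about how this might look, not the actual detokenizer code: the -x option does not exist in the detokenizer today, $SKIP_XML and process_line are hypothetical names, and detokenize_stub is only a toy stand-in for the real detokenize().

```perl
use strict;
use warnings;

# Hypothetical flag: would be set to 1 when a (proposed) -x option is given.
my $SKIP_XML = 0;

# Toy placeholder for the real detokenize(): it only reattaches
# a few punctuation marks to the preceding word.
sub detokenize_stub {
    my ($line) = @_;
    $line =~ s/ ([.,!?])/$1/g;
    return $line;
}

# Decide what to do with one input line.
sub process_line {
    my ($line, $skip_xml) = @_;

    # Blank lines are always passed through unchanged.
    return $line if $line =~ /^\s*$/;

    # Tag-like lines are passed through only when skipping was requested.
    return $line if $skip_xml && $line =~ /^<.+>$/;

    return detokenize_stub($line);
}
```

In the loop above, the first branch would then become something like print process_line($_, $SKIP_XML), with the $PENN branch kept as it is. With the flag off, a line such as "<b> Hello , world . </b>" gets detokenized like any other line; with it on, the line is left alone as it is today.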

Cheers,

Vincent


_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
