Hi, you should replace the "<" and ">" with &lt; and &gt;.
scripts/tokenizer/escape-special-chars.perl does that for you.

-phi

On Thu, May 11, 2017 at 3:12 PM, Ergun Bicici <[email protected]> wrote:

> clean-corpus-n.perl can clean XML tags before tokenization:
>
> sub word_count {
>   my ($line) = @_;
>   if ($ignore_xml) {
>     $line =~ s/<\S[^>]*\S>/ /g;
>     $line =~ s/\s+/ /g;
>     $line =~ s/^ //g;
>     $line =~ s/ $//g;
>   }
>   my @w = split(/ /,$line);
>   return scalar @w;
> }
>
> Ergun
>
> On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:
>
>> Similarly:
>> ERROR: some opened tags were never closed: it shares some features in
>> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
>> block of text which is not for parsing .
>>
>> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>>
>>> TRAINING_extract-phrases is giving
>>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg
>>> 120.000 kg
>>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>>
>>> etc.
>>>
>>> this appears to be due to the tokenization of html tags.
>>>
>>> Is there an option of Moses to handle these?
>>>
>>> --
>>>
>>> Regards,
>>> Ergun
>>>
>>> Ergun Biçici
>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
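
For anyone who wants to see what the escaping amounts to, here is a minimal sketch in Python. It is an illustration only, not the Moses script: the function name `escape_for_moses` is made up for this example, and it handles just `&`, `<` and `>`. The actual escape-special-chars.perl should be checked for the full set of characters it covers.

```python
def escape_for_moses(line: str) -> str:
    """Escape XML-significant characters in one tokenized line.

    Hypothetical helper for illustration; it is not part of Moses.
    """
    # Escape '&' first so the entities produced below are not escaped again.
    line = line.replace("&", "&amp;")
    line = line.replace("<", "&lt;")
    line = line.replace(">", "&gt;")
    return line


if __name__ == "__main__":
    # One of the lines that triggered "malformed XML" in the thread.
    print(escape_for_moses("Wirtschaftsjahr Betriebsgrösse < 50.000 kg"))
    # prints: Wirtschaftsjahr Betriebsgrösse &lt; 50.000 kg
```

With the angle brackets written as entities, a stray "<" in the corpus should no longer be read as the start of an XML tag during phrase extraction.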
