OK, thank you. It turns out that a dataset I used was not tokenized. You already mentioned in a previous thread that these characters are escaped:
https://www.mail-archive.com/[email protected]/msg10412.html

> Also, it does not do tokenization, so if you want your data tokenized,
> you should use the tokenizer instead, which also escapes special
> characters.

Regards,
Ergun

On Fri, May 12, 2017 at 9:06 PM, Philipp Koehn <[email protected]> wrote:

> Hi,
>
> you should replace the "<" and ">" with &lt; and &gt;
>
> scripts/tokenizer/escape-special-chars.perl does that for you.
>
> -phi
>
> On Thu, May 11, 2017 at 3:12 PM, Ergun Bicici <[email protected]> wrote:
>
>> clean-corpus-n.perl can clean XML tags before tokenization:
>>
>> sub word_count {
>>   my ($line) = @_;
>>   if ($ignore_xml) {
>>     $line =~ s/<\S[^>]*\S>/ /g;
>>     $line =~ s/\s+/ /g;
>>     $line =~ s/^ //g;
>>     $line =~ s/ $//g;
>>   }
>>   my @w = split(/ /,$line);
>>   return scalar @w;
>> }
>>
>> Ergun
>>
>> On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:
>>
>>> Similarly:
>>> ERROR: some opened tags were never closed: it shares some features in
>>> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
>>> block of text which is not for parsing .
>>>
>>> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>>>
>>>> TRAINING_extract-phrases is giving
>>>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg
>>>> 120.000 kg
>>>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>>>
>>>> etc.
>>>>
>>>> this appears to be due to the tokenization of html tags.
>>>>
>>>> Is there an option of Moses to handle these?
--

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
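
For anyone reading this thread later, here is a rough Python sketch of the two ideas discussed above: escaping the characters that extract-phrases otherwise reads as XML markup, and counting words while ignoring XML tags. It is not the Moses code. The function names are hypothetical, only the "<" to &lt; and ">" to &gt; mappings come from this thread, and the remaining table entries are my assumption about what scripts/tokenizer/escape-special-chars.perl covers, so please check them against the script itself. The word count is a loose port of the word_count sub quoted above and may differ from Perl's split on edge cases.

```python
import re

# Replacement table. Only "<" and ">" are taken from the thread; the other
# entries are an ASSUMPTION about the Moses script and should be verified.
ESCAPES = [
    ("&", "&amp;"),   # must come first so the entities added below stay intact
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(line: str) -> str:
    """Replace each special character with its entity, in table order."""
    for raw, entity in ESCAPES:
        line = line.replace(raw, entity)
    return line


def word_count(line: str, ignore_xml: bool = False) -> int:
    """Count space-separated tokens, optionally blanking XML tags first."""
    if ignore_xml:
        # Same pattern as the quoted Perl: "<" and ">" must each touch a
        # non-space character, so a tokenized "< 50.000 kg" is left alone.
        line = re.sub(r"<\S[^>]*\S>", " ", line)
        line = re.sub(r"\s+", " ", line)
        line = line.strip(" ")
    # Empty tokens are dropped; this is a simplification of Perl's split.
    return len([w for w in line.split(" ") if w])
```

With this sketch, escape_special_chars("Betriebsgrösse < 50.000 kg") gives "Betriebsgrösse &lt; 50.000 kg". Note that the tag pattern only matches when "<" and ">" are attached to their neighbours, which is why tags that have already been split apart by tokenization still need escaping.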
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
