Hi,

You should replace the "<" and ">" characters in your corpus with &lt; and &gt; so that they are not parsed as XML markup.

scripts/tokenizer/escape-special-chars.perl does that for you.
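
To illustrate the replacement itself, here is a minimal Python sketch. It is not the Perl script, and the function name escape_special_chars is only made up for this example. It also escapes "&" first (my assumption, so that the entities stay unambiguous); the real script may well escape more characters than these three, so check it before relying on this.

```python
def escape_special_chars(line: str) -> str:
    """Escape characters that would otherwise be read as XML markup.

    Hypothetical helper for illustration only; it covers just "&", "<" and ">".
    """
    # Escape "&" first so the entities produced below are not escaped twice.
    line = line.replace("&", "&amp;")
    line = line.replace("<", "&lt;")
    line = line.replace(">", "&gt;")
    return line


if __name__ == "__main__":
    # One of the lines reported as malformed XML in this thread.
    print(escape_special_chars("Wirtschaftsjahr Betriebsgrösse < 50.000 kg"))
```

The escaping has to be applied to both sides of the parallel corpus after tokenization, otherwise the training and test data will not match.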

-phi

On Thu, May 11, 2017 at 3:12 PM, Ergun Bicici <[email protected]> wrote:

>
> clean-corpus-n.perl can clean XML tags before tokenization:
>
> sub word_count {
>  my ($line) = @_;
>  if ($ignore_xml) {
>    $line =~ s/<\S[^>]*\S>/ /g;
>    $line =~ s/\s+/ /g;
>    $line =~ s/^ //g;
>    $line =~ s/ $//g;
>  }
>  my @w = split(/ /,$line);
>  return scalar @w;
> }
>
> Ergun
>
> On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:
>
>>
>> Similarly:
>> ERROR: some opened tags were never closed: it shares some features in
>> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
>> block of text which is not for parsing .
>>
>>
>> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>>
>>>
>>> TRAINING_extract-phrases is giving the following errors:
>>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg
>>> 120.000 kg
>>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>>
>>> etc.
>>>
>>> This appears to be due to the tokenization of HTML tags.
>>>
>>> Does Moses have an option to handle these?
>>>
>>> --
>>>
>>> Regards,
>>> Ergun
>>>
>>> Ergun Biçici
>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>