clean-corpus-n.perl can strip XML tags when counting words, provided it runs before tokenization:

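sub word_count {
  my ($line) = @_;
  if ($ignore_xml) {
    # Replace each XML tag with a space, then normalize whitespace.
    $line =~ s/<\S[^>]*\S>/ /g;
    $line =~ s/\s+/ /g;
    $line =~ s/^ //g;
    $line =~ s/ $//g;
  }
  # Count the remaining space-separated tokens.
  my @w = split(/ /,$line);
  return scalar @w;
}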
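
To make the behaviour concrete, here is a small standalone sketch of the same logic. The name word_count_sketch and the flag passed as a second argument are my own assumptions for illustration; in the sub above, $ignore_xml is a variable set elsewhere in the script. Note that the pattern needs a non-space character right after "<" and right before ">", so markup that has already been tokenized (like the CDATA line quoted below) is not matched:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical standalone variant of word_count: the flag is an argument
# here (an assumption for this sketch) instead of a script-level variable.
sub word_count_sketch {
  my ($line, $ignore_xml) = @_;
  if ($ignore_xml) {
    $line =~ s/<\S[^>]*\S>/ /g;   # blank out tags such as <seg id="1">
    $line =~ s/\s+/ /g;           # collapse runs of whitespace
    $line =~ s/^ //;              # trim leading space
    $line =~ s/ $//;              # trim trailing space
  }
  my @w = split(/ /, $line);
  return scalar @w;
}

my $markup    = '<seg id="1"> das ist ein Test </seg>';
my $tokenized = '< ! [ CDATA [ ] ] > construct';

print word_count_sketch($markup, 0), "\n";     # 7: tag pieces count as words
print word_count_sketch($markup, 1), "\n";     # 4: tags are ignored
print word_count_sketch($tokenized, 1), "\n";  # 9: tokenized tag is not matched
```

So the tag handling only helps on text where the tags are still intact, which is why it matters whether it runs before the tokenizer splits them up.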

Ergun

On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:

>
> Similarly:
> ERROR: some opened tags were never closed: it shares some features in
> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
> block of text which is not for parsing .
>
>
> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>
>>
>> TRAINING_extract-phrases is giving errors such as:
>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg 120.000
>> kg
>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>
>> etc.
>>
>> This appears to be due to the tokenization of HTML tags.
>>
>> Is there an option in Moses to handle these?
>>
>> --
>>
>> Regards,
>> Ergun
>>
>> Ergun Biçici
>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>


