clean-corpus-n.perl can clean XML tags before tokenization:
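sub word_count {
    my ($line) = @_;
    if ($ignore_xml) {
        # drop tags that have no space right after '<' or right before '>'
        $line =~ s/<\S[^>]*\S>/ /g;
        # collapse whitespace and trim both ends
        $line =~ s/\s+/ /g;
        $line =~ s/^ //g;
        $line =~ s/ $//g;
    }
    my @w = split(/ /,$line);
    return scalar @w;
}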
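
A minimal sketch of why this has to happen before tokenization, assuming the same substitutions as in word_count above. strip_xml is a hypothetical helper written for illustration, not a function in clean-corpus-n.perl, and the sample lines are adapted from the error messages quoted below. The pattern requires a non-space character right after '<' and right before '>', so once the tokenizer has split a tag into "< ! [ CDATA [ ] ] >" it no longer matches.

```perl
use strict;
use warnings;

# Hypothetical helper: same substitutions as the $ignore_xml branch of word_count.
sub strip_xml {
    my ($line) = @_;
    $line =~ s/<\S[^>]*\S>/ /g;   # needs non-space after '<' and before '>'
    $line =~ s/\s+/ /g;           # collapse runs of whitespace
    $line =~ s/^ //;              # trim leading space
    $line =~ s/ $//;              # trim trailing space
    return $line;
}

# Untokenized markup is removed.
print strip_xml('<seg id="1">Hello world</seg>'), "\n";
# prints: Hello world

# Tokenized markup is left untouched, because '<' is followed by a space.
print strip_xml('block of < ! [ CDATA [ ] ] > text'), "\n";
# prints: block of < ! [ CDATA [ ] ] > text

# A bare '<' used as "less than" is also left alone.
print strip_xml('< 50.000 kg 120.000 kg'), "\n";
# prints: < 50.000 kg 120.000 kg
```

The last two lines are the kind of input that still triggers "malformed XML" in extract-phrases, so they would need to be escaped or removed by some other step.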
Ergun
On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:
>
> Similarly:
> ERROR: some opened tags were never closed: it shares some features in
> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
> block of text which is not for parsing .
>
>
> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>
>>
>> TRAINING_extract-phrases is giving
>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg 120.000
>> kg
>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>
>> etc.
>>
>> This appears to be due to the tokenization of HTML tags.
>>
>> Is there an option in Moses to handle these?
>>
>> --
>>
>> Regards,
>> Ergun
>>
>> Ergun Biçici
>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>
--
Regards,
Ergun
Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>