OK, thank you. It turns out that one of the datasets I used was not tokenized.

You already mentioned in a previous thread that the tokenizer escapes
these characters:

https://www.mail-archive.com/[email protected]/msg10412.html
> Also, it does not do tokenization, so if you want your data tokenized,
> you should use the tokenizer instead, which also escapes special
> characters.
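
For anyone who finds this thread later, here is a minimal sketch of the
escaping step in Python. It is an illustration only, not the code of
tokenizer.perl or escape-special-chars.perl. The function name
escape_for_moses is hypothetical, and handling "&" in addition to the
"<" and ">" mentioned in this thread is my own assumption; the Moses
scripts may escape further characters.

```python
def escape_for_moses(line: str) -> str:
    """Escape characters that could otherwise be read as XML markup.

    Illustrative sketch only: covers "<" and ">" as discussed in the
    thread, plus "&" (an assumption, so that existing ampersands stay
    unambiguous after escaping).
    """
    # "&" has to go first; otherwise the "&" introduced by "&lt;" and
    # "&gt;" would itself be escaped in a later step.
    replacements = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")]
    for raw, escaped in replacements:
        line = line.replace(raw, escaped)
    return line


if __name__ == "__main__":
    # One of the lines from the error messages in this thread.
    print(escape_for_moses("Wirtschaftsjahr Betriebsgrösse < 50.000 kg"))
```

The sketch should be applied once to raw text; running it on text that is
already escaped would turn "&lt;" into "&amp;lt;".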


Regards,
Ergun


On Fri, May 12, 2017 at 9:06 PM, Philipp Koehn <[email protected]> wrote:

> Hi,
>
> you should replace the "<" and ">" with &lt; and &gt;
>
> scripts/tokenizer/escape-special-chars.perl does that for you.
>
> -phi
>
> On Thu, May 11, 2017 at 3:12 PM, Ergun Bicici <[email protected]> wrote:
>
>>
>> clean-corpus-n.perl can ignore XML tags when counting words, before tokenization:
>>
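>> # Count the words in a line, optionally ignoring XML-style tags.
>> sub word_count {
>>  my ($line) = @_;
>>  if ($ignore_xml) {
>>    $line =~ s/<\S[^>]*\S>/ /g;  # replace each <tag ...> with a space
>>    $line =~ s/\s+/ /g;          # collapse runs of whitespace
>>    $line =~ s/^ //g;            # trim a leading space
>>    $line =~ s/ $//g;            # trim a trailing space
>>  }
>>  my @w = split(/ /,$line);
>>  return scalar @w;
>> }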
>>
>> Ergun
>>
>> On Thu, May 11, 2017 at 10:33 AM, Ergun Bicici <[email protected]> wrote:
>>
>>>
>>> Similarly:
>>> ERROR: some opened tags were never closed: it shares some features in
>>> common with the SGML < ! [ CDATA [ ] ] > construct , in that it declares a
>>> block of text which is not for parsing .
>>>
>>>
>>> On Thu, May 11, 2017 at 10:32 AM, Ergun Bicici <[email protected]> wrote:
>>>
>>>>
>>>> TRAINING_extract-phrases is giving the following errors:
>>>> ERROR: malformed XML: Wirtschaftsjahr Betriebsgrösse < 50.000 kg
>>>> 120.000 kg
>>>> ERROR: malformed XML: < ! -- / * Font Definitions *
>>>>
>>>> etc.
>>>>
>>>> This appears to be due to the tokenization of HTML tags.
>>>>
>>>> Is there an option in Moses to handle these?
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Ergun
>>>>
>>>> Ergun Biçici
>>>> http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>>
>


-- 

Regards,
Ergun

Ergun Biçici
http://bicici.github.com/ <http://ergunbicici.blogspot.com/>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
