Re: [Moses-support] Preserving inline formatting

Raphael Payen Fri, 10 Sep 2010 03:40:59 -0700

I have also written a preprocessing script using this 3rd option. It
works with the moses option -T (not -t) and writes the tag segmentation
info to files, so there is also a shell script to use it with named FIFOs.
I sent it to the mailing list earlier (you can search for tag-wrapper),
but I remember making some modifications afterwards, probably even bug
fixes, so I can send you updated files if you wish.



2010/9/9 Barry Haddow <[email protected]>:
> Hi Achim
>
> You could look at the moses 'zones and walls' feature
> http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc14
>
> Also, there has been some work on translating web pages with moses, which uses
> your option (3) below.
> http://www.statmt.org/moses/?n=Moses.WebTranslation
>
> best regards
> Barry
>
> On Thursday 09 September 2010 04:09, Achim Ruopp wrote:
>> Hi,
>>
>> In my projects I have quite a bit of inline formatting that Moses is not
>> able to handle out-of-the-box. I plan to write code that preserves inline
>> formatting in formats like the Rich Text format during translation as part
>> of the Moses for Localization open source project
>> (http://groups.google.com/group/m4loc).
>>
>>
>>
>> E.g. I want to translate sentences like this:
>>
>> This is some really bold text.
>>
>> This is marked up in Rich Text Format like this:
>>
>> This is some {\b really bold} text.
>>
>>
>>
>> Typical for such inline formatting is that the formatting markup is paired
>> and can be nested, i.e. you could have something like:
>>
>> This is some {\b really bold {\i and also italic}} text.
>>
>> Sometimes there is also unmatched inline formatting.
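A minimal sketch of how such paired, nested markup could be separated from the text before translation. It assumes a simplified RTF subset in which a group opens with a brace plus one lowercase control word and closes with a plain brace; real RTF has many more cases. The name strip_markup is hypothetical and not part of Moses or m4loc. Unmatched closers are ignored and unmatched openers are extended to the end of the sentence, which is only one possible policy.

```python
import re

# Opening group with a control word, e.g. "{\b " or "{\i " (simplified RTF subset).
OPEN = re.compile(r"\{\\([a-z]+) ?")

def strip_markup(rtf):
    """Return (plain_text, spans); each span is (tag, start, end) in plain_text offsets."""
    out = []      # characters of the plain text
    stack = []    # currently open groups as (tag, start offset)
    spans = []    # closed groups, innermost first
    i = 0
    while i < len(rtf):
        m = OPEN.match(rtf, i)
        if m:
            stack.append((m.group(1), len(out)))
            i = m.end()
        elif rtf[i] == "}":
            if stack:  # an unmatched closing brace is silently dropped
                tag, start = stack.pop()
                spans.append((tag, start, len(out)))
            i += 1
        else:
            out.append(rtf[i])
            i += 1
    # Unmatched openers are assumed to run to the end of the sentence.
    while stack:
        tag, start = stack.pop()
        spans.append((tag, start, len(out)))
    return "".join(out), spans

text, spans = strip_markup(r"This is some {\b really bold {\i and also italic}} text.")
```

The plain text can then be sent to the decoder, while the spans record where each format applied in the source.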
>>
>>
>>
>> The ideas I have to do this with a (phrase-based) Moses system are:
>>
>> 1.       Wrap the markup in XML and use the Moses -xml-input exclusive
>> option to insert the markup into the translation, i.e. translate
>> This is some <m translation="{\b">{\b</m> really bold <m
>> translation="}">}</m> text.
>>
>> The issue is that during translation the markup gets jumbled through phrase
>> reordering: closing tags could move before opening tags, and nested
>> constructs could get distorted. I'd have to come up with a smart algorithm
>> to fix these rearrangements.
>>
>> 2.       Transform the markup into XML markup and use the Moses -xml-input
>> exclusive option to preserve the markup similar to specifying reordering
>> constraints (see
>> http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc14)
>> This is some <bold>really bold</bold> text.
>> After translation transform the XML markup back into the right markup for
>> the format (e.g. <bold> -> {\b). Will the XML be deleted during translation?
>>
>> 3.       Remove any formatting from the text before translation and use the
>> decoder extended output option (-t) to determine which target language
>> phrases were generated by which source language phrases. Use this
>> information to project the formatting information to the target sentence.
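A rough sketch of the projection step in option 3. It assumes the decoder output carries phrase segmentation markers of the form "target phrase |start-end|", where start and end are source token indices; the exact layout produced by -t should be checked against your Moses version, and the -T details file is laid out differently. The function names and the example sentence are hypothetical. Formatting is projected at phrase granularity only: any target phrase whose source span overlaps the formatted source tokens gets wrapped.

```python
import re

# One target phrase followed by its source span, e.g. "really bold |2-3|".
SEG = re.compile(r"(.*?)\s*\|(\d+)-(\d+)\|\s*")

def parse_segmentation(line):
    """Split decoder output into (target_phrase, src_start, src_end) triples."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in SEG.finditer(line)]

def project(line, src_start, src_end, open_tag, close_tag):
    """Wrap target phrases whose source span overlaps tokens src_start..src_end."""
    out = []
    inside = False
    for phrase, s, e in parse_segmentation(line):
        overlaps = s <= src_end and e >= src_start
        if overlaps and not inside:
            phrase = open_tag + phrase   # open the format before this phrase
        elif not overlaps and inside:
            out[-1] += close_tag         # close it after the previous phrase
        inside = overlaps
        out.append(phrase)
    if inside:
        out[-1] += close_tag
    return " ".join(out)

# Hypothetical decoder output for a source whose tokens 2..3 were bold.
decoded = "this is |0-1| really bold |2-3| . |4-4|"
result = project(decoded, 2, 3, "{\\b ", "}")
```

If reordering splits the formatted source tokens across non-adjacent target phrases, this sketch opens and closes the format once per run, so the output stays well-formed at the cost of extra tag pairs.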
>>
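For option 1, a small check like the following could at least detect when reordering has broken the markup. It is a sketch under the assumption that the translated sentence is whitespace-tokenized, that every opening marker is a token starting with a brace and backslash, and that every closing marker is a lone closing brace. The name braces_balanced is hypothetical. It only detects the problem; repairing the order is a separate step.

```python
def braces_balanced(tokens):
    """True if no group closes before it opens and all groups are closed."""
    depth = 0
    for tok in tokens:
        if tok.startswith("{\\"):   # opening marker such as {\b or {\i
            depth += 1
        elif tok == "}":            # closing marker
            depth -= 1
            if depth < 0:           # a closer moved in front of its opener
                return False
    return depth == 0

ok = braces_balanced(r"This is some {\b really bold } text .".split())
jumbled = braces_balanced(r"This is some } really bold {\b text .".split())
```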
>>
>>
>> Is there a best option among the three above? Why? Are there other options
>> that I missed?
>>
>>
>>
>> Thanks in advance!
>>
>>
>>
>> If you are interested in the topic and would like to participate, please
>> small 'r'. I'm looking for collaborators.
>>
>>
>>
>> Achim
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

