I have also written a preprocessing script that uses this third option. It works with the Moses -T option (not -t) and writes the tag segmentation info to files, so there is also a shell script for running it with named FIFOs. I sent it to the mailing list earlier (search for tag-wrapper), but I remember making some modifications afterwards, probably including bug fixes, so I can send you updated files if you wish.
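
To make the idea behind option (3) in the quoted message below concrete, here is a minimal Python sketch. It is not the tag-wrapper script. It assumes the decoder output marks each target phrase inline with its source span as `|start-end|`, and it handles only one formatted span per sentence. The names `parse_segmented` and `project_span` are made up for this illustration.

```python
import re

# Assumed inline segmentation format: "target words |start-end| ..."
SEGMENT = re.compile(r"(.*?)\s*\|(\d+)-(\d+)\|\s*")


def parse_segmented(output):
    """Split segmented output into (target_tokens, src_start, src_end) tuples."""
    return [
        (m.group(1).split(), int(m.group(2)), int(m.group(3)))
        for m in SEGMENT.finditer(output)
    ]


def project_span(phrases, span_start, span_end, open_tag, close_tag):
    """Wrap every target phrase whose source span overlaps the formatted
    source span [span_start, span_end] (inclusive token indices).

    Tags are emitted around contiguous runs of covered target phrases, so an
    opening tag always precedes its closing tag, even after reordering.
    """
    out = []
    inside = False
    for tokens, start, end in phrases:
        covered = start <= span_end and end >= span_start
        if covered and not inside:
            out.append(open_tag)
        elif not covered and inside:
            out.append(close_tag)
        inside = covered
        out.extend(tokens)
    if inside:
        out.append(close_tag)
    return " ".join(out)


# Source (formatting stripped): This is some really bold text .
# The {\b ...} group covered source tokens 3-4 ("really bold").
decoded = "Dies ist |0-1| etwas |2-2| wirklich fetter |3-4| Text . |5-6|"
print(project_span(parse_segmented(decoded), 3, 4, "{\\b", "}"))
# prints: Dies ist etwas {\b wirklich fetter } Text .
```

Two limitations of this sketch: a target phrase that only partly overlaps the formatted span is formatted as a whole, and if reordering splits the span, the formatting is emitted as several separate runs. Nested formatting would need one pass per span plus a step that restores proper nesting.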

2010/9/9 Barry Haddow <[email protected]>:
> Hi Achim
>
> You could look at the moses 'zones and walls' feature
> http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc14
>
> Also, there has been some work on translating web pages with moses, which uses
> your option (3) below.
> http://www.statmt.org/moses/?n=Moses.WebTranslation
>
> best regards
> Barry
>
> On Thursday 09 September 2010 04:09, Achim Ruopp wrote:
>> Hi,
>>
>> In my projects I have quite a bit of inline formatting that Moses is not
>> able to handle out-of-the-box. I plan to write code that preserves inline
>> formatting in formats like the Rich Text format during translation as part
>> of the Moses for Localization open source project
>> (http://groups.google.com/group/m4loc).
>>
>> E.g. I want to translate sentences like this:
>>
>> This is some really bold text.
>>
>> This is marked up in Rich Text Format like this:
>>
>> This is some {\b really bold} text.
>>
>> Typical for such inline formatting is that the formatting markup is paired
>> and it can be nested, i.e. you could have something like:
>>
>> This is some {\b really bold {\i and also italic}} text.
>>
>> Sometimes there is also unmatched inline formatting.
>>
>> The ideas I have to do this with a (phrase-based) Moses system are:
>>
>> 1. Wrap the markup in XML and use the Moses -xml-input exclusive
>> option to insert the markup into the translation, i.e. translate
>> This is some <m translation="{\b">{\b</m> really bold <m
>> translation="}">}</m> text.
>>
>> The issue is that during translation the markup gets jumbled through
>> phrase rearranging: closing tags could move before opening tags, nested
>> constructs could get distorted. I'd have to come up with a smart algorithm
>> to fix these rearrangements.
>>
>> 2. Transform the markup into XML markup and use the Moses -xml-input
>> exclusive option to preserve the markup similar to specifying reordering
>> constraints (see
>> http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc14)
>> This is some <bold> really bold</bold> text.
>> After translation transform the XML markup back into the right markup for
>> the format (e.g. <bold> -> {\b). Will the XML be deleted during translation?
>>
>> 3. Remove any formatting from the text before translation and use the
>> decoder extended output option (-t) to determine which target language
>> phrases were generated by which source language phrases. Use this
>> information to project the formatting information to the target sentence.
>>
>> Is there a best option among the three above? Why? Are there other options
>> that I missed?
>>
>> Thanks in advance!
>>
>> If you are interested in the topic and would like to participate, please
>> small 'r'. I'm looking for collaborators.
>>
>> Achim
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
