Hi John,

If you're looking to completely remove these inline elements, you can remove the tags, then unescape their contents, and run a second pass to remove the HTML tags. That works if the contents of the bpt/ept tags are HTML. However, they could be RTF or some other markup language. We've found it's safe to simply use a regex pattern that removes each <bpt> .... </bpt> and <ept> .... </ept> element together with everything inside it.
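
As a rough illustration of the regex approach, here is a minimal Python sketch. The pattern and the function name strip_bpt_ept are my own assumptions for this example, not code from Okapi or M4Loc, and it assumes bpt/ept elements are not nested inside each other.

```python
import re

# Hypothetical pattern: matches a whole <bpt ...>...</bpt> or <ept ...>...</ept>
# element, including whatever native markup (HTML, RTF, ...) it contains.
INLINE_PAIR = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_bpt_ept(segment: str) -> str:
    """Remove every bpt/ept element together with its contents."""
    return INLINE_PAIR.sub("", segment)


src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
print(strip_bpt_ept(src))  # das ist ein kleines haus
```

Because the contents are dropped without being unescaped or parsed, it does not matter which markup language they hold.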

These tags are not generated by Okapi, but other tools do create them. So if you're looking to regenerate these and other tags created by other tools in the translated output, I think you're out of luck for now. We're developing a tool that supports all XLIFF 1.2 inline elements during translation, but it will not be published as open source. It's scheduled for completion by the end of the year.

Hi Achim,

Can you verify the "lb" tag you included in your list? I reviewed the XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec that was published yesterday: https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf. It's a significant departure from 1.2! Any/all solutions that were developed for 1.2's inline elements will need to be totally re-thought and re-written.



On 10/16/2013 09:20 PM, Achim Ruopp wrote:

Hi John,

The M4Loc tool chain only handles the subset of XLIFF inline tags generated by the Okapi Moses Text Filter:

http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter

The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk

If you aren't using the Okapi tools, you can still use their library, I believe, to convert

das ist ein *<bpt id="1">&lt;b&gt;</bpt>*kleines haus*<ept id="1">&lt;/b&gt;</ept>*

into

das ist ein *<g id="1">*kleines haus*</g>*

and apply the reverse process to the translation.
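
To make the round trip concrete, here is a minimal Python sketch of the idea. It does not use the Okapi library; it is a plain regex stand-in, and the names to_g and from_g are hypothetical. It assumes that id is the first attribute on bpt/ept, that matching bpt and ept share the same id as in the example above, and that any other attributes can be dropped.

```python
import re

# Hypothetical patterns; they assume id is the first attribute of each tag.
BPT = re.compile(r'<bpt\s+id="([^"]+)"[^>]*>(.*?)</bpt>', re.DOTALL)
EPT = re.compile(r'<ept\s+id="([^"]+)"[^>]*>(.*?)</ept>', re.DOTALL)
G_TAG = re.compile(r'<g\s+id="([^"]+)">|</g>')


def to_g(segment):
    """Replace bpt/ept pairs with <g> and remember the original codes."""
    codes = {}

    def open_tag(m):
        codes[("bpt", m.group(1))] = m.group(2)
        return '<g id="%s">' % m.group(1)

    def close_tag(m):
        codes[("ept", m.group(1))] = m.group(2)
        return "</g>"

    return EPT.sub(close_tag, BPT.sub(open_tag, segment)), codes


def from_g(segment, codes):
    """Reverse step: turn <g> back into bpt/ept using the stored codes."""
    stack = []  # ids of currently open <g> elements, innermost last

    def restore(m):
        if m.group(1) is not None:
            stack.append(m.group(1))
            return '<bpt id="%s">%s</bpt>' % (m.group(1), codes[("bpt", m.group(1))])
        tag_id = stack.pop()
        return '<ept id="%s">%s</ept>' % (tag_id, codes[("ept", tag_id)])

    return G_TAG.sub(restore, segment)


src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
simplified, codes = to_g(src)
print(simplified)  # das ist ein <g id="1">kleines haus</g>
print(from_g('this is a <g id="1">small house</g>', codes))
```

The reverse step relies on the translated segment still containing well-formed, properly nested g tags.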

Alternatively you could modify M4Loc to handle all XLIFF inline tagging

http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine

But I think this would be messier, and by using Okapi you also get support for future XLIFF versions (e.g. 2.0).

Achim

*From:*[email protected] [mailto:[email protected]] *On Behalf Of *John Tinsley
*Sent:* Wednesday, October 16, 2013 7:20 AM
*To:* [email protected]
*Subject:* [Moses-support] XLIFF support in the M4Loc project

Hi folks,

I'm having a little trouble with XLIFF handling using some of the M4Loc tools, specifically 'reinsert.pm' for replacing inline markup after translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).

It works fine for simple tags where the text between the tags *should* be translated, e.g.

*src:* das ist ein <bx id="1">kleines haus</bx>

*tgt: *this is |0-1| a |2-2| small |3-3| house |4-4|

*output: *this is a <bx id="1"> small house </bx>

However, there are often examples of paired tags (kind of like markup around markup) which are not handled, e.g.

das ist ein *<bpt id="1">&lt;b&gt;</bpt>*kleines haus*<ept id="1">&lt;/b&gt;</ept>*

In this case, the <bpt> and <ept> tags are paired, and everything inside each of them should be stripped out, e.g. *&lt;b&gt;*, but this doesn't appear to happen.

Is there another tool in the project that handles this kind of markup or is it not supported?

Thanks

John


--

John Tinsley



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
