Hi John,
If you're looking to completely remove these inline elements, you can
remove the bpt/ept tags, unescape their contents, and run a second pass
to remove the HTML tags. That works if the contents of the bpt/ept tags
are HTML. However, they could be RTF or some other markup language. We've
found it's safe to simply use a regex pattern to remove each
<bpt> .... </bpt> and <ept> .... </ept> element along with everything
inside it.
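For illustration, here is a minimal Python sketch of that regex approach. It is not taken from Okapi or M4Loc. The function name strip_bpt_ept is made up for this example, and it assumes bpt/ept elements are never nested and contain no <sub> children. The sample segment assumes the native code is XML-escaped, as it would be in a well-formed file.

```python
import re

# Matches a whole <bpt ...>...</bpt> or <ept ...>...</ept> element,
# including the native code inside it. Assumption: no nesting, no <sub>.
INLINE_CODE = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)


def strip_bpt_ept(segment: str) -> str:
    """Remove bpt/ept elements and their content; keep the translatable text."""
    return INLINE_CODE.sub("", segment)


example = ('das ist ein <bpt id="1">&lt;b&gt;</bpt>'
           'kleines haus<ept id="1">&lt;/b&gt;</ept>')
print(strip_bpt_ept(example))  # das ist ein kleines haus
```

Because the pattern never looks at what is inside the element, it behaves the same whether the native code is HTML, RTF or something else.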
These tags are not generated by Okapi, but other tools do create them.
So if you're looking to regenerate these and other tags created by other
tools in the translated output, I think you're out of luck for now.
We're developing a tool that supports all XLIFF 1.2 inline elements
during translation, but it will not be published as open source. It's
scheduled for completion by the end of the year.
Hi Achim,
Can you verify the "lb" tag you included in your list? I reviewed the
XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft
spec that was published yesterday:
https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf.
It's a significant departure from 1.2! Any/all solutions that were
developed for 1.2's inline elements will need to be totally re-thought
and re-written.
On 10/16/2013 09:20 PM, Achim Ruopp wrote:
Hi John,
The M4Loc tool chain only handles a subset of XLIFF inline tags
generated by the Okapi Moses Text Filter
http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter
The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk
If you aren't using the Okapi tools, you can still use their library,
I believe, to convert
das ist ein *<bpt id="1"><b></bpt>*kleines haus*<ept id="1"></b></ept>*
into
das ist ein *<g id="1">*kleines haus*</g>*
and apply the reverse process to the translation.
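To make the round trip concrete, here is a rough Python sketch. It uses plain regexes and is not the Okapi library's API. The names to_g and from_g are hypothetical. It assumes every bpt has a matching ept with the same double-quoted id, that pairs are properly nested, and that every <g> in the translation came from such a pair.

```python
import re

BPT = re.compile(r'<bpt id="([^"]+)">(.*?)</bpt>', re.DOTALL)
EPT = re.compile(r'<ept id="([^"]+)">(.*?)</ept>', re.DOTALL)
G_TAG = re.compile(r'<g id="([^"]+)">|</g>')


def to_g(segment):
    """Replace bpt/ept pairs with <g>...</g>; return the segment and the saved codes."""
    codes = {}

    def open_g(m):
        codes.setdefault(m.group(1), {})["open"] = m.group(2)
        return '<g id="%s">' % m.group(1)

    def close_g(m):
        codes.setdefault(m.group(1), {})["close"] = m.group(2)
        return "</g>"

    segment = BPT.sub(open_g, segment)
    segment = EPT.sub(close_g, segment)
    return segment, codes


def from_g(segment, codes):
    """Reverse step: turn <g>...</g> back into bpt/ept using the saved codes."""
    stack = []  # ids of currently open <g> elements, innermost last

    def restore(m):
        if m.group(1) is not None:  # opening <g id="...">
            stack.append(m.group(1))
            return '<bpt id="%s">%s</bpt>' % (m.group(1), codes[m.group(1)]["open"])
        gid = stack.pop()  # a closing </g> belongs to the innermost open id
        return '<ept id="%s">%s</ept>' % (gid, codes[gid]["close"])

    return G_TAG.sub(restore, segment)


source = 'das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>'
simplified, codes = to_g(source)
print(simplified)  # das ist ein <g id="1">kleines haus</g>
print(from_g('this is a <g id="1">small house</g>', codes))
```

The saved codes have to travel with the segment through translation, since the <g> element itself carries only the id.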
Alternatively you could modify M4Loc to handle all XLIFF inline tagging
http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine
But I think that would be messier, and by using Okapi you also get
support for future XLIFF versions (e.g. 2.0).
Achim
*From:*[email protected]
[mailto:[email protected]] *On Behalf Of *John Tinsley
*Sent:* Wednesday, October 16, 2013 7:20 AM
*To:* [email protected]
*Subject:* [Moses-support] XLIFF support in the M4Loc project
Hi folks,
I'm having a little trouble with XLIFF handling using some of the
M4Loc tools, specifically 'reinsert.pm' for replacing inline markup
after translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).
It works fine for simple tags where the text between the tags *should*
be translated, e.g.
*src:* das ist ein <bx id="1">kleines haus</bx>
*tgt: *this is |0-1| a |2-2| small |3-3| house |4-4|
*output: *this is a <bx id="1"> small house </bx>
However, there are often examples of paired tags (kind of like markup
around markup) which are not handled, e.g.
das ist ein *<bpt id="1"><b></bpt>*kleines haus*<ept id="1"></b></ept>*
In this case, the <bpt> and <ept> tags are paired, and everything
inside each pair of tags (e.g. *<b>*) should be stripped out, but
this doesn't appear to happen.
Is there another tool in the project that handles this kind of markup
or is it not supported?
Thanks
John
--
John Tinsley
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support