Hi John,

M4Loc/Okapi can only deal with markup that surrounds whole tokens. In fact, to work properly, the markup has to be separated from the tokens by whitespace; the tokenizer wrapper wrap_tokenizer.pm does this as part of the overall m4loc.pm umbrella script. So in your example:

<g id="1"> A </g> <g id="2"> 4 </g>

Shouldn't "A" and "4" be considered two separate tokens for the purpose of MT?

It might be worth investigating whether the whole construct can be replaced with a placeholder (recently added to Moses). However, placeholders and markup handling with M4Loc/Okapi likely won't play nicely together yet. There is a work item in the M4Loc issue tracker: http://code.google.com/p/m4loc/issues/detail?id=45

You might also want to try the tag preservation method that leaves tags in place during the decoding process (m4loc.pm option "-o t"). This would certainly preserve the tag order in your example, but might lead to lower translation quality overall (although some recent tests have shown it to perform pretty well on some test data in terms of BLEU).

Achim

From: [email protected] [mailto:[email protected]] On Behalf Of John Tinsley
Sent: Tuesday, October 22, 2013 8:23 AM
To: Tom Hoar; Achim Ruopp
Cc: [email protected]
Subject: Re: [Moses-support] XLIFF support in the M4Loc project

Hi folks,

I tried the chain of tools in M4Loc/Okapi and it worked with relative success (it solved the original issue I had), but there seems to be one type of markup it cannot manage. For example, if we have a token, say A4, that is marked up in the following way:

<g id="1">A</g><g id="2">4</g>

the word alignment information is not sufficient to reinsert these tags, because there are tags *within* the token. So the output we get looks like the following:

<g id="1">A4</g><g id="2"></g>

i.e. the whole token is wrapped in the first tag and the second tag is either empty or (incorrectly) wraps the next word. This can have a knock-on effect if there are more tags in the same sentence.

Is this known/solved somehow or am I out of luck? The only possible solution I can imagine would be using some sort of character-based alignment to reinsert the tags...
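For illustration of the A4 case, here is a minimal Python sketch of padding inline tags with whitespace so that every tag boundary also becomes a token boundary. This is not the code of wrap_tokenizer.pm (which is Perl); the function name pad_tags and the regex are assumptions made up for this example, and it assumes tag attributes contain no runs of multiple spaces.

```python
import re

# Matches any XML-style inline tag such as <g id="1">, </g> or <x id="2"/>.
# Hypothetical pattern for this sketch; not taken from M4Loc.
TAG = re.compile(r"<[^<>]+>")


def pad_tags(segment: str) -> str:
    """Surround every inline tag with spaces, then collapse repeated whitespace."""
    padded = TAG.sub(lambda m: f" {m.group(0)} ", segment)
    # split()/join() collapses whitespace runs and trims both ends.
    return " ".join(padded.split())


print(pad_tags('<g id="1">A</g><g id="2">4</g>'))
# prints: <g id="1"> A </g> <g id="2"> 4 </g>
```

With the tags padded this way, "A" and "4" are separate tokens, so each tag pair surrounds a whole token and word-level alignment has something to attach to. Whether splitting "A4" is acceptable for translation quality is a separate question.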
Cheers,
John

On 16 October 2013 16:47, Tom Hoar <[email protected]> wrote:

Hi John,

If you're looking to completely remove these inline elements, you remove the tags, then unescape their contents, and run a second pass to remove the HTML tags. That works if the contents of the bpt/ept tags are HTML. However, they could be RTF or some other markup language. We've found it's safe to simply use a regex pattern to remove everything between the <bpt> .... </bpt> and <ept> .... </ept> tags.

These tags are not generated by Okapi, but other tools do create them. So if you're looking to regenerate these and other tags created by other tools in the translated output, I think you're out of luck for now. We're developing a tool that supports all XLIFF 1.2 inline elements during translation, but it will not be published as open source. It's scheduled for completion by the end of the year.

Hi Achim,

Can you verify the "lb" tag you included in your list? I reviewed the XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec that was published yesterday: https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf. It's a significant departure from 1.2! Any/all solutions that were developed for 1.2's inline elements will need to be totally re-thought and re-written.

On 10/16/2013 09:20 PM, Achim Ruopp wrote:

Hi John,

The M4Loc tool chain only handles a subset of XLIFF inline tags, namely those generated by the Okapi Moses Text Filter: http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter

The complete list of tags generated by the filter is:

g|x|bx|ex|lb|mrk

If you aren't using the Okapi tools, you can still use their library, I believe, to convert

das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>

into

das ist ein <g id="1">kleines haus</g>

and apply the reverse process to the translation.
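As a rough sketch of that conversion and its reverse, the following Python uses plain regular expressions rather than the Okapi library. It is not Okapi's API: the function names, the patterns, and the codes dictionary are assumptions for illustration only. It also assumes well-formed input in which every <bpt> has a matching <ept> with the same id and the resulting <g> elements are properly nested.

```python
import re

# Hypothetical patterns: a <bpt>/<ept> element with an id, including its native code.
BPT = re.compile(r'<bpt\s+id="([^"]+)"[^>]*>.*?</bpt>', re.DOTALL)
EPT = re.compile(r'<ept\s+id="([^"]+)"[^>]*>.*?</ept>', re.DOTALL)
G_TAG = re.compile(r'<g\s+id="([^"]+)">|</g>')


def bpt_ept_to_g(segment):
    """Replace bpt/ept pairs with <g> tags; return the new text and the stored codes."""
    codes = {}

    def open_tag(m):
        codes.setdefault(m.group(1), {})["bpt"] = m.group(0)
        return f'<g id="{m.group(1)}">'

    def close_tag(m):
        codes.setdefault(m.group(1), {})["ept"] = m.group(0)
        return "</g>"

    converted = EPT.sub(close_tag, BPT.sub(open_tag, segment))
    return converted, codes


def g_to_bpt_ept(segment, codes):
    """Reverse step: put the stored bpt/ept elements back in place of <g> tags."""
    stack = []  # ids of currently open <g> elements

    def restore(m):
        if m.group(1) is not None:  # opening <g id="...">
            stack.append(m.group(1))
            return codes[m.group(1)]["bpt"]
        return codes[stack.pop()]["ept"]  # closing </g>

    return G_TAG.sub(restore, segment)


source = 'das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>'
converted, codes = bpt_ept_to_g(source)
print(converted)
# prints: das ist ein <g id="1">kleines haus</g>
print(g_to_bpt_ept('this is a <g id="1">small house</g>', codes))
# prints: this is a <bpt id="1"><b></bpt>small house<ept id="1"></b></ept>
```

The stored codes have to be kept per segment so that the reverse step can be applied to the translation after tag reinsertion.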
Alternatively, you could modify M4Loc to handle all XLIFF inline tagging: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine

But I think that this would be messier, and by using Okapi you also get future XLIFF (e.g. 2.0) support.

Achim

From: [email protected] [mailto:[email protected]] On Behalf Of John Tinsley
Sent: Wednesday, October 16, 2013 7:20 AM
To: [email protected]
Subject: [Moses-support] XLIFF support in the M4Loc project

Hi folks,

I'm having a little trouble with XLIFF handling using some of the M4Loc tools, specifically 'reinsert.pm' for replacing inline markup after translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).

It works fine for simple tags where the text between the tags *should* be translated, e.g.

src: das ist ein <bx id="1">kleines haus</bx>
tgt: this is |0-1| a |2-2| small |3-3| house |4-4|
output: this is a <bx id="1"> small house </bx>

However, there are often examples of paired tags (kind of like markup around markup) which are not handled, e.g.

das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>

In this case, the <bpt> and <ept> tags are paired, and everything in between both sets of tags should be stripped out, e.g. <b>, but this doesn't appear to be the case.

Is there another tool in the project that handles this kind of markup or is it not supported?

Thanks
John

--
John Tinsley

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

--
Dr. John Tinsley
Research Integration Officer
Centre for Next Generation Localisation (CNGL)
Dublin City University
web: http://www.iptranslator.com
email: [email protected]
phone: +353 (0)1 7006916
