Hi folks,

I tried the chain of tools in M4Loc/Okapi and it worked with relative success (it solved the original issue I had), but there seems to be one type of markup it cannot manage.
If we have a token, for example A4, that is marked up in the following way:

    <g id="1">A</g><g id="2">4</g>

the word alignment information is not sufficient to reinsert these tags, because there are tags *within* the token. So the output we get looks like this:

    <g id="1">A4</g><g id="2"></g>

i.e. the whole token is wrapped in the first tag, and the second tag is either empty or (incorrectly) wraps the next word. This can have a knock-on effect if there are more tags in the same sentence.

Is this known/solved somehow or am I out of luck? The only possible solution I can imagine would be using some sort of character-based alignment to reinsert the tags...

Cheers,
John

On 16 October 2013 16:47, Tom Hoar <[email protected]> wrote:

> Hi John,
>
> If you're looking to completely remove these inline elements, you remove
> the tags, then unescape their contents, and run a second pass to remove the
> html tags. That works if the contents of the bpt/ept tags are html. However
> they could be RTF or some other markup language. We've found it's safe to
> simply use a regex pattern to remove everything between the <bpt> ....
> </bpt> and <ept> .... </ept>.
>
> These tags are not generated by Okapi, but other tools do create them. So
> if you're looking to regenerate these and other tags created by other tools
> in the translated output, I think you're out of luck for now. We're
> developing a tool that supports all XLIFF 1.2 inline elements during
> translation, but it will not be published as open source. It's scheduled
> for completion by the end of the year.
>
> Hi Achim,
>
> Can you verify the "lb" tag you included in your list? I reviewed the
> XLIFF 1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec
> that was published yesterday:
> https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf.
> It's a significant departure from 1.2! Any/all solutions that were
> developed for 1.2's inline elements will need to be totally re-thought and
> re-written.
>
> On 10/16/2013 09:20 PM, Achim Ruopp wrote:
>
> Hi John,
>
> The M4Loc tool chain only handles a subset of XLIFF inline tags generated
> by the Okapi Moses Text Filter
>
> http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter
>
> The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk
>
> If you aren't using the Okapi tools, you can still use their library, I
> believe, to convert
>
>     das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>
>
> into
>
>     das ist ein <g id="1">kleines haus</g>
>
> and apply the reverse process to the translation.
>
> Alternatively you could modify M4Loc to handle all XLIFF inline tagging
>
> http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine
>
> But I think that this would be more messy and with using Okapi you also
> get future XLIFF (e.g. 2.0) support.
>
> Achim
>
> From: [email protected] [mailto:[email protected] <[email protected]>] On Behalf Of John Tinsley
> Sent: Wednesday, October 16, 2013 7:20 AM
> To: [email protected]
> Subject: [Moses-support] XLIFF support in the M4Loc project
>
> Hi folks,
>
> I'm having a little trouble with XLIFF handling using some of the M4Loc
> tools, specifically 'reinsert.pm' for replacing inline markup after
> translation. (https://code.google.com/p/m4loc/wiki/Pod_reinsert)
>
> It works fine for simple tags where the text between the tags *should* be
> translated, e.g.
>
>     src:    das ist ein <bx id="1">kleines haus</bx>
>     tgt:    this is |0-1| a |2-2| small |3-3| house |4-4|
>     output: this is a <bx id="1"> small house </bx>
>
> However, there are often examples of paired tags (kind of like markup
> around markup) which are not handled, e.g.
>
>     das ist ein <bpt id="1"><b></bpt>kleines haus<ept id="1"></b></ept>
>
> In this case, the <bpt> and <ept> tags are paired, and everything in
> between both sets of tags should be stripped out, e.g. <b>, but
> this doesn't appear to be the case.
>
> Is there another tool in the project that handles this kind of markup or
> is it not supported?
>
> Thanks
> John
>
> --
> John Tinsley
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Dr. John Tinsley
Research Integration Officer
Centre for Next Generation Localisation (CNGL)
Dublin City University

web: http://www.iptranslator.com
email: [email protected]
phone: +353 (0)1 7006916
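To make the character-based idea at the top of this message a bit more concrete, here is a minimal Python sketch. It is not how reinsert.pm or Okapi work; the function names (split_token, insert_tags, reinsert) and the simple source-index to target-index alignment dict are assumptions made purely for illustration. It only covers the easy case where the tagged token (like A4) is copied into the target unchanged, so the character offsets recorded on the source side are still valid. Anything else falls back, crudely, to wrapping the whole target token.

```python
import re

# Matches any inline tag such as <g id="1"> or </g>.
TAG = re.compile(r"<[^>]+>")


def split_token(token):
    """Split one marked-up token into plain text and (char_offset, tag) pairs."""
    plain = ""
    tags = []
    pos = 0
    for match in TAG.finditer(token):
        plain += token[pos:match.start()]
        # The offset is measured in the plain (tag-free) text.
        tags.append((len(plain), match.group(0)))
        pos = match.end()
    plain += token[pos:]
    return plain, tags


def insert_tags(text, tags):
    """Insert tags into text at the recorded character offsets."""
    out = []
    pos = 0
    for offset, tag in tags:
        out.append(text[pos:offset])
        out.append(tag)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


def reinsert(src_tokens, tgt_tokens, alignment):
    """Reinsert tags; alignment maps a source token index to one target index."""
    result = list(tgt_tokens)
    for src_idx, token in enumerate(src_tokens):
        plain, tags = split_token(token)
        if not tags or src_idx not in alignment:
            continue
        tgt_idx = alignment[src_idx]
        if tgt_tokens[tgt_idx] == plain:
            # Token was copied unchanged, so the source offsets still apply.
            result[tgt_idx] = insert_tags(plain, tags)
        else:
            # Crude fallback: move every tag out to the token boundary.
            opening = "".join(t for _, t in tags if not t.startswith("</"))
            closing = "".join(t for _, t in tags if t.startswith("</"))
            result[tgt_idx] = opening + tgt_tokens[tgt_idx] + closing
    return " ".join(result)


src = ["Papier", '<g id="1">A</g><g id="2">4</g>']
print(reinsert(src, ["paper", "A4"], {0: 0, 1: 1}))
# prints: paper <g id="1">A</g><g id="2">4</g>
```

For tokens that really are translated, the offsets no longer line up, and something like a genuine character-level alignment would still be needed; the fallback branch above just keeps the output well-formed.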
_______________________________________________ Moses-support mailing list [email protected] http://mailman.mit.edu/mailman/listinfo/moses-support
