Re: [Moses-support] XLIFF support in the M4Loc project

Achim Ruopp Tue, 22 Oct 2013 20:13:36 -0700

Hi John,

M4Loc/Okapi can only deal with markup that surrounds whole tokens. In fact, to
work properly, the markup has to be separated from the tokens by whitespace;
the tokenizer wrapper wrap_tokenizer.pm does this as part of the overall
m4loc.pm umbrella script. So your example becomes:


<g id="1"> A </g> <g id="2"> 4 </g>

Shouldn't "A" and "4" in your example be considered two separate tokens for
the purpose of MT?
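
As a rough illustration of that whitespace separation, here is a minimal Python sketch. It is not the actual wrap_tokenizer.pm code; the function name separate_tags and the simple tag regex are assumptions made for this example.

```python
import re

# Matches one XLIFF-style inline tag such as <g id="1">, </g> or <x id="2"/>.
TAG = re.compile(r"<[^<>]+>")

def separate_tags(segment: str) -> str:
    """Pad every inline tag with spaces, then collapse repeated whitespace."""
    padded = TAG.sub(lambda m: f" {m.group(0)} ", segment)
    return " ".join(padded.split())

print(separate_tags('<g id="1">A</g><g id="2">4</g>'))
# <g id="1"> A </g> <g id="2"> 4 </g>
```

After this step "A" and "4" are separate whitespace-delimited tokens, which is why the tools treat them independently.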

 

It might be worth investigating if the whole construct can be replaced with
a placeholder (recently added to Moses). However, placeholders and markup
handling with M4Loc/Okapi likely won't play nicely together yet. There is a
work item in the M4Loc issue tracker:
http://code.google.com/p/m4loc/issues/detail?id=45 
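
To make the placeholder idea concrete, the following Python sketch masks a run of adjacent tagged pieces as one opaque token before translation and restores it afterwards. This is only an illustration under assumptions: the @PH0@ token format and the mask/unmask helpers are invented here and are not the Moses placeholder syntax, and the sketch presumes the decoder passes the token through unchanged and that the masked content needs no translation.

```python
import re

# A run of two or more <g>...</g> elements glued together without whitespace.
GLUED = re.compile(r'(?:<g id="[^"]+">[^<\s]+</g>){2,}')

def mask(segment):
    """Replace each glued run with a hypothetical @PHn@ token; return text and store."""
    store = []
    def repl(m):
        store.append(m.group(0))
        return f"@PH{len(store) - 1}@"
    return GLUED.sub(repl, segment), store

def unmask(segment, store):
    """Put the original markup back in place of each @PHn@ token."""
    return re.sub(r"@PH(\d+)@", lambda m: store[int(m.group(1))], segment)

masked, store = mask('Papier <g id="1">A</g><g id="2">4</g> bitte')
print(masked)                                # Papier @PH0@ bitte
print(unmask("paper @PH0@ please", store))   # paper <g id="1">A</g><g id="2">4</g> please
```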

 

You also might want to try the tag preservation method that leaves tags in
place during the decoding process (m4loc.pm option "-o t"). This would
certainly preserve the tag order in your example, but might lead to lower
translation quality overall (some recent tests have shown it to perform
pretty well on some test data in terms of BLEU).

 

Achim 

 

 

From: [email protected] [mailto:[email protected]] On Behalf Of John
Tinsley
Sent: Tuesday, October 22, 2013 8:23 AM
To: Tom Hoar; Achim Ruopp
Cc: [email protected]
Subject: Re: [Moses-support] XLIFF support in the M4Loc project

 

Hi folks,

 

I tried the chain of tools in M4Loc/Okapi and it worked with relative
success (it solved the original issue I had) but there seems to be one type
of markup it cannot manage, for example:

 

If we have a token, for example A4, that is marked-up in the following way:

 

<g id="1">A</g><g id="2">4</g>

 

the word alignment information is not sufficient to reinsert these tags
because there are tags *within* the token. So the output we get is like the
following:

 

<g id="1">A4</g><g id="2"></g>

 

i.e. the whole token is wrapped in the first tag and the second tag is
either empty or wrapping the next word (incorrectly). This can have a
knock-on effect if there are more tags in the same sentence.
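
One way to spot such segments before translation is to check whether any tag sits inside a token. The Python sketch below is an assumption-laden illustration, not part of M4Loc: it compares the token count when tags are deleted with the count when tags are replaced by spaces. If the counts differ, at least one tag splits a token.

```python
import re

TAG = re.compile(r"<[^<>]+>")

def has_intra_token_tag(segment: str) -> bool:
    """True if removing the tags glues together text that the tags kept apart."""
    glued = TAG.sub("", segment).split()    # tags deleted outright
    spaced = TAG.sub(" ", segment).split()  # tags replaced by a space
    return len(spaced) > len(glued)

print(has_intra_token_tag('<g id="1">A</g><g id="2">4</g>'))            # True
print(has_intra_token_tag('das ist ein <g id="1">kleines haus</g>'))    # False
```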

 

Is this known/solved somehow or am I out of luck? The only possible solution
I can imagine would be using some sort of character-based alignment to
reinsert the tags...

 

Cheers,

John

 

 

On 16 October 2013 16:47, Tom Hoar <[email protected]>
wrote:

Hi John,

If you're looking to completely remove these inline elements, you can remove the
tags, then unescape their contents, and run a second pass to remove the HTML
tags. That works if the contents of the bpt/ept tags are HTML. However they
could be RTF or some other markup language. We've found it's safe to simply
use a regex pattern to remove everything between the <bpt> .... </bpt> and
<ept> .... </ept>. 
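
A minimal Python sketch of that regex approach follows. The pattern and the function name strip_bpt_ept are illustrative assumptions, not the pattern actually used in any tool mentioned here.

```python
import re

# Non-greedy match of a whole <bpt>...</bpt> or <ept>...</ept> element,
# including its attributes and the escaped native markup inside it.
BPT_EPT = re.compile(r"<(bpt|ept)\b[^>]*>.*?</\1>", re.DOTALL)

def strip_bpt_ept(segment: str) -> str:
    """Delete every bpt/ept element together with its content."""
    return BPT_EPT.sub("", segment)

src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
print(strip_bpt_ept(src))  # das ist ein kleines haus
```

Because the content is discarded rather than unescaped, this works regardless of whether the native markup is HTML, RTF or something else.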

These tags are not generated by Okapi, but other tools do create them. So if
you're looking to regenerate these and other tags created by other tools in
the translated output, I think you're out of luck for now. We're developing
a tool that supports all XLIFF 1.2 inline elements during translation, but
it will not be published as open source. It's scheduled for completion by
the end of the year.

Hi Achim, 

Can you verify the "lb" tag you included in your list? I reviewed the XLIFF
1.2 spec and it's not there. I also reviewed the XLIFF 2.0 draft spec that
was published yesterday:
https://tools.oasis-open.org/version-control/browse/wsvn/xliff/trunk/xliff-20/xliff-core.pdf
It's a significant departure from 1.2! Any/all solutions that were developed
for 1.2's inline elements will need to be totally re-thought and re-written.






On 10/16/2013 09:20 PM, Achim Ruopp wrote:

Hi John,

The M4Loc tool chain only handles a subset of XLIFF inline tags generated by
the Okapi Moses Text Filter

http://www.opentag.com/okapi/wiki/index.php?title=Moses_Text_Filter 

The complete list of tags generated by the filter is: g|x|bx|ex|lb|mrk

 

If you aren't using the Okapi tools, you can still use their library, I
believe, to convert

das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>

into

das ist ein <g id="1">kleines haus</g>

and apply the reverse process to the translation.
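
The round trip can be sketched in plain Python as below. This is not the Okapi library API; it is a regex-based illustration that assumes non-nested bpt/ept pairs sharing the same id, and the helper names to_g and from_g are made up for this example.

```python
import re

# One <bpt id="N">...</bpt> text <ept id="N">...</ept> pair (no nesting assumed).
PAIR = re.compile(
    r'<bpt id="(?P<id>[^"]+)">(?P<open>.*?)</bpt>'
    r'(?P<text>.*?)'
    r'<ept id="(?P=id)">(?P<close>.*?)</ept>',
    re.DOTALL,
)
G = re.compile(r'<g id="(?P<id>[^"]+)">(?P<text>.*?)</g>', re.DOTALL)

def to_g(segment):
    """Collapse each bpt/ept pair into <g>, remembering the native markup by id."""
    native = {}
    def repl(m):
        tag_id, text = m.group("id"), m.group("text")
        native[tag_id] = (m.group("open"), m.group("close"))
        return f'<g id="{tag_id}">{text}</g>'
    return PAIR.sub(repl, segment), native

def from_g(segment, native):
    """Expand each <g> back into the bpt/ept pair stored under its id."""
    def repl(m):
        tag_id, text = m.group("id"), m.group("text")
        open_, close = native[tag_id]
        return f'<bpt id="{tag_id}">{open_}</bpt>{text}<ept id="{tag_id}">{close}</ept>'
    return G.sub(repl, segment)

src = 'das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>'
simplified, native = to_g(src)
print(simplified)  # das ist ein <g id="1">kleines haus</g>
print(from_g('this is a <g id="1">small house</g>', native))
```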

 

Alternatively you could modify M4Loc to handle all XLIFF inline tagging

http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#Struct_InLine 

But I think that would be messier, and by using Okapi you also get
future XLIFF (e.g. 2.0) support.

 

Achim 

 

From: [email protected] [mailto:[email protected]]
On Behalf Of John Tinsley
Sent: Wednesday, October 16, 2013 7:20 AM
To: [email protected]
Subject: [Moses-support] XLIFF support in the M4Loc project

 

Hi folks,

 

I'm having a little trouble with XLIFF handling using some of the M4Loc
tools, specifically 'reinsert.pm' for replacing inline markup after
translation (https://code.google.com/p/m4loc/wiki/Pod_reinsert).

 

It works fine for simple tags where the text between the tags *should* be
translated, e.g.

 

src: das ist ein <bx id="1">kleines haus</bx>

tgt: this is |0-1| a |2-2| small |3-3| house |4-4|

 

output: this is a <bx id="1"> small house </bx>
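
The behaviour in this example can be approximated with the Python sketch below. It is not the reinsert.pm implementation: it assumes only paired tags, a source with one whitespace-delimited token per index, and a trace in the |start-end| form shown above. It also assumes the target phrases covered by a tag stay contiguous, which reordering can break. All function names are made up for illustration.

```python
import re

TAG = re.compile(r"(<[^<>]+>)")
TRACE = re.compile(r"(.+?)\s*\|(\d+)-(\d+)\|\s*")

def tagged_spans(src):
    """Return (open_tag, close_tag, first_tok, last_tok) for each paired tag."""
    spans, stack, n_tokens = [], [], 0
    for piece in TAG.split(src):
        if piece.startswith("</"):
            open_tag, first = stack.pop()
            spans.append((open_tag, piece, first, n_tokens - 1))
        elif piece.startswith("<"):
            stack.append((piece, n_tokens))
        else:
            n_tokens += len(piece.split())
    return spans

def reinsert(src, traced_tgt):
    """Wrap the target phrases whose source span lies inside a tagged source span."""
    phrases = [(m.group(1), int(m.group(2)), int(m.group(3)))
               for m in TRACE.finditer(traced_tgt)]
    opens, closes = {}, {}
    for open_tag, close_tag, first, last in tagged_spans(src):
        covered = [i for i, (_, lo, hi) in enumerate(phrases)
                   if first <= lo and hi <= last]
        if covered:
            opens.setdefault(covered[0], []).append(open_tag)
            closes.setdefault(covered[-1], []).insert(0, close_tag)
    out = []
    for i, (text, _, _) in enumerate(phrases):
        out += opens.get(i, []) + [text] + closes.get(i, [])
    return " ".join(out)

print(reinsert('das ist ein <bx id="1">kleines haus</bx>',
               'this is |0-1| a |2-2| small |3-3| house |4-4|'))
# this is a <bx id="1"> small house </bx>
```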

 

However, there are often examples of paired tags (kind of like markup around
markup) which are not handled, e.g.

 

das ist ein <bpt id="1">&lt;b&gt;</bpt>kleines haus<ept id="1">&lt;/b&gt;</ept>

 

In this case, the <bpt> and <ept> tags are paired, and everything inside each
of those tags (e.g. &lt;b&gt;) should be stripped out, but that doesn't
appear to happen.

 

Is there another tool in the project that handles this kind of markup or is
it not supported?

 

Thanks

John


-- 

John Tinsley

 

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

 







 

-- 

Dr. John Tinsley
Research Integration Officer

Centre for Next Generation Localisation (CNGL)
Dublin City University

web: http://www.iptranslator.com
email: [email protected]
phone: +353 (0)1 7006916
--

