Re: [Moses-support] XML Markup

Tomas Hudik Wed, 21 Dec 2011 11:00:50 -0800

Hi Somayeh,

By default, Moses passes exactly this to the LM:

<num>19</num>

However, this can be changed with the Moses option -xml-input:

http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc4

Be careful: Moses processes plain text only, and this option does not support 
arbitrary valid XML. It is a bit tricky, so read the documentation carefully 
and run some experiments.
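
As a rough illustration of the kind of markup that page describes, here is a 
small Python sketch that wraps every numeric token in a tag carrying a 
translation attribute. The tag name "num", the function name mark_numbers and 
the choice to copy the number unchanged as its own translation are my 
assumptions for this example; as far as I read the documentation, the attribute 
is only honoured when the decoder runs with a non-default -xml-input mode such 
as "exclusive", so check that against your Moses version.

```python
import re

# A whole whitespace-delimited token made only of digits (assumes tokenized input).
NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")


def mark_numbers(line: str) -> str:
    """Wrap each numeric token as <num translation="N">N</num>.

    The number is proposed as its own translation; this is an assumption
    of the sketch, not something Moses requires.
    """
    return NUMBER.sub(
        lambda m: f'<num translation="{m.group(0)}">{m.group(0)}</num>', line
    )


print(mark_numbers("he paid 19 dollars"))
# he paid <num translation="19">19</num> dollars
```

Tokens that merely contain digits (for example "abc123") are left alone, so 
only standalone numbers get marked.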

But I suppose you'd like to strip the tags somehow. If so, use the m4loc 
software: http://code.google.com/p/m4loc/ which takes care of stripping the 
tags from the text and, after translation, tries to put them back into the 
translated text. Or, if you don't want to do it yourself, you can use a project 
that already includes this feature: Let'sMT! https://www.letsmt.eu/Start.aspx, 
PangeaMT, ...
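
To make the strip-and-reinsert idea concrete, here is a minimal Python sketch. 
It is not how m4loc is implemented; the function names, the restriction to 
tokens fully wrapped in a single tag, and the one-to-one source-to-target 
alignment dictionary are all assumptions made to keep the example short. In 
practice the alignment would have to come from the decoder's phrase and word 
alignment output.

```python
import re

# A token fully wrapped in one tag, e.g. <num>19</num>
WRAPPED = re.compile(r"^<(\w+)>(.*)</\1>$")


def strip_tags(line):
    """Return (plain tokens, {source token index: tag name})."""
    plain, tags = [], {}
    for i, tok in enumerate(line.split()):
        m = WRAPPED.match(tok)
        if m:
            tags[i] = m.group(1)      # remember where the tag was
            plain.append(m.group(2))  # keep only the inner text
        else:
            plain.append(tok)
    return plain, tags


def reinsert_tags(target_tokens, tags, alignment):
    """Re-wrap target tokens aligned to tagged source tokens.

    alignment maps source index -> target index (assumed one-to-one here).
    """
    out = list(target_tokens)
    for src, name in tags.items():
        tgt = alignment.get(src)
        if tgt is not None:
            out[tgt] = f"<{name}>{out[tgt]}</{name}>"
    return out


plain, tags = strip_tags("he paid <num>19</num> dollars")
translated = ["er", "zahlte", "19", "Dollar"]  # stand-in for decoder output
alignment = {0: 0, 1: 1, 2: 2, 3: 3}           # stand-in word alignment
print(" ".join(reinsert_tags(translated, tags, alignment)))
# er zahlte <num>19</num> Dollar
```

A tagged source token with no aligned target token is simply dropped here; a 
real tool needs a policy for that case and for tags spanning several words.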

Cheers, Tomas

Dear Dr. Koehn,

As I understand it, Moses treats tags as a special uniform token.

If so, it will pass a uniform token to the LM too. Is it recognizable by the LM?

I mean, how should we build the LM so that it can tell the special tagged 
tokens apart from ordinary words and still treat all of them alike?

What does Moses pass to the LM when it sees "<num>19</num>"? Does it pass 
<num>, or does it pass the whole token "<num>19</num>"?

Then I think in the first case we should replace all the numbers in the corpus 
with just <num> and build the LM on this new corpus. In the latter case this 
kind of tagging will not help us, because for two numbers, 19 and 20, two 
distinct tokens "<num>19</num>" and "<num>20</num>"

will be passed to the LM and no generalization will happen.
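
For the first case, a minimal Python sketch of the corpus rewrite could look 
like this. The placeholder string, the function name normalize_numbers and the 
number pattern (digits with optional "." or "," groups, on already tokenized 
text) are assumptions for illustration; the rewritten corpus would then be fed 
to the LM toolkit as usual, so that 19 and 20 share a single token.

```python
import re

# A standalone number such as 19, 3.5 or 1,000.50 (assumes tokenized text).
NUMBER = re.compile(r"(?<!\S)\d+(?:[.,]\d+)*(?!\S)")


def normalize_numbers(line, placeholder="<num>"):
    """Replace every standalone number with one shared placeholder token."""
    return NUMBER.sub(placeholder, line)


for sentence in ["he paid 19 dollars", "he paid 20 dollars"]:
    print(normalize_numbers(sentence))
# he paid <num> dollars
# he paid <num> dollars
```

Both sentences collapse to the same token sequence, which is the integration 
asked about; the cost is that the actual digits must be restored after 
translation.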

On 12/20/11, Philipp Koehn <[email protected]> wrote:

> Hi,
>
> it seems to be best to remember the location of the XML markup, strip
> them out during translation and re-insert them into the output. The
> exact location of the markup can be determined with the phrase and
> word alignment of the translation.
>
> You could also just leave them in, but since "<num>19</num>" is
> treated as a token, you may want to inserted. But still, the tags may
> get reshuffled by arbitrary preferences of the language model.
>
> -phi
>
> On Sat, Dec 17, 2011 at 2:38 AM, somayeh bakhshaei
> <[email protected]> wrote:
>> Hello,
>>
>> We intend to add XML tags to our corpus but we are not sure how the
>> Moses decoder and SRILM uses these tags in training and decoding phase.
>>
>> For example if we tag 19 in main corpus like this:
>> 19  ---> <num>19</num>
>>
>> How does LM must be made on this tagged corpus using SRILM?
>> Does SRILM consider whether <num>  or <num>19</num> as a token?
>>
>> Also in decoding phase:
>> How does moses pass the tagged tokens to the LM?
>> For example if test is tagged like this:
>> <num>19</num>
>> Does it pass just <num> or whole of it as <num>19</num>
>>
>>
>> ---------------------
>> Best Regards,
>> S.Bakhshaei
>>
>> After All you will come ....
>> And will spread light on the dark desolate world!
>> O' Kind Father! We will be waiting for your affectionate hands ...
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>

--

---------------------

Best Regards,

S.Bakhshaei
