The reality is that the current -xml-input functionality straddles the fence between the schema-less and defined-schema worlds. It's "<anytag/> except <wall/> and <zone/> and <ne/>." Moses currently supports only four functions with XML markup: specifying alternate translations, walls, zones and named entities. I'm not sure a full XML parser is necessary for four functions, but the chance of accidental conflicts grows with the number of functions.

It seems more efficient to assign a tag name to the only current function that doesn't have a reserved tag name. Then, the undefined tag names become the exception that Moses ignores.
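
To make the proposal concrete, here is a minimal sketch of that dispatch. It is written in Python with the standard library rather than in Moses' C++, and it is not how Moses works today. The names "wall", "zone" and "ne" come from this thread; "np" is only one of the candidates floated for the alternate-translation tag, and the attribute name in the sample sentence is illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed reserved names. "np" is hypothetical: it is one candidate from
# this thread for the function that has no reserved tag name yet.
RESERVED = {
    "np": "translation",
    "wall": "wall",
    "zone": "zone",
    "ne": "named-entity",
}

def classify_tags(sentence):
    """Return (tag, function) pairs; undefined tag names map to 'ignored'."""
    # Wrap the line in a dummy root so mixed text and tags parse as one document.
    root = ET.fromstring("<s>" + sentence + "</s>")
    return [
        (el.tag, RESERVED.get(el.tag, "ignored"))
        for el in root.iter()
        if el is not root
    ]

result = classify_tags(
    'the <np translation="maison">house</np> <wall/> is <b>small</b>'
)
```

With a fixed name for every function, an unknown tag such as <b/> falls through to "ignored" instead of being read as an alternate translation.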

Tom


On 10/16/2013 11:16 PM, Achim Ruopp wrote:

<anytag/> is XML-compliant in schema-less XML (as long as the tag name complies with http://www.w3.org/TR/REC-xml/#NT-Name).

IMHO Moses input (with the -xml-input option) should stay schema-less, or we should define a schema. Right now I can't see a pressing reason to define a schema.

In any case it would be good to parse the input (with the -xml-input option) with a proper XML parser, e.g.

http://www.boost.org/doc/libs/1_54_0/doc/html/boost_propertytree/parsers.html#boost_propertytree.parsers.xml_parser

There are probably better XML parsers, but Moses already requires Boost. Using an XML parser could also solve some of the character escaping uncertainty.
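
As a rough illustration of what a real parser buys, the sketch below uses Python's standard xml.etree.ElementTree as a stand-in for the Boost property_tree parser linked above; it does not show the Boost API. The function name parse_sentence and the dummy <s> root element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def parse_sentence(sentence):
    """Parse one input line as schema-less XML.

    Returns (plain_text, tag_names). The parser resolves entity
    references, so escaping is handled in one well-defined place.
    """
    # Dummy root (an assumption of this sketch) so the line is one document.
    root = ET.fromstring("<s>" + sentence + "</s>")
    text = "".join(root.itertext())
    tags = [el.tag for el in root.iter() if el is not root]
    return text, tags

text, tags = parse_sentence(
    'AT&amp;T <anytag translation="x &lt; y">a &lt; b</anytag>'
)

# An unescaped "<" is not well-formed XML, so the parser rejects the line
# instead of guessing what was meant.
try:
    parse_sentence("a < b")
    malformed_rejected = False
except ET.ParseError:
    malformed_rejected = True
```

Here text comes back with the entities resolved and any tag name is accepted, while input that is not well-formed fails loudly rather than being passed through half-interpreted.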

Achim

*From:*[email protected] [mailto:[email protected]] *On Behalf Of *[email protected]
*Sent:* Tuesday, October 15, 2013 10:25 PM
*To:* [email protected]
*Subject:* Re: [Moses-support] Placeholders

A change from <anytag/> will no doubt disrupt existing pipelines. Communicating the change with the new release will be a great help.

On 2013-10-15 01:35, Hieu Hoang wrote:

    they're good ideas. I'll have a think if I get round to doing it.

    I would also want to minimise the work I have to do, and minimise
    the disruption to people's existing pipelines.

    On 15 October 2013 01:33, Tom Hoar <[email protected] <mailto:[email protected]>> wrote:

    I agree that <anytag/> could cause problems, especially with the growing
    list of reserved tag names (ne, wall, zone). I wholeheartedly support a
    fixed tag, but I'm not sure "option" is it. What about <np/> (already in
    the manual) or <xml-markup/> or <xml-input/> or <moses/>?

    Here's another idea. The -xml-input flag supports the values "exclusive,"
    "inclusive," "ignore" and "pass-through." What about changing the flag
    to a boolean flag? Then use the values as the XML tags: <exclusive/>,
    <inclusive/> and <ignore/>, so that one invocation of Moses would support
    all modes on a per-sentence basis. Just a thought. I think this would also
    be easier if you dropped the "pass-through" option, because there is no
    need for backwards compatibility.

    Another idea, although slightly different subject. Moses'
    -monotone-at-punctuation flag would be more useful if we could
    define/override the punctuation & symbols that we want it to use. Not
    sure how to best accomplish this.

    Tom




    On 10/15/2013 04:07 AM, Hieu Hoang wrote:
    > In fact, we're thinking of changing <anytag/> to something fixed, like
    > <option/>
    >
    > The <anytag/> behaviour isn't good XML and will cause problems in the
    > future
    >
    > Any opinions on this gratefully received
    >

    _______________________________________________
    Moses-support mailing list
    [email protected] <mailto:[email protected]>
    http://mailman.mit.edu/mailman/listinfo/moses-support




-- Hieu Hoang
    Research Associate
    University of Edinburgh
    http://www.hoang.co.uk/hieu


