"Jimmy O'Regan" <[email protected]>
writes:
> On 29 September 2011 22:13, Hector <[email protected]> wrote:
>> Hi!
>> I think I discovered a subtle bug in the way the formatting is
>> handled. In brief, this is the problem (from Spanish to English):
>>
>
> Not a bug, per se; more a limitation in the design. I think it's even
> mentioned in the documentation. IIRC, we had a discussion about more
> or less the same thing in the last few days.
>
>> Input: "quiero una manzana <em>roja</em> del huerto"
>> Output 1: "I want a red <em>apple</em> of the orchard"
>>
>> note that the emphasis is in the wrong place. With spectie's help over
>> IRC, I changed es-en.t1x for the rule "REGLA: DET NOM ADJ" to just
>> swap the blanks, i.e. <b pos="1"/> <b pos="2"/>. Now the output is:
>>
>> Output 2: "I want a <em>red apple</em> of the orchard."
>
> Changing the order of the blanks is, generally, a bad idea. Think
> about what would have happened if the input had been 'una
> <em>manzana</em> roja'.
>
>>
>> If you look at this debug printout, you'll notice that the problem is
>> that the "</em>" marker is outside the chunk during transfer:
>>
>
> Yes, otherwise you would have to have space handling at the end of every
> chunk.

Had an idea … So the general problem is that some formatting should
stick to words when they're moved, and some shouldn't.

Say we had a deshtml that turned

    … una <i>manzana <b>roja</b></i> del …

into

    … una manzana[:<i/>] roja[:<b/><i/>] del …

etc. for any of a certain set of "word-level tags"[1]. Of course the
rehtml would have to redistribute tags from any of those sticky blanks.
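
To make the deformatter side concrete, here is a rough Python sketch of the idea. It is not how the real deshtml works; the function name deshtml_sticky, the WORD_LEVEL_TAGS set and the assumption of well-nested tags without attributes are all made up for illustration.

```python
import re

# Hypothetical set of "word-level tags"; the real set would need discussion.
WORD_LEVEL_TAGS = {"i", "b", "em"}

# A tag without attributes, a run of whitespace, or a run of other text.
TOKEN = re.compile(r"<(/?)(\w+)>|(\s+)|([^<\s]+)")


def deshtml_sticky(html: str) -> str:
    """Attach the currently open word-level tags to each word as a sticky blank."""
    open_tags = []  # open word-level tags, outermost first
    out = []
    for m in TOKEN.finditer(html):
        closing, tag, space, word = m.groups()
        if tag is not None:
            if tag not in WORD_LEVEL_TAGS:
                out.append(f"[{m.group(0)}]")  # ordinary (non-sticky) superblank
            elif closing:
                open_tags.pop()                # assumes well-nested input
            else:
                open_tags.append(tag)
        elif space is not None:
            out.append(space)
        else:
            out.append(word)
            if open_tags:
                # innermost tag first, as in the example above
                sticky = "".join(f"<{t}/>" for t in reversed(open_tags))
                out.append(f"[:{sticky}]")
    return "".join(out)


print(deshtml_sticky("una <i>manzana <b>roja</b></i> del"))
```

Each word carries its own copy of the open tags, which is what lets transfer move words around without losing track of the formatting.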

Any transfer rules written the way they are today would, for backwards
compatibility, work exactly the same. However, transfer could also have
A New Feature:

  <action>
    <consume-following-blank/> <!-- if there's a sticky blank after the
                                    chunk, ensure we can use it, and that
                                    it's not output after this rule is
                                    done -->
    <out>
      <clip pos="2"/>
      <b pos="2" type="sticky"/>
      <b pos="1" type="normal"/> <!-- e.g. a <p>, space or a
                                      non-alphabetic character -->
      <clip pos="1"/>
      <b pos="1" type="sticky"/>
    </out>
  </action>

In the example above, <b pos="1" type="sticky"/> is the string "[:<i/>]",
while <b pos="1" type="normal"/> is " ".
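
As a sanity check of what such a rule would output, here is a small Python model of it. This is only a simulation of the proposal, not transfer code: the Unit type and swap_noun_adj are hypothetical names, and the words are already translated to keep things short.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    word: str    # lexical unit (already translated here, for brevity)
    sticky: str  # sticky blank right after the word, e.g. "[:<i/>]"
    normal: str  # normal blank after the sticky one, e.g. " "


def swap_noun_adj(noun: Unit, adj: Unit) -> str:
    """Emit ADJ NOUN, with each word keeping its own sticky blank."""
    return (
        adj.word + adj.sticky        # <clip pos="2"/> <b pos="2" type="sticky"/>
        + noun.normal                # <b pos="1" type="normal"/>
        + noun.word + noun.sticky    # <clip pos="1"/> <b pos="1" type="sticky"/>
    )


noun = Unit("apple", "[:<i/>]", " ")
adj = Unit("red", "[:<b/><i/>]", " ")
print(swap_noun_adj(noun, adj))  # red[:<b/><i/>] apple[:<i/>]
```

The normal blank after the second word is deliberately left alone; only its sticky blank is consumed by the rule.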

Could this work? It'd require more involved deformatters, but at least
it would not require any additional changes for those language pairs
that don't want to use the feature.

Regarding the [:] format, I think the only way you'd see a colon at the
beginning of a superblank would be if that character was not in your
<alphabet/>. Even then it shouldn't be too hard to say, in rehtml, "if
you see a blank like [:] without legal word-level tags in it, print the
colon".
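
The reformatter side, including that colon fallback, might look roughly like the Python sketch below. Again this is an assumption-laden illustration: rehtml_sticky and WORD_LEVEL_TAGS are invented names, and adjacent spans with the same tag are not merged back together.

```python
import re

# Hypothetical set of legal word-level tags.
WORD_LEVEL_TAGS = {"i", "b", "em"}

# A word immediately followed by a blank of the form [:...]
STICKY = re.compile(r"([^\s\[\]]+)\[:([^\]]*)\]")


def rehtml_sticky(text: str) -> str:
    """Turn word[:<b/><i/>] back into <i><b>word</b></i>."""

    def restore(m: re.Match) -> str:
        word, content = m.group(1), m.group(2)
        tags = re.findall(r"<(\w+)/>", content)
        legal = (
            bool(tags)
            and "".join(f"<{t}/>" for t in tags) == content
            and all(t in WORD_LEVEL_TAGS for t in tags)
        )
        if not legal:
            # Not a sticky blank after all: print the colon and the content.
            return word + ":" + content
        for t in tags:  # tags are listed innermost first
            word = f"<{t}>{word}</{t}>"
        return word

    return STICKY.sub(restore, text)


print(rehtml_sticky("a red[:<b/><i/>] apple[:<i/>] of"))
```

Wrapping each word separately is the simplest thing that works; merging neighbouring <i>…</i> spans would be a cosmetic extra step.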

Footnotes:

[1] A slightly more complicated case:

    … una <i>manzana <b>roja</b></i><p>El huerto …

would turn into:

    … una manzana[:<i/>] roja[:<b/><i/>][][<p>]El huerto …

where "[:<b/><i/>]" is sticky and "[][<p>]" is normal.

--
Kevin Brubeck Unhammer
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff