> My first thought would be:
>
> Benedikt, did you already read the earlier discussion on this subject?
>
> As far as I remember a nice solution came up on the discussion list -
> which was simple to understand and 'just' needs to be implemented.
>
> Here's a
> pointer: https://sourceforge.net/p/apertium/mailman/message/31782760/
Wow, I didn't read that, indeed.
From the wiki:
'To deal with the t2x/chunk-reordering issue mentioned above, <b
pos="N"/> should no longer output anything. Any non-inline superblanks
that are between patterns (ie. the stuff that would go in the b element)
should be output before the rule. This would remove the whole issue
since now only inline-blanks are allowed in chunks.'
But wouldn't that still have problems with a text like the following?
<div class="red">This is <i><b>supposed</b> to be red</i></div>
<div class="green">This is <i><b>supposed</b> to be green</i></div>
That is: if <i> and <b> are inline-blanks and <div> is non-inline, <i>
and <b> will be moved along in t1x, but t2x might still swap the
contents of the <div>s?
What I had in mind could use the same infrastructure though.
That would require the following steps:
1. assign natural numbers to all elements of the XML-DOM
2. strip all XML-tags from the text
3. equip every word with an origin-descriptor, i.e.
[{number referencing parent tag}]^foo<tags>$
[{number referencing parent tag}]^bar<tags>$
4. Perform the transfer steps as outlined in Tino Didriksen's solution
5. leave the reinsertion/rearrangement of the XML-tags to a
DTD-aware reformatter
This would be equivalent to applying Tino's approach to
<1>This is </1><3>supposed</3><2> to be red</2>
<4>This is </4><6>supposed</6><5> to be green</5>
(treating all elements as inline-blanks)
Explanation:
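<div class="red">This is <i><b>supposed</b> to be red</i></div>
  tag No.: <div> = 1, <i> = 2, <b> = 3
  parent:  "This is " = 1, "supposed" = 3, " to be red" = 2
<div class="green">This is <i><b>supposed</b> to be green</i></div>
  tag No.: <div> = 4, <i> = 5, <b> = 6
  parent:  "This is " = 4, "supposed" = 6, " to be green" = 5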
Now, if transfer swaps "This is supposed to be red" and "This is
supposed to be green", both sentences would still get the right colors,
because every word carries the number of its parent tag with it.
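
Steps 1 to 3 could be sketched roughly like this in Python. This is only an illustration under my own assumptions, not existing Apertium code: the function name annotate, the synthetic <wrapper> element and the (number, word) tuples are all made up here, and serializing the tuples into the actual stream format (with morphological tags) is left out.

```python
import xml.etree.ElementTree as ET

# The two-<div> example from this mail.
FRAGMENT = (
    '<div class="red">This is <i><b>supposed</b> to be red</i></div>\n'
    '<div class="green">This is <i><b>supposed</b> to be green</i></div>'
)


def annotate(fragment):
    """Return (parent tag number, word) pairs for an XML fragment.

    Hypothetical helper: elements are numbered 1, 2, 3, ... in document
    order, tags are dropped, and each word keeps the number of the
    innermost element that directly contains it.
    """
    # Wrap the fragment so it has a single root; the wrapper gets no number.
    wrapper = ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    elements = list(wrapper.iter())[1:]  # skip the wrapper itself
    numbers = {elem: n for n, elem in enumerate(elements, start=1)}

    tokens = []

    def visit(elem):
        # Text before the first child belongs to elem.
        for word in (elem.text or "").split():
            tokens.append((numbers[elem], word))
        for child in elem:
            visit(child)
            # Text after a child's closing tag still belongs to elem.
            for word in (child.tail or "").split():
                tokens.append((numbers[elem], word))

    for top in wrapper:
        visit(top)
    return tokens


tokens = annotate(FRAGMENT)
# Pretend that transfer swapped the two sentences (six words each).
swapped = tokens[6:] + tokens[:6]
```

After the swap, "red" is still paired with tag number 2 and "green" with tag number 5, so a later reinsertion step has what it needs to put each word back into the right <div>.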
The DTD-aware reformatter can make sure that even in a document like the
following, nothing can go wrong (i.e. the <keyword> does not get split).
<weird-document>
<entry>
Each entry may contain exactly <keyword>one keyword</keyword>, not two!
</entry>
</weird-document>
Without such a reformatter, the inline element <keyword> might get
split, and preventing that would exceed the scope of a translation
engine.
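
To make the "does not get split" check concrete, here is a rough Python sketch of how a reformatter might at least detect a split element after transfer. It is an assumption on my part, not a design anyone has agreed on: the names ancestors, split_elements and parent_of are hypothetical, tokens are (parent tag number, word) pairs, and parent_of maps each tag number to the number of its enclosing tag. How a DTD-aware tool would then repair the split is not covered.

```python
def ancestors(origin, parent_of):
    """Return origin followed by all enclosing tag numbers, innermost first."""
    chain = []
    while origin is not None:
        chain.append(origin)
        origin = parent_of.get(origin)
    return chain


def split_elements(tokens, parent_of):
    """Return tag numbers whose words are no longer contiguous.

    An element covers its own words and those of all its descendants,
    so each token counts for its parent tag and every ancestor of it.
    """
    positions = {}
    for index, (origin, _word) in enumerate(tokens):
        for number in ancestors(origin, parent_of):
            positions.setdefault(number, []).append(index)
    # Indices are collected in increasing order, so a gap-free run
    # spans exactly as many positions as it has entries.
    return sorted(
        number
        for number, found in positions.items()
        if found[-1] - found[0] + 1 != len(found)
    )


# <weird-document> = 1, <entry> = 2, <keyword> = 3
PARENT_OF = {1: None, 2: 1, 3: 2}

intact = [(2, "Each"), (2, "entry"), (2, "may"), (2, "contain"),
          (2, "exactly"), (3, "one"), (3, "keyword"), (2, "not"), (2, "two")]

# Hypothetical transfer output in which "one" moved away from "keyword".
reordered = [(3, "one")] + intact[:5] + intact[6:]
```

With these inputs, split_elements(intact, PARENT_OF) finds nothing, while split_elements(reordered, PARENT_OF) reports tag 3, i.e. the <keyword> element.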
Bottom line:
In my original mail I suggested using some scary black magic to
transfer the formatting information implicitly, so that existing
language pairs would not have to be altered. The explicit approach
should be cleaner in the long run, though, especially for new language
pairs. Once that is implemented using Tino's approach, my technique for
translating arbitrary XML formats could be built on top of it.
> 2nd thought: I would be begging you on my knees to make the effort to
> get people moving to implement the best idea :-)
>
> Yours, and thanks for bringing it up!
> Jacob
I totally get that! Where to start? How much is already done?
Benedikt