I am writing to you because, while working on one of the GCI tasks,
mlforcada and I discovered that it is very difficult to handle formatting
perfectly in Apertium with the current engine. In the following lines I
will explain my task, the procedure I used to try to solve it, and the
particular error.

Task URL:
http://www.google-melange.com/gci/task/view/google/gci2013/4560097506230272

The task consisted of finding a transfer rule that messes up word-processor
format. The first thing to do is find a file whose word-processor format
gets messed up when it is translated with Apertium. Then you need to isolate
the problem and reduce the file to the minimum that still causes it. Once
you have found one of these format errors, which are probably caused by
superblanks ending up in the wrong order, you need to find the rule that
causes the wrong order and repair it.

The first part (finding an example), which I did in an earlier task, only
consisted of finding an example and uploading it to the Apertium svn
repository. The file I chose is the one uploaded with the name "file6.odt".

First task URL:
http://www.google-melange.com/gci/task/view/google/gci2013/6154303467159552
Svn URL:
http://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-en-ca/dev/odt-tests/file6.odt

The following line shows the first example that I found. It's worth noting
that there are two different formats: the first one is not typeset in
italics and the second one is.

officer's *home* head

These two different formats are written with tags, as in HTML; therefore,
our problem can be expressed as:

<a>officer's<b>home<c>head<d>

To find more examples, we can run

echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca

and vary the words and structures. Once we have some examples, we need to
guess a general pattern that lets us reproduce the problem with similar
cases.

When we run apertium using the above command, we see that the tags in the
output are reordered, as the following examples show:

$ echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>verda de l'home.<b><d>

$ echo "<a>developers'<b>main<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>principal dels<b>desenvolupadors<d>

$ echo "<a>developer's<b>house<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa del<b>desenvolupador<d>

$ echo "<a>officer's<b>home<c>head<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa de l'agent.<b><d>

$ echo "<a>man's<b>chair<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>de cadira de l'home.<b><d>
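
The reordering in these outputs can also be checked mechanically. The
following is a minimal Python sketch and is not part of Apertium: the helper
names marker_order and is_reordered are hypothetical, and it assumes the
one-letter placeholder tags (<a>, <b>, ...) used in the examples above.

```python
import re

def marker_order(text):
    """Return the one-letter placeholder tags (<a>, <b>, ...) in order of appearance."""
    return re.findall(r"<([a-z])>", text)

def is_reordered(source, translation):
    """True if the tags appear in a different order in the translation than in the source."""
    return marker_order(source) != marker_order(translation)

# Source and output copied from the first example above.
source = "<a>man's<b>green<c>work<d>"
translation = "<a>La feina<c>verda de l'home.<b><d>"
print(marker_order(translation))          # ['a', 'c', 'b', 'd']
print(is_reordered(source, translation))  # True
```

A check like this only says that the order changed; it does not say which
rule or stage is responsible.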

Now that we have found more examples, we can guess the pattern that we are
looking for. What we find is that this specific problem happens in a phrase
like:

[noun + possessive s] + [adj] + [noun]

To simplify this, I will use the notation [A's] + [B] + [C]. In all of the
examples that show this problem, both [A's] and [B] modify [C].

To find out at which stage this happens, we run apertium with the modes
en-ca-chunker, where the tags have not yet been moved around, and
en-ca-interchunk. We find that the problem appears in the interchunk stage,
so if the problem is caused by a transfer rule, the rule will be in the
.t2x file.

Looking further into the problem, mlforcada told me to analyse the
following simplified sentence: "green<AAA>man</AAA>'s house". When we
translate this using

echo "green<AAA>man's</AAA>house" | apertium -f html-noent -d . en-ca

we get the same kind of reordering as in the other cases.

mode: en-ca
input: green<AAA>man</AAA>'s house
output: La casa de l'home</AAA><AAA>verd

To find out at which step of Apertium this problem appears, we change the
mode. The first mode to try is en-ca-chunker.

mode: en-ca-chunker
input: green<AAA>man</AAA>'s house
output: ^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$
^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$^punt<sent>{^.<sent>$}$

We can see four different chunks:

   - ^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$
   - ^pr<GEN>{}$
   - ^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$
   - ^punt<sent>{^.<sent>$}$
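
To spot this situation in other sentences, one could scan the chunker
output for format tags that sit between two lexical units inside the same
chunk. The following Python sketch is only an illustration: the function
name blanks_inside_chunks is hypothetical, and it assumes the simplified
stream exactly as printed in this message (no escaped characters, no nested
braces), which is not a full parser for the Apertium stream format.

```python
import re

# One chunk: ^name<tags>{ ...lexical units... }$
CHUNK = re.compile(r"\^([^<{]+)[^{]*\{(.*?)\}\$")
# Text between the end of one lexical unit ($) and the start of the next (^)
INNER_BLANK = re.compile(r"\$([^^$]*)\^")

def blanks_inside_chunks(stream):
    """Return (chunk name, [non-empty blanks found inside it]) for each affected chunk."""
    found = []
    for name, body in CHUNK.findall(stream):
        blanks = [b for b in INNER_BLANK.findall(body) if b.strip()]
        if blanks:
            found.append((name, blanks))
    return found

# Chunker output copied from above, joined into one string.
stream = ("^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$ "
          "^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$"
          "^punt<sent>{^.<sent>$}$")
print(blanks_inside_chunks(stream))  # [('Nom_adj', ['<AAA>'])]
```

Only the first chunk is reported, because </AAA> sits between two chunks
rather than inside one.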


If we read this carefully, we see that the tag <AAA> is inside the first
chunk. As a consequence, when this text is passed on to the interchunk
stage, that stage cannot do anything to fix the phrase, because it cannot
move or remove the <AAA> that is inside the chunk. We can see this in the
output of the interchunk mode, shown in the following lines.

mode: en-ca-interchunk
input: green<AAA>man</AAA>'s house
output: ^Nom<SN><PDET><f><sg>{^casa<n><3><4>$}$
^pr<PREP>{^de<pr>$}$</AAA>^nom_adj<SN><PDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$^punt<sent>{^.<sent>$}$

As we can see, the first chunk, which corresponds to "green<AAA>man", is
moved after the closing tag </AAA>. So the opening tag <AAA>, which travels
inside the chunk, now comes after </AAA>, and this is what makes the tagging
invalid. As the tag is inside the chunk, the interchunk stage can't do
anything to change the position of the tag, so it's impossible to solve this
problem without changing the engine.
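
The invalid tagging can be confirmed with a simple nesting check. This is a
minimal Python sketch, not something Apertium provides: the function name
tags_well_nested is hypothetical, and it assumes plain tags without
attributes, as in the examples of this message.

```python
import re

def tags_well_nested(text):
    """True if every closing tag matches the most recent unclosed opening tag."""
    stack = []
    for closing, name in re.findall(r"<(/?)([A-Za-z]+)>", text):
        if not closing:
            stack.append(name)          # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                # closing tag with no matching opener
    return not stack                    # leftover openers are also invalid

print(tags_well_nested("green<AAA>man</AAA>'s house"))       # True
print(tags_well_nested("La casa de l'home</AAA><AAA>verd"))  # False
```

The input passes and the en-ca output fails, because </AAA> is seen before
any <AAA> has been opened.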

The conclusion is that the task is impossible to complete as stated: this
problem can't be solved by changing a rule, because it is a deeper problem
in the engine.


*Gabriel Esteban (@galaxyfeeder)*
*5th January 2014*
_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
