Dear Apertiumers, dear Gabriel:
thanks a lot for this information. This adds to a discussion we had in the channel before, and there are already some outstanding proposals to solve this which I haven't been able to pay enough attention to, absorbed as I was by Google Code-In tasks.

We should look into this closely (and check how other systems deal with it), as it is a crucial issue.

Thanks again
Mikel


On 01/05/2014 12:34 PM, Gabriel Esteban Gullón wrote:
I am writing to you because, while I was working on one of the GCI tasks, mlforcada and I discovered that it is very difficult to handle format perfectly in Apertium with the current engine. In the following lines I will explain my task, the procedure that I used to try to solve it, and the particular error.

Task URL: http://www.google-melange.com/gci/task/view/google/gci2013/4560097506230272

The task consisted of finding a transfer rule that messes up wordprocessor format. The first thing one needs to do for this task is to find a file whose wordprocessor format is messed up when it is translated using apertium. Then you need to isolate the problem and try to arrive at the minimum file that causes it. When you find one of these format errors, which are probably caused by superblanks ending up in the wrong order, you need to find the rule that causes the wrong order and repair it.

The first part of the task (finding an example), which I did in another task, only consisted of finding an example and uploading it to the Apertium svn repository. The file that I chose for this task is the one uploaded under the name "file6.odt".

First task URL: http://www.google-melange.com/gci/task/view/google/gci2013/6154303467159552
Svn URL: http://sourceforge.net/p/apertium/svn/HEAD/tree/trunk/apertium-en-ca/dev/odt-tests/file6.odt

The following line shows the first example that I found. It's worth noting that it contains two different formats: the first one is not typeset in italics and the second one is.

officer's /home/ head

These two different formats are written with tags, as in HTML; therefore, our problem can be expressed as:

<a>officer's<b>home<c>head<d>

To try to find more examples, we would run

echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca

and we would try to find more examples by changing words and structures. Once we have found some, we need to guess some kind of rule that allows us to reproduce the problem in similar cases.

When we run apertium using the above command, we see that the tags in the output are reordered, as the following examples show:

$ echo "<a>man's<b>green<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>verda de l'home.<b><d>

$ echo "<a>developers'<b>main<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>principal dels<b>desenvolupadors<d>

$ echo "<a>developer's<b>house<c>chief<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa del<b>desenvolupador<d>

$ echo "<a>officer's<b>home<c>head<d>" | apertium -f html-noent -d . en-ca
<a>El cap<c>de casa de l'agent.<b><d>

$ echo "<a>man's<b>chair<c>work<d>" | apertium -f html-noent -d . en-ca
<a>La feina<c>de cadira de l'home.<b><d>
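Checking the tag order by eye gets tedious with many examples. The following is only a minimal sketch in Python, not part of the task or of Apertium: the helper names tag_sequence and tags_reordered are made up for illustration, and the regular expression assumes the simple placeholder tags (<a>, <b>, ...) used in the examples above.

```python
import re

# Matches the simple placeholder tags used in these examples, e.g. <a> or </AAA>.
# This is an assumption for the sketch; real wordprocessor tags are more complex.
TAG = re.compile(r"</?[A-Za-z]+>")

def tag_sequence(text):
    """Return the format tags of `text` in the order in which they appear."""
    return TAG.findall(text)

def tags_reordered(source, translation):
    """True if both strings carry the same tags, but in a different order."""
    src, out = tag_sequence(source), tag_sequence(translation)
    return sorted(src) == sorted(out) and src != out

# Input/output pairs copied from the apertium runs above.
pairs = [
    ("<a>man's<b>green<c>work<d>", "<a>La feina<c>verda de l'home.<b><d>"),
    ("<a>developers'<b>main<c>chief<d>", "<a>El cap<c>principal dels<b>desenvolupadors<d>"),
    ("<a>developer's<b>house<c>chief<d>", "<a>El cap<c>de casa del<b>desenvolupador<d>"),
    ("<a>officer's<b>home<c>head<d>", "<a>El cap<c>de casa de l'agent.<b><d>"),
    ("<a>man's<b>chair<c>work<d>", "<a>La feina<c>de cadira de l'home.<b><d>"),
]

for source, translation in pairs:
    print(tag_sequence(translation), tags_reordered(source, translation))
```

In all five pairs the tags <b> and <c> swap places, which is the reordering described above.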

So now that we have found more examples, we can guess the rule that we are looking for. What we find is that this specific problem happens in a phrase like:

[noun + possessive s] + [adj] + [noun]

To simplify this, I will use the following notation: [A's] + [B] + [C]. In all of the examples that have this problem, [B] and [A's] are both complements of [C].

To find out in which stage this happens, we run apertium with the modes en-ca-chunker, where the tags are not moved around, and en-ca-interchunk. We find that the problem appears in the interchunk, so if the problem is caused by a transfer rule, that rule will be in the .t2x file.

Going further into the problem, mlforcada told me to analyse the following simplified sentence: "green<AAA>man</AAA>'s house". When we translate this using

echo "green<AAA>man's</AAA>house" | apertium -f html-noent -d . en-ca

what one gets shows the same problem as in the other cases.

mode: en-ca
input: green<AAA>man</AAA>'s house
output: La casa de l'home</AAA><AAA>verd
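That this output is invalid can also be checked mechanically. The following Python sketch is an illustration only (is_well_nested is a made-up helper, and it assumes simple tags without attributes, like <AAA>): it walks through the tags with a stack and rejects any closing tag that has no matching opening tag before it.

```python
import re

# Captures an optional "/" (closing tag) and the tag name, e.g. <AAA> or </AAA>.
# Assumes simple tags without attributes, as in this example.
TAG = re.compile(r"<(/?)([A-Za-z]+)>")

def is_well_nested(text):
    """True if every closing tag matches the most recent unclosed opening tag."""
    stack = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)            # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                  # closing tag with nothing (or the wrong tag) open
    return not stack                      # nothing may be left open at the end

print(is_well_nested("green<AAA>man</AAA>'s house"))       # True
print(is_well_nested("La casa de l'home</AAA><AAA>verd"))  # False
```

The input passes the check, while the translated output fails it because </AAA> appears before <AAA>.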

To find out in which step of apertium this problem appears, we change the mode. The first mode that we try is en-ca-chunker.

mode: en-ca-chunker
input: green<AAA>man</AAA>'s house
output: ^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$ ^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$^punt<sent>{^.<sent>$}$

What we can see are four different chunks:

  * ^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$*<AAA>*^verd<adj><3><4>$}$

  * ^pr<GEN>{}$

  * ^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$

  * ^punt<sent>{^.<sent>$}$
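The same split can be reproduced with a few lines of Python. This is a rough sketch that works only on the chunker output exactly as printed in this message (where format tags appear as plain <AAA> text); the regular expressions and the name split_chunks are assumptions made for illustration, not Apertium code.

```python
import re

# A chunk as printed above: ^name<tag>...{ lexical units and blanks }$
CHUNK = re.compile(r"\^([^<{^$]+)((?:<[^<>]+>)*)\{(.*?)\}\$")
# A lexical unit inside a chunk: ^lemma<tag>...$
LEXICAL_UNIT = re.compile(r"\^.*?\$")

def split_chunks(stream):
    """Return (chunk name, chunk body, text left inside the chunk once
    its lexical units are removed) for every chunk in the stream."""
    result = []
    for match in CHUNK.finditer(stream):
        name, _tags, body = match.groups()
        leftover = LEXICAL_UNIT.sub("", body).strip()
        result.append((name, body, leftover))
    return result

stream = ("^Nom_adj<SN><UNDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$ "
          "^pr<GEN>{}$</AAA>^nom<SN><UNDET><f><sg>{^casa<n><3><4>$}$"
          "^punt<sent>{^.<sent>$}$")

for name, body, leftover in split_chunks(stream):
    print(name, "| trapped inside:", repr(leftover))
```

Only the first chunk reports something left inside it besides lexical units, namely the <AAA> tag.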


If we read this carefully, we see that the tag <AAA> is included inside the first chunk. As a result, when this text is passed on to the interchunk, the interchunk can't do anything to correct the phrase and take out the <AAA> that is inside the chunk. We can see this in the output of the interchunk mode, which is shown in the following lines.

mode: en-ca-interchunk
input: green<AAA>man</AAA>'s house
output: ^Nom<SN><PDET><f><sg>{^casa<n><3><4>$}$ ^pr<PREP>{^de<pr>$}$</AAA>^nom_adj<SN><PDET><m><sg>{^home<n><3><4>$<AAA>^verd<adj><3><4>$}$^punt<sent>{^.<sent>$}$

As we can see, the first chunk, which contains "green<AAA>man's", is moved after the closing tag and the other chunks. So the chunk that contains the <AAA> tag now comes after the </AAA> tag, and this is what makes the tagging invalid. As the tag is inside the chunk, the interchunk can't do anything to move the position of the tag, so it's impossible to solve this problem without changing the engine.

The conclusion is that the task is impossible to complete, because this problem can't be solved by changing a rule. It's a major problem.
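The mechanism can be illustrated with a very simplified model in Python. It is only a sketch under two assumptions: chunks are treated as opaque strings that may be reordered, and the blanks between chunks keep their positions, which is what the interchunk output above suggests. The short chunk labels and the function interleave are made up for this illustration.

```python
def interleave(chunks, blanks):
    """Rebuild a stream as chunk, blank, chunk, blank, ..., chunk."""
    out = [chunks[0]]
    for blank, chunk in zip(blanks, chunks[1:]):
        out.append(blank)
        out.append(chunk)
    return "".join(out)

# Simplified stand-ins for the three chunks of the example (hypothetical labels).
# The <AAA> tag is trapped inside the first chunk; </AAA> is a blank between chunks.
chunks = ["{home<AAA>verd}", "{GEN}", "{casa}"]
blanks = [" ", "</AAA>"]

before = interleave(chunks, blanks)
# The interchunk rule reverses the chunk order, but the blanks stay where they are.
after = interleave([chunks[2], chunks[1], chunks[0]], blanks)

print(before)  # {home<AAA>verd} {GEN}</AAA>{casa}
print(after)   # {casa} {GEN}</AAA>{home<AAA>verd}
```

In this model no reordering of whole chunks can put <AAA> back in front of </AAA> unless the tag can be taken out of its chunk, which is the limitation described above.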


/Gabriel Esteban (@galaxyfeeder)/
/5th January 2014/


_______________________________________________
Apertium-stuff mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/apertium-stuff


--
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
